Table 1

Statistics of our training and test sets. The out-of-domain data consists mostly of long sentences, which poses a challenge for short-text LID in query scenarios. The synthetic in-domain data, acquired through data enhancement, helps fill this domain gap.

Dataset                Sentences  Tokens per sentence  Characters per sentence
Train  Out-of-Domain   42M        13.05                72.27
       In-Domain       42M        2.92                 18.32
Test   QID-21          21,440     2.56                 15.53
       KB-21           2,100      4.47                 34.90