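Statistics of our training and test sets. The out-of-domain training data consists mostly of long sentences (13.05 tokens per sentence on average), which poses a challenge for short-text LID in query scenarios, where the test sets average only 2.56 (QID-21) and 4.47 (KB-21) tokens per sentence. The synthetic in-domain data obtained through data augmentation (2.92 tokens per sentence) helps close this domain gap.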
| Split | Dataset | Sentences | Tokens per sentence | Characters per sentence |
|---|---|---|---|---|
| Train | Out-of-Domain | 42M | 13.05 | 72.27 |
| | In-Domain | 42M | 2.92 | 18.32 |
| Test | QID-21 | 21,440 | 2.56 | 15.53 |
| | KB-21 | 2,100 | 4.47 | 34.90 |
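As a rough illustration of how per-sentence length statistics like those in the table might be computed, here is a minimal sketch. It assumes whitespace tokenization and counts characters including spaces; the source does not state which tokenizer or counting convention was used, so the function name `length_stats` and the sample queries are hypothetical.

```python
def length_stats(sentences):
    """Return (sentence count, mean tokens per sentence, mean characters per sentence)."""
    n = len(sentences)
    if n == 0:
        return 0, 0.0, 0.0
    # Assumption: tokens are whitespace-separated; the paper's tokenizer may differ.
    total_tokens = sum(len(s.split()) for s in sentences)
    # Assumption: character counts include spaces.
    total_chars = sum(len(s) for s in sentences)
    return n, total_tokens / n, total_chars / n


# Hypothetical short queries, for illustration only (not taken from QID-21 or KB-21).
queries = ["weather today", "nike shoes", "how to cook rice"]
n, tokens_per_sentence, chars_per_sentence = length_stats(queries)
print(f"{n} sentences, {tokens_per_sentence:.2f} tokens/sentence, "
      f"{chars_per_sentence:.2f} characters/sentence")
```

Whitespace splitting is not meaningful for languages written without spaces, so a language-aware tokenizer would be needed to reproduce statistics over a multilingual corpus.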