Compositional Evaluation on Japanese Textual Entailment and Similarity

Abstract
Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models. Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English. In particular, there are no available multilingual NLI/STS datasets in Japanese, which is typologically different from English and can shed light on the currently controversial behavior of language models in matters such as sensitivity to word order and case particles. Against this background, we introduce JSICK, a Japanese NLI/STS dataset that was manually translated from the English dataset SICK. We also present a stress-test dataset for compositional inference, created by transforming syntactic structures of sentences in JSICK to investigate whether language models are sensitive to word order and case particles. We conduct baseline experiments on different pre-trained language models and compare the performance of multilingual models when applied to Japanese and other languages. The results of the stress-test experiments suggest that the current pre-trained language models are insensitive to word order and case marking.


Introduction
Natural Language Inference (NLI) (Dagan et al., 2006; Bowman et al., 2015) and Semantic Textual Similarity (STS) (Agirre et al., 2016) tasks are well-positioned to serve as a basic benchmark for natural language understanding. With the recent progress of deep neural networks including pre-trained language models such as BERT (Devlin et al., 2019), the development of benchmark datasets has centered on large crowdsourced English datasets, such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). Since there has been an increasing need for benchmark datasets in linguistic universals (Linzen, 2020), general language understanding frameworks including NLI and STS for languages other than English have been provided (Conneau et al., 2018; Liang et al., 2020; Le et al., 2020; Shavrina et al., 2020; Xu et al., 2020; Seelawi et al., 2021; Park et al., 2021).
Another recent line of work has investigated whether models are sensitive to shuffled word order, but the conclusions are controversial (Ravfogel et al., 2019; Sinha et al., 2021a,b; Gupta et al., 2021; Pham et al., 2021; White and Cotterell, 2021). One characteristic of human-like language understanding is that humans can understand sentences according to their word meanings and syntactic structures, then recognize their semantic relationships (Frege, 1963; Katz and Fodor, 1963; Montague, 1973; Janssen and Partee, 1997). Since previous work has demonstrated the usefulness of analyzing the generalization ability of models in challenging NLI in English (Naik et al., 2018; Glockner et al., 2018; McCoy et al., 2019; Rozen et al., 2019; Goodwin et al., 2020; Yanaka et al., 2021), we should continue this line of research in other languages.
Against this background, we provide a Japanese NLI/STS dataset to analyze language models in compositional inference across languages. Our motivations for focusing on Japanese are two-fold.
First, Japanese is a high-resource language that has typologically different characteristics from English (Joshi et al., 2020), yet it has not been included in previous cross-lingual (Real et al., 2018; Hu et al., 2020; Ham et al., 2020; Wijnholds and Moortgat, 2021) or multilingual (Conneau et al., 2018) NLI datasets. This raises the question of whether models perform inference differently in Japanese and other languages.
Second, Japanese has case markers and free word order (Hinds, 1986; Shibatani, 1990), phenomena that pose interesting challenges for multilingual NLI. While shuffling data usually changes its meaning, the meaning of a Japanese sentence can be preserved even when the order of noun phrase (NP) arguments is swapped. By analyzing model behavior with scrambling phenomena that preserve case relations in a sentence and particle-swapping phenomena that change case relations, we can analyze whether the model can distinguish transformations that change sentence meanings and perform compositional inferences.
This paper has three contributions.
First, we provide JSICK as a compositional Japanese NLI/STS dataset by manually translating the English SICK dataset (Marelli et al., 2014). Compared with recent crowdsourced NLI datasets, SICK facilitates identification of which compositional linguistic phenomena are key to a given inference. Such a controlled structure is suited to transforming sentences for further analyses of model behavior. In addition, SICK has been translated into non-English languages (Real et al., 2018; Wijnholds and Moortgat, 2021), allowing cross-language comparisons on a sizeable parallel NLI corpus.
Second, we create a stress-test dataset for JSICK to investigate whether language models capture word order and case particles in Japanese. We created the stress-test dataset by transforming syntactic structures of JSICK sentence pairs, where we analyze whether models consider word order and case particles when predicting entailment labels and similarity scores.
Third, for the baseline evaluation of pre-trained language models, we compare performance between different pre-trained language models on JSICK. We also compare the performance of multilingual pre-trained language models on SICK datasets of different languages, including JSICK. We also provide an in-depth analysis of sensitivity to word order and case particles based on the JSICK stress-test dataset. The analysis results suggest that both Japanese and multilingual models are surprisingly inattentive to word order and case marking. Our dataset will be publicly available at https://github.com/verypluming/JSICK.

Related Work
Standard NLI benchmarks have been mainly developed for English. Recently, large crowdsourced NLI datasets derived from image captions, such as SNLI (Bowman et al., 2015), and those targeting multi-genre sentences, like MultiNLI (Williams et al., 2018), have been widely used to evaluate neural models.
For linguistics-oriented datasets, FraCaS (Cooper et al., 1994) is a manually collected NLI test set involving linguistic phenomena studied in formal semantics, and SICK (Marelli et al., 2014) is a larger and more naturalistic NLI/STS dataset made from captions focusing on compositional inference. Unlike SNLI and MultiNLI, SICK was designed by linguistic experts so as to not require dealing with aspects beyond the scope of compositional inference (e.g., world knowledge, named entities, and multiword expressions) but to cover a variety of combinations of lexical, syntactic, and semantic phenomena. The SICK dataset thus allows systematic assessments of the reasoning ability of models on compositional inference. For STS, the SemEval 2012-2017 (Agirre et al., 2016; Cer et al., 2017) competitions provided English, Arabic, and Spanish STS datasets including SICK.
With the development of multilingual pre-trained language models, general language understanding frameworks for languages other than English have been created (Liang et al., 2020; Le et al., 2020; Shavrina et al., 2020; Xu et al., 2020; Seelawi et al., 2021; Park et al., 2021), and NLI datasets have been multilingualized. Conneau et al. (2018) provided a cross-lingual NLI (XNLI) corpus by translating MultiNLI into 15 languages, including languages with few language resources such as Swahili and Urdu and languages with flexible word order such as Russian and German. Ham et al. (2020) translated MultiNLI into Korean to create KorNLI. However, Japanese is not included in these datasets. In addition, since sentences in MultiNLI are usually longer than those in SICK and contain multiword expressions beyond the scope of compositional inference, it is unrealistic to carefully transform syntactic structures of sentence pairs in XNLI to create a stress-test dataset like ours.
As examples of other non-English datasets, OCNLI (Hu et al., 2020) is a Chinese NLI dataset built from original multi-genre resources. FarsTail (Amirkhani et al., 2020) is a Persian NLI dataset containing sentences from university exams. There have also been attempts to translate the SICK dataset into Portuguese (Real et al., 2018) and Dutch (Wijnholds and Moortgat, 2021), so our Japanese SICK dataset will contribute to a multilingual SICK dataset that will allow controlled, cross-lingual analyses of the compositional abilities of language models. Regarding Japanese NLI datasets, a Japanese SNLI dataset (Yoshikoshi et al., 2020) was constructed by using machine translation to translate the English SNLI dataset into Japanese and automatically filtering out unnatural sentences, but methods that employ machine translation are still problematic in that they can produce unnatural sentences. The Japanese Realistic Textual Entailment Corpus (Hayashibe, 2020) is a crowdsourced dataset containing Japanese hotel reviews. However, linguistic phenomena in these Japanese datasets demonstrate limited diversity because the sentences they contain are restricted to simple structures. Kawazoe et al. (2017) provided JSeM, a manually curated test set including a Japanese version of FraCaS to diagnose inference systems from a formal semantics perspective. We produced an NLI dataset by asking experts to translate the SICK dataset into Japanese, thus maintaining both sentence naturalness and compositions of linguistic phenomena.
While recent works (Sinha et al., 2021a,b; Gupta et al., 2021; Pham et al., 2021; Hessel and Schofield, 2021) have shown that pre-trained language models are insensitive to word order on permuted English datasets in the standard natural language understanding benchmark GLUE (Wang et al., 2019) including NLI, other works have analyzed sentence perplexity with varying word orders and shown controversial results regarding inductive biases for word order in different languages (Ravfogel et al., 2019; White and Cotterell, 2021). For six languages, including Japanese, Yang et al. (2019) evaluated whether multilingual BERT captures word order in the translated PAWS dataset (Zhang et al., 2019), involving adversarial paraphrase identification pairs whose sentences share words but differ in word order. Experiments showed that BERT performance for Japanese is consistently worse than that for Indo-European languages. Our study deepens insight into the causes of performance differences by stress-test evaluation of NLI and STS tasks through careful manipulation of case markers, which are more challenging tasks than are two-class paraphrase identification tasks. Kuribayashi et al. (2021) reexamined the general hypothesis that language models with lower perplexity are more human-like in Japanese than in English, and the results have shown the necessity of evaluating models across languages. Analyzing the model behavior, rather than perplexity, in transformed Japanese inference should provide new insights into the model's sensitivity to word order.

Translation
The original SICK dataset uses 6,077 sentences to provide 9,927 sentence pairs (A, B). To create the JSICK dataset, we first asked an expert translator to translate the 6,077 English sentences in SICK into Japanese. The translator did not see entailment labels, instead just translating a list of English sentences sorted alphabetically. The translations were independently validated by English-Japanese bilinguals, and no examples were discarded. To avoid changing sentence meaning during translation, we prepared translation guidelines and asked the translator to translate English sentences into natural Japanese while maintaining diversity in lexical, syntactic, and semantic phenomena such as hypernym-hyponym relations, active-passive alternations, and quantification in the original English sentences. Note that sentences in the JSICK dataset contain some translations that are unnatural due to cultural factors, but reflecting culture in translation is beyond the scope of analyzing the compositional inference ability of models.
The guidelines explain how to translate linguistic phenomena, including indefinite and definite articles, singular and plural nouns, passive verbs, negation, and quantification. We also asked the translator to try to keep word orders as consistent as possible with the original sentences. Some instructions from our guidelines are given in detail below.
Indefinite/definite articles
The following describes our instructions regarding the distinction between indefinite and definite articles. The distinction between indefinite and definite articles is an important phenomenon that affects interpretations of quantification (Hawkins, 1978; Heim, 1982). However, since Japanese does not have articles (Hinds, 1986; Shibatani, 1990), it is not obvious how to translate indefinite articles as in (1) and definite articles as in (2) into Japanese.
We therefore translated subject NPs as bare noun phrases, using the particle が (ga) when translating the nominative case involving an indefinite article, and using the particle (topic marker) は (wa) when translating the nominative case involving a definite article. Since the majority of sentences in the SICK dataset are episodic, we can correctly translate English sentences into Japanese by the above rule.
(1) 男性 man The man is not playing a guitar'

Singular and plural nouns
In examples like (3), we can translate plural nouns by adding a plurality suffix such as たち (-tachi). However, Japanese does not have a general way to form plural words like the -s suffix in English. Thus, as in (4), we prioritized sentence naturalness by not adding the plural suffix たち to the accusative case of words like エビ (shrimps).

Validation
There are issues with the gold labels in the original SICK dataset (Bowman et al., 2015; Kalouli et al., 2017). In addition, translation from English to Japanese can change the appropriateness of the entailment label used in English (see Section 3.4). Thus, instead of using the original gold labels, we used the crowdsourcing platform Lancers to reannotate entailment labels and similarity scores for JSICK. Definitions of entailment labels (entailment, contradiction, and neutral) and similarity scores in a range of 1 (completely unrelated) to 5 (very related) for a pair (A, B) of sentences are the same as those for the original SICK. In our instructions, we noted that sentences A and B describe the same situation or event to avoid any indeterminacy of event and entity coreference that might cause inconsistencies in contradiction labels (Bowman et al., 2015). The annotators were six native Japanese speakers, randomly selected from the crowdsourcing platform. The authors annotated gold labels for ten examples from the JSICK trial set (500 examples) to provide ten test questions. We asked the annotators to study the guidelines until they could assign the same labels as the gold labels for all ten test questions. We adopted the labels agreed upon by a majority vote as gold entailment labels and adopted the average of the annotated scores as gold similarity scores. For entailment labels, the authors also manually checked whether the majority-vote judgment was semantically valid for each example. Since recent works (Pavlick and Kwiatkowski, 2019; Gantt et al., 2020) have demonstrated the importance of information for modeling disagreements in NLI datasets, we will publicly release the raw annotations with the JSICK dataset.
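The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `aggregate` and the sample annotations are hypothetical, and it assumes each pair receives three entailment labels and three similarity scores.

```python
from collections import Counter
from statistics import mean


def aggregate(labels, scores):
    """Return (majority label or None, mean similarity score, size of largest group)."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    # A strict majority requires more than half of the annotators to agree.
    majority = top_label if top_count > len(labels) / 2 else None
    return majority, mean(scores), top_count


# Hypothetical annotations for a single sentence pair (not taken from JSICK).
label, score, agreement = aggregate(
    ["entailment", "entailment", "neutral"], [4, 5, 4]
)
```

When no label reaches a majority, the function returns `None` for the label, which marks the pair for a separate resolution step rather than choosing arbitrarily.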
The average annotation time was 1 min per pair, and Krippendorff's alpha for the entailment labels was 0.65. There were 6,957 cases (70.1%) in which three annotators assigned the same entailment labels, 2,922 cases (29.5%) in which two annotators assigned the same entailment labels, and 48 cases (0.4%) in which all three annotators assigned different labels. For cases where all three annotators assigned different labels, the labels were determined by the consen-

Linguistic tags
To analyze the ability of models to capture various linguistic phenomena, we annotated the JSICK dataset with linguistic phenomenon tags. We provided a set of nine linguistic tags for linguistic phenomena: numerals, negation, quantification, passive voice, anaphora, conjunction, disjunction, modal, and additive particle. We automatically annotated premise-hypothesis pairs with multiple tags, using Janome to process each premise and hypothesis sentence for morphological analysis and part-of-speech tagging. If the morphological analysis of either sentence includes phrase patterns related to a linguistic tag, the premise-hypothesis pair is annotated with that tag.
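The tagging rule (a pair receives a tag if either sentence matches a related pattern) might look like the sketch below. The patterns in `TAG_PATTERNS` are hypothetical stand-ins, since the actual phrase patterns are not listed here, and the input is assumed to be already tokenized into (surface form, part-of-speech) tuples of the kind a morphological analyzer such as Janome produces. Only three of the nine tags are shown.

```python
# Hypothetical surface/POS patterns; the real JSICK patterns are not given in the text.
TAG_PATTERNS = {
    "negation": lambda surface, pos: surface in {"ない", "ず", "ぬ"} and pos.startswith("助動詞"),
    "numerals": lambda surface, pos: pos.startswith("名詞,数"),
    "additive particle": lambda surface, pos: surface == "も" and pos.startswith("助詞"),
}


def tag_sentence(tokens):
    """Return the set of tags whose pattern matches at least one token."""
    return {
        tag
        for tag, matches in TAG_PATTERNS.items()
        if any(matches(surface, pos) for surface, pos in tokens)
    }


def tag_pair(premise_tokens, hypothesis_tokens):
    """A pair receives a tag if either of its sentences matches that tag."""
    return sorted(tag_sentence(premise_tokens) | tag_sentence(hypothesis_tokens))


# Illustrative pre-tokenized sentences (assumed analyzer output, simplified POS strings).
premise = [("二", "名詞,数"), ("人", "名詞,接尾"), ("の", "助詞,連体化"),
           ("男性", "名詞,一般"), ("が", "助詞,格助詞"), ("歩い", "動詞,自立"),
           ("て", "助詞,接続助詞"), ("いる", "動詞,非自立")]
hypothesis = [("男性", "名詞,一般"), ("は", "助詞,係助詞"), ("歩い", "動詞,自立"),
              ("て", "助詞,接続助詞"), ("い", "動詞,非自立"), ("ない", "助動詞")]
tags = tag_pair(premise, hypothesis)
```

Here the numeral comes from the premise and the negation from the hypothesis, so the pair carries both tags.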
Table 1 shows examples of linguistic tagging in the JSICK dataset, and Table 2 shows the distribution of linguistic tags. Table 5 shows the results of comparing the percentage of linguistic tags in the JSICK test data and the two existing large Japanese NLI datasets mentioned in Section 2, Japanese SNLI (JSNLI) and the Japanese Realistic Textual Entailment Corpus (JRTEC). Compared with previous datasets, JSICK contains more linguistic phenomena, including numerals, negation, quantification, passive voice, anaphora, and disjunction. This indicates that the distribution of linguistic phenomena in the JSICK dataset is well balanced.

Dataset
Table 3 shows that the distribution of JSICK dataset gold labels is almost the same as that for the English SICK dataset. The distribution of JSICK sentence pairs across NLI and STS tasks (Table 4) also follows the same trend as in SICK; similarity scores for the entailment and contradiction cases tend to be in the range of 3 to 5, while neutral similarity scores are more broadly distributed.
The most common cases where translation changed the entailment label from that in the original SICK dataset were those where the labels are changed to neutral. There were 242 such examples, due to grammatical differences between English and Japanese. Table 6 shows some typical examples. One major grammatical difference that can change entailment labels is the distinction between singular and plural NPs. In English, the plural form mushrooms explicitly indicates that there is more than one mushroom. By contrast, there is no grammatical singular-plural marking in Japanese (Nakanishi and Tomioka, 2004), so the bare noun キノコ ("mushroom") can be interpreted as either singular or plural. This caused split entailment judgments among the annotators. Other types of discrepancy are due to various lexical gaps. For instance, in Lexical Gap Example B in Table 6, the English word man can be applied to both men and women, while its natural counterpart (男性) in Japanese does not have such a generic meaning. As a result, the entailment label for this example is neutral, rather than entailment.

Experimental Setup
In this study, we experimented with two Japanese pre-trained language models: Japanese BERT (jaBERT; Devlin et al., 2019) pre-trained on Japanese Wikipedia, and Japanese RoBERTa (jaRoBERTa; Liu et al., 2019) pre-trained on Japanese Wikipedia and the Japanese portion of CC-100: Monolingual Datasets from Web Crawl Data (Conneau et al., 2020).
For jaBERT, we investigated performance differences between the BERT-base model pre-trained with 17 million sentences from Wikipedia articles and the BERT-large model pre-trained with 30 million sentences. The configuration was the same as that for the original BERT model. To check whether methods for tokenization and masked language modeling (MLM) affect model performance, we compared three settings for the BERT-base model. In the (SUBWORD) setting, the model processes input texts with word-level tokenization by the MeCab morphological parser with a standard Japanese dictionary IPAdic (Asahara and Matsumoto, 2003), followed by WordPiece subword tokenization (Schuster and Nakajima, 2012). The vocabulary size was 32,000. In the (WHOLE) setting, the subword model was trained with whole-word masking enabled for the MLM objective. In the (CHAR) setting, the model processed texts with word-level tokenization based on the IPAdic, followed by character-level tokenization.
For jaRoBERTa, we compared the performance of the base model and the large model. The input text was segmented into words by the Japanese morphological analyzer Juman++ (Morita et al., 2015; Tolmachev et al., 2018), and each word was tokenized using SentencePiece.
We also analyzed differences in behaviors of the Japanese and multilingual pre-trained language models. As multilingual models, we used the multilingual BERT model (mBERT) trained with multilingual Wikipedia and the XLM-RoBERTa-base and XLM-RoBERTa-large models (Conneau et al., 2020) pre-trained on CC-100 containing 100 languages. For mBERT, we used a multilingual cased model, as is recommended for languages with non-Latin alphabets, like Japanese. For each setting, we tuned over learning rates of 2e-5, 3e-5, and 5e-5 and over 3, 4, and 5 training epochs to find the best parameters.
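The search space stated above (three learning rates crossed with three epoch counts) gives nine configurations per model. A minimal sketch of enumerating and selecting among them is shown below; `select_best` and its `evaluate` callback are hypothetical names, and how each configuration was actually scored is not specified in the text.

```python
from itertools import product

LEARNING_RATES = [2e-5, 3e-5, 5e-5]
EPOCHS = [3, 4, 5]

# Every combination of learning rate and number of training epochs.
grid = [
    {"learning_rate": lr, "num_epochs": ep}
    for lr, ep in product(LEARNING_RATES, EPOCHS)
]


def select_best(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callback that fine-tunes a model with the
    given configuration and returns a validation score (higher is better).
    """
    return max(configs, key=evaluate)
```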
For the NLI task, to investigate whether the size and quality of fine-tuning data affect performance, we fine-tuned pre-trained models on three types of training data: (i) JSICK training data (5K), (ii) JSNLI training data (533K), and (iii) both JSICK and JSNLI training data (538K). As mentioned in Section 2, JSNLI is a machine-translated Japanese SNLI dataset. Since both SICK and SNLI are derived from image captions, we hypothesized that JSNLI might improve model performance on the JSICK test set. We used four standard evaluation metrics for NLI tasks: precision (Prec), recall (Rec), macro F1-score (F1), and accuracy (Acc). To analyze whether entailment labels are learned and predicted by referring only to hypothesis sentences, we investigated the performance of models trained on JSICK without the premise sentences. We performed five runs and present the averages below. We also report standard deviations for the accuracy of baseline results in the NLI task. As the baseline for the STS task, we used BERTScore (Zhang et al., 2020), a recent BERT-based model for unsupervised STS, and computed the Pearson correlation coefficient, Spearman correlation coefficient, and mean squared error (MSE) between its predictions and the gold similarity scores.
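For reference, the metrics named above can be computed as in the following sketch. It assumes that precision and recall are macro-averaged over the three labels in the same way as F1, which the text does not state explicitly. The Spearman coefficient is omitted for brevity; it is the Pearson coefficient applied to the ranks of the scores.

```python
from math import sqrt

LABELS = ("entailment", "contradiction", "neutral")


def nli_metrics(gold, pred, labels=LABELS):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(labels)
    return {
        "acc": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "prec": sum(c[0] for c in per_class) / n,
        "rec": sum(c[1] for c in per_class) / n,
        "f1": sum(c[2] for c in per_class) / n,
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mse(x, y):
    """Mean squared error between predicted and gold scores."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
```

Macro averaging weights each label equally, so a model that never predicts contradiction is penalized even if contradiction pairs are rare.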

Baseline results
Table 7 shows the evaluation results for NLI models. For all models, the accuracy on JSICK is lower than that on JSNLI, indicating that JSICK poses more challenges than does JSNLI. Since performance under the hypothesis-only setting was low, entailment labels in JSICK cannot be reliably predicted from hypotheses alone.
In the standard train/test split setting for JSICK, the jaRoBERTa-large model achieved the best accuracy (90.3%). Surprisingly, multilingual models such as XLM-RoBERTa-large and mBERT achieved comparable accuracy (89.1% and 89.2%, respectively). Among the multilingual models, the mBERT model had the best performance. For the jaRoBERTa, jaBERT, and XLM-RoBERTa models, those trained on larger texts achieved higher accuracies on NLI tasks. Among the tokenization settings for jaBERT, whole-word masking (WHOLE) provided the highest accuracy (82.4%). Regarding fine-tuning data, mixing the JSICK and JSNLI training sets improved model performance for the JSICK test set for all models except jaRoBERTa-large. Since the jaRoBERTa-large model trained with a single training set (JSICK or JSNLI) already demonstrated high performance, additional training data did not improve performance.
Table 8 shows the results from the unsupervised STS model. Interestingly, mBERT achieved nearly the same high performance as did jaBERT on the STS task. Among different tokenization settings for jaBERT, the character-based tokenization (CHAR) produced the highest performance. This is due to the difference between NLI and STS tasks. Similarity scores are affected by the token overlap between two sentences, as suggested by the fact that the contradiction cases tended to have higher similarity scores. Character-based tokenization allows more precise calculations of the token overlap, and thus might be suitable for STS tasks.

Relevance between entailment and similarity
We next analyze relations between entailment labels and similarity scores in cases where model predictions are difficult. Table 9 shows a distribution of accuracies from the jaRoBERTa-large and mBERT NLI models fine-tuned with JSICK for each similarity score. These results show that both Japanese and multilingual models struggled to predict entailment labels for pairs that have low similarity scores but whose gold labels are contradiction. Both models also failed to predict cases where premise sentences are very similar to their hypothesis sentences but their gold labels are neutral. Table 10 shows a distribution of Pearson correlations on the JSICK STS test set for each entailment label. These results show that the STS models have the same trend for contradiction examples as do the NLI models; the STS models failed to predict low similarity scores in cases where a premise sentence contradicts a hypothesis sentence.
phenomena. This suggests that training data quality is more critical for learning linguistic phenomena than is quantity.

Comparison with other languages
We next compare the performance of mBERT on the JSICK dataset with that on the original English SICK dataset (SICK-EN), the Portuguese SICK-BR dataset (Real et al., 2018), and the Dutch SICK-NL dataset (Wijnholds and Moortgat, 2021). Since gold labels for the JSICK datasets differ from those for the SICK-EN and SICK-NL datasets (SICK-NL uses the same labels as SICK-EN), we also evaluated mBERT while assuming JSICK gold labels to be the same as those for SICK-EN. Table 12 shows the baseline results for mBERT with different languages in the SICK test set. For both STS and NLI tasks, mBERT performance was relatively higher for Japanese SICK than that for the other datasets.
Table 13 shows confusion matrices for multilingual BERT models on different languages of the SICK NLI test set. Comparing across languages, mBERT performance for contradiction cases was lower in Japanese. Table 14 compares mBERT performance on different languages of the SICK test set for each Japanese linguistic tag. Note that since linguistic phenomena manifest differently by language, Table 14 shows only an approximate comparison of linguistic phenomena. For the STS task, there was little difference among languages, but performance tended to be lower for problems involving additive particles and anaphora. Performance for problems involving additive particles was also low in the NLI task. Performance for problems involving disjunction was low for all languages.
The results of our experiments suggest that multilingual BERT models achieved high performance on SICK across languages.However, the results related to multilingual SICK for each linguistic tag indicate room for improvement regarding the use of multilingual models to capture anaphora, disjunction, and additive particles.Moreover, it remains to be investigated whether pre-trained language models are sensitive to compositional aspects of inference, such as word order and case marking in Japanese.In the next section, we describe the extent to which language models capture word order and case particles, phenomena that are characteristic of Japanese.

Evaluation setting
Japanese grammar allows both subject-object-verb and object-subject-verb orders, with the former usually taken as the basic word order and the latter derived by a scrambling operation (Hoji, 1985; Saito, 1985). Instead of word order, postpositional case particles function as case markers.
For example, the case particles ga, ni, o represent the nominative, dative, and accusative cases, respectively. The JSICK test set contains 1,666, 797, and 1,006 premise-hypothesis sentence pairs (A, B) whose premise sentences A include basic word orders involving ga-o (nominative-accusative), ga-ni (nominative-dative), and ga-de (nominative-instrumental/locative) relations, respectively. By transforming the syntactic structures of these pairs, we created a JSICK stress-test dataset involving word scrambling and particle-swapping to analyze whether models correctly capture the free-order property of Japanese.
Consider the examples of (A, B) pairs in Table 15, whose gold labels are entailment. We first provide a scrambled pair (A_ord, B), where the word order of the premise sentence A is scrambled into o-ga, ni-ga, or de-ga order. Since the meaning of the sentence A_ord is the same as that of the original sentence A, the semantic relatedness of the scrambled pair (A_ord, B) should be the same as that of (A, B). To analyze whether models consider Japanese case particles when predicting entailment labels and similarity scores, we use a rephrased pair (A_case, B), where only the case particles in the premise A are swapped, and a rephrased pair (A_del, B), where the case particles in A are deleted. Since these transformations affect case relations in premise sentences, the meanings of sentences A_case and A_del should differ from those of the original sentence A. The semantic relatedness of the rephrased pairs (A_case, B) and (A_del, B) should thus also be changed. If a model has generalized word order and case particles, it should consistently predict the same labels for both (A, B) and (A_ord, B). Moreover, the model should change the labels for (A_case, B) and (A_del, B) to neutral. We therefore checked the extent to which models changed predictions for (A_ord, B), (A_case, B), and (A_del, B) pairs as compared with those for (A, B).
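The three transformations can be illustrated with a toy string-level sketch. This is only an illustration of what each rephrasing does to a premise; it is not how the stress-test set is built (that uses parse-tree rewriting), and the sentence, the function names, and the pre-segmented (noun phrase, particle) representation are all assumptions made for this example.

```python
def render(args, predicate):
    """Join (noun phrase, particle) arguments and the predicate into a sentence."""
    return "".join(np + particle for np, particle in args) + predicate


def scramble(args):
    """A_ord: reverse the two arguments, keeping each particle with its noun phrase."""
    first, second = args
    return [second, first]


def swap_particles(args):
    """A_case: keep the noun phrase order but exchange the two case particles."""
    (np1, p1), (np2, p2) = args
    return [(np1, p2), (np2, p1)]


def delete_particles(args):
    """A_del: remove the case particles from both arguments."""
    return [(np, "") for np, _ in args]


# Illustrative ga-o premise: "A man is playing a guitar."
args = [("男性", "が"), ("ギター", "を")]
predicate = "弾いている"

a_orig = render(args, predicate)
a_ord = render(scramble(args), predicate)
a_case = render(swap_particles(args), predicate)
a_del = render(delete_particles(args), predicate)
```

Scrambling leaves the case relations intact (the man is still the one playing), whereas swapping makes the guitar the nominative argument, which is why only the latter is expected to change the entailment label.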
To rephrase premise sentences, we first parse the sentences using the Japanese constituency parser depccg (Yoshikawa et al., 2017), transform the parse trees using Tsurgeon (Levy and Andrew, 2006), and produce their surface strings as rephrased premise sentences. For the JSICK stress set, we evaluated four Japanese pre-trained language models (jaRoBERTa-large, jaRoBERTa-base, jaBERT-large, and jaBERT-base (WHOLE)) and three multilingual models (XLM-RoBERTa-large, XLM-RoBERTa-base, and mBERT) fine-tuned with the JSICK training set. For the NLI task, we also used crowdsourcing to collect human judgments on a subset of the JSICK stress set. We asked the same annotators as those assigning entailment labels for the JSICK dataset to also annotate entailment labels for the JSICK stress set. We also asked the annotators whether each sentence pair was natural, meaning both premise and hypothesis sentences were grammatically correct. Note that we asked them to judge entailment labels even for unnatural sentences. We selected 100 examples for each of three rephrase types and three case particle types, for 900 inference problems in total.

Results
Table 16 compares percentages of model and human predictions for entailment labels for the JSICK stress-test dataset that are the same as those for the original JSICK test set. We also show the results for the human naturalness (acceptability) rating task. While humans predicted the same labels for the scrambled examples, they changed their labels for examples where only the case particles were swapped or deleted. Interestingly, humans tended to change more predictions for those examples where only the case particles are swapped than for those where the case particles are deleted, but the acceptability rate for the former was much higher than that for the latter. One reason for this is that Japanese case particles can be dropped when adjacent to a verb (Saito, 1985), in which case humans can supply the dropped case particles. On the other hand, the models predicted nearly the same labels for (A_ord, B) pairs as for (A, B) pairs. In addition, the predicted labels for (A_case, B) and (A_del, B) pairs remained almost the same as those for the original (A, B) pairs. To investigate details of the model performance, we computed the percentage of predictions for each case particle in the JSICK stress-test dataset that are the same as those predicted for the original test set, as shown in Table 17. In the NLI task, all pre-trained language models predicted nearly the same labels even when the case particle is swapped or deleted, regardless of model type and the kind of case particle. These results indicate that the models predict entailment labels without considering word order and case particles. Similarly, for the STS task, Pearson correlations for (A_ord, B) and (A_case, B) pairs were nearly the same as those for the original (A, B) pairs. Pearson correlations for (A_del, B) pairs were a little lower than those for the original (A, B) pairs, because the deletion of case particles in A_del decreases word overlap between A_del and B.
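The quantity reported for the NLI stress test, the percentage of predictions that stay the same after a transformation, is straightforward to compute. The sketch below uses a hypothetical function name and assumes the two prediction lists are aligned item by item.

```python
def same_prediction_rate(original_preds, stress_preds):
    """Percentage of stress-test predictions identical to the original predictions."""
    if len(original_preds) != len(stress_preds):
        raise ValueError("prediction lists must be aligned and of equal length")
    same = sum(o == s for o, s in zip(original_preds, stress_preds))
    return 100.0 * same / len(original_preds)
```

Under the expectations stated in the evaluation setting, a model that uses word order and case particles appropriately would score high on scrambled pairs and low on particle-swapped or particle-deleted pairs; a high rate on all three is the insensitivity pattern described above.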

Augmenting training data for sensitivity to case particles
To analyze whether data augmentation improves model behavior with respect to word order and case particles, we rephrased a small subset of the training set in three ways, creating (i) data where the NP arguments are scrambled but the gold label is the same as the original, (ii) data where the case particle is deleted and the gold label is set randomly, and (iii) data where only the case particles are swapped and the gold label is set randomly. We added 300 examples of each data type. These additional training data expose models to three cues: (i) the order of NP arguments does not change entailment labels for sentence pairs, (ii) the presence or absence of a case particle can change entailment labels, and (iii) the position of a case particle can change entailment labels.
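The three augmentation types can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the actual data: the `Example` class, the representation of a premise as (noun phrase, case particle) pairs plus a predicate, the three helper functions, and the toy sentence pair and its label are all hypothetical.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple

LABELS = ["entailment", "contradiction", "neutral"]


@dataclass(frozen=True)
class Example:
    # Assumed representation: premise = (noun phrase, case particle) pairs + predicate.
    args: Tuple[Tuple[str, str], ...]
    predicate: str
    hypothesis: str
    label: str

    def premise(self) -> str:
        return "".join(np + particle for np, particle in self.args) + self.predicate


def scramble(ex: Example) -> Example:
    """(i) Reverse the order of the NP arguments; the gold label is kept."""
    return replace(ex, args=tuple(reversed(ex.args)))


def delete_particle(ex: Example, index: int, rng: random.Random) -> Example:
    """(ii) Delete one case particle; the gold label is set randomly."""
    new_args = tuple(
        (np, "" if i == index else particle)
        for i, (np, particle) in enumerate(ex.args)
    )
    return replace(ex, args=new_args, label=rng.choice(LABELS))


def swap_particles(ex: Example, rng: random.Random) -> Example:
    """(iii) Swap the particles of the first two arguments; label set randomly."""
    (np1, p1), (np2, p2) = ex.args[0], ex.args[1]
    return replace(ex, args=((np1, p2), (np2, p1)) + ex.args[2:],
                   label=rng.choice(LABELS))


rng = random.Random(0)  # fixed seed so the random labels are reproducible
toy = Example(
    args=(("子供", "が"), ("犬", "を")),  # toy sentence, not taken from JSICK
    predicate="追いかけている",
    hypothesis="子供が犬を追っている",
    label="entailment",
)
print(scramble(toy).premise())                 # 犬を子供が追いかけている
print(delete_particle(toy, 1, rng).premise())  # 子供が犬追いかけている
print(swap_particles(toy, rng).premise())      # 子供を犬が追いかけている
```

Keeping the label for scrambled examples while randomizing it for the particle edits mirrors the three cues described above: argument order alone should not move the label, whereas a change to a case particle may.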

Conclusion
We introduced JSICK, a Japanese standard NLI/STS dataset, by manually translating English SICK into Japanese and re-annotating its gold labels. In baseline experiments, we compared the performance of various pre-trained Japanese language models on JSICK. While the Japanese RoBERTa-large model achieved state-of-the-art performance, multilingual pre-trained language models achieved comparable results. Experiments with multilingual models on SICK datasets in different languages, including JSICK, showed that the performance of multilingual models was relatively low on inference problems involving anaphora, disjunction, and additive particles. Furthermore, to investigate the extent to which Japanese and multilingual pre-trained language models are sensitive to word order and case particles, we constructed a JSICK stress-test dataset by applying word scrambling and case-particle swapping to JSICK. The results on that dataset suggest that neither Japanese nor multilingual models consider word order and case particles when making predictions for Japanese NLI/STS tasks. These are novel findings that are not obtainable from other datasets, including SICK in English and other languages. Overall, the results suggest that there is large room for improvement in both Japanese and multilingual pre-trained language models regarding their sensitivity to flexible word order and their representations of case particles. Further improvements might be obtained through more adequate representations of Japanese vocabulary in multilingual pre-trained language models (Chung et al., 2020; Rust et al., 2021). We believe our dataset will be useful in future research toward more advanced models capable of appropriately performing multilingual compositional inference.

Table 1: Examples from the JSICK dataset. Each ID corresponds to the ID in the original SICK dataset.
A: [二人 NUM]の女性が群衆の前でダンスをし[ながら CONJ]歌っている ([Two NUM] women are dancing [and CONJ] singing in front of a crowd)
B: [二人 NUM]の女性が[多く QUANT]の人の前でダンスをし[ながら CONJ]歌っている ([Two NUM] women are dancing [and CONJ] singing in front of [many QUANT] people)

Table 2: Distribution of linguistic tags in JSICK.

Table 3: Distribution of JSICK and SICK sentence pairs for each gold entailment label and similarity score. Numbers in parentheses are percentages of the entire dataset.

Table 4: Distribution of JSICK sentence pairs across NLI and STS tasks.

Table 5: Comparison of linguistic tags between JSICK and previous Japanese NLI datasets. Numbers in parentheses are percentages of the entire test set.

Table 6: Examples of linguistic factors that cause differences in entailment labels between English and Japanese.
A: ある 人 が キノコ を ナイフ で 切って いる (A person is cutting a mushroom with a knife)
   a person Nom mushroom Acc knife with cutting is
B: ある 人 が いくつかの キノコ を 切って いる (A person is cutting some mushrooms)
   a person Nom some mushroom Acc cutting is
ID: JSICK-2590
Lexical Gap
A: ある 人 が ピンク色 の ロープ で 岩 を 登って いる (A person is climbing a rock with a rope, which is pink)
   a person Nom pink is rope with rock Acc climbing is
B: 一人の 男性 が ロープ で その 崖 を 登って いる (One man is climbing the cliff with a rope)
   one man Nom rope with the cliff Acc climbing is
ID: JSICK-648

Table 7: Baseline results with Japanese and multilingual pre-trained language models for the NLI task with JSICK and JSNLI (%).

Table 9: Distribution of accuracies for the JSICK NLI test set for each similarity score.

Table 10: Distribution of Pearson correlations for the JSICK STS test set for each entailment label.

Table 11: Results on JSICK for each linguistic tag (%). We evaluated NLI models for accuracy and STS models with the Pearson correlation × 100. N: JSNLI-train, I: JSICK-train, N+I: JSNLI+JSICK-train. Accuracies lower than the overall model accuracy are indicated in red.

Table 12: Baseline results from mBERT on different languages in the SICK test set (%).

Table 13: Confusion matrices for mBERT on different languages in the SICK NLI test set. Rec and Prec indicate "Recall" and "Precision", respectively.

Table 14: Comparison of mBERT performance for different languages in the SICK test set for each linguistic tag (%). Accuracies lower than the overall model accuracy are indicated in red. Ja (L-En) indicates evaluation results with gold labels from the original SICK test set.

Table 15: Evaluation settings for the JSICK stress-test dataset.

Table 16: Comparison between model and human predictions for entailment labels in the JSICK stress-test dataset that are the same as those for the original test set (%). Natural indicates human-rated results for naturalness (acceptability).

Table 18 shows the percentage of predictions by the jaRoBERTa-large NLI model that are the same as those for the original JSICK when a subset of rephrased examples is added to the training set. As that table shows, data augmentation changed model predictions on examples where the case particle is swapped or deleted. This indicates that although the NLI model does not implicitly learn case particles during pre-training and fine-tuning, a small amount of data augmentation to learn word order and case particles can improve the model sensitivity to case particles.

Table 17: Percentage of predictions for the JSICK stress-test dataset that are the same as those for the original test set for each case particle (%). "Yes", "No", and "Unk" indicate accuracies on entailment, contradiction, and neutral examples, respectively. For the STS task, we calculated the Pearson correlation between predictions for the original pairs and those for the rephrased pairs.

Table 18: Percentages of model predictions that are the same as those for the original JSICK when a subset of rephrased examples is added to the training set to learn word order and case particles. jaRoBERTa-l+aug shows the result with data augmentation.