Abstract
We suggest a model for partial diacritization of deep orthographies. We focus on Arabic, where the optional indication of selected vowels by means of diacritics can resolve ambiguity and improve readability. Our partial diacritizer restores short vowels only when they ease understanding of the running text being read. The idea is to identify the cases in which an absent vowel leaves an ambiguity that the reader can resolve only by looking ahead. To achieve this, two independent neural networks are used for predicting diacritics: one takes the entire sentence as input, and the other considers only the text that has been read thus far. Partial diacritization is then determined by retaining precisely those vowels on which the two networks disagree, preferring the reading based on the whole sentence over the more naïve reading-order diacritization.
For evaluation, we prepared a new dataset of Arabic texts with both full and partial vowelization. Beyond facilitating readability, we find that partial diacritization improves machine translation quality compared with either no diacritics at all or a random selection of them. Lastly, we study the benefit of knowing the text that follows the word in focus for the restoration of short vowels during reading, and we measure the degree to which such lookahead contributes to resolving ambiguities encountered while reading.
L’Herbelot had asserted, that the most ancient Korans, written in the Cufic character, had no vowel points; and that these were first invented by Jahia–ben Jamer, who died in the 127th year of the Hegira.
“Toderini’s History of Turkish Literature,” Analytical Review (1789)
1 Introduction
Ambiguity is part and parcel of natural language. It may manifest itself at the morphological level, the syntactic level, or at higher linguistic levels. For example, in the classic “garden path” sentence, “The old man the boat,” “old” can be a noun or an adjective, while “man” may be a noun or a verb. The point is that the prima facie more likely reading of “old man” as adjective-noun is found to be untenable by the end of the sentence, and the reader must retrace her steps and reinterpret the morphology and syntax to understand the intended meaning. Though ambiguity may be deliberate—as in poetry—it is usually desirable to keep it to a minimum.
We deal here with ambiguity at the morphological level, investigating the inclusion of optional disambiguating diacritics. We suggest a novel criterion for partial diacritization, namely, just enough to obviate the need for lookahead for disambiguation—to the extent possible. In other words, disambiguating diacritics are called for when the most likely interpretation—considering only what precedes in reading order—is erroneous.
Semitic languages form a branch of languages originating in the Middle East and include, among others, Arabic, Hebrew, and Aramaic. Most of the writing systems (orthographies) of those languages omit some or all vowels from their alphabet. Daniels and Bright (1996), in their sixfold classification of writing systems, call such scripts abjads. The missing vowels are typically covered by a set of diacritics, serving as a phonetic guide, but these signs tend to be omitted in standard writing.
In Arabic, there are a number of such short-vowel diacritics, collectively named harakat. Long vowels, on the other hand, are represented by a collection of matres lectionis, letters that otherwise serve as consonants (alif, waw, ya). Modern Hebrew is somewhat similar, but the use of matres lectionis is more haphazard. In both, full or almost-full vocalization (vowelization) is normally reserved for scripture and other archaic works, verse, works for children or beginners, and for loan words or foreign names.
A common characterization in modern psycholinguistics is that unvocalized Arabic and Hebrew have deep orthography since there is no one-to-one mapping between phonemes and graphemes. At the opposite end of the spectrum are languages with shallow (or phonemic) orthographies, such as Finnish and Maltese (a Semitic language), for which it is usually easy to pronounce any word given its letters. Arabic orthography is considered shallow when short vowels are present (Abu-Rabia 2001). But, when they are omitted, a reader needs to use some contextual information to resolve ambiguities in pronunciation and meaning.
Overall, fully vowelized Arabic text is considered much too complicated for ordinary reading and is rarely encountered. On the other hand, the lack of written short vowels in certain words, particularly homographs, may hinder understanding and slow down reading. To resolve such pronunciation and sense ambiguities, it is often enough to add only one well-chosen short vowel. The issue we address is how to determine which added vowels are beneficial to the reader and which are excessive and undesirable.
The main motivation of this paper is to understand how to improve the human intelligibility of Arabic texts, and potentially of other languages with optional diacritics or punctuation, by automatically adding annotations that help resolve ambiguities encountered during normal reading. By adding just the minimally required vowels, suited to the language level of the reader, Arabic texts will hopefully be easier to comprehend.
For example, the word has a number of pronunciations with different meanings (e.g., “I/you/she/it discovered,” “I/you were discovered,” “she/it was discovered”), but with only two diacritics (damma and sukun) added on the key letters, reading becomes easier (“she/it was discovered”).
In many cases, deciding on the correct pronunciation of a word requires looking at the following words in the text, and not only at preceding ones. We claim that, when only information from prior words is needed to resolve any ambiguity of a given word, then the short vowels may be safely omitted, since by the time that word is encountered, the reader has already collected what is necessary for disambiguation. For example, here are two sentences (read right to left): “Hard work pays off,” and “Who is the grandfather of the boy?” The second word is jd in both sentences, but it takes a different diacritic on the last letter, which results in a completely different meaning. Only with the third word is a reader able to resolve the ambiguity of the second one.
In other languages, such as Spanish and French, a relatively small number of diacritics is required, and they often serve to disambiguate. Anecdotally, even native speakers of such languages resort to a spellchecker to insert them ex post facto and thereby save keystrokes. Many Eastern European languages, on the other hand, make extensive use of diacritics. See Figure 1.
Classical Greek and Latin were often written scripta continua, without interword spaces. Likewise, many Eastern languages do not normally use spaces or punctuation within sentences. This, too, introduces a level of ambiguity, which an analogous notion of partial punctuation could help resolve.
In English, diacritics are optional (except in some borrowed words and expressions such as “coup d’état”) and rarely used today (diaeresis on “naïve”; circumflex on “rôle”); they may indicate pronunciation but are not needed for understanding. Also, some commas and hyphens are optional punctuation in English, but can help one parse the sentence properly. In many other languages, however, diacritics are essential and are never omitted.
Our goal in this study is to improve the ease of understandability of Arabic text during reading by automatically generating diacritics, but only when they provide strong pronunciation hints to the reader. In some Arabic-language media outlets (e.g., Nature in Arabic1), partial diacritization is included to facilitate understandability. We propose a machine-learning model for partial diacritization. Deciding which vowels to recover is achieved by mimicking the way humans resolve pronunciation ambiguity. We train two networks for full diacritization, one taking the entire sentence into account, and another that considers only prior words. Partial diacritization is achieved by preserving those diacritics on which the two networks disagree, suggesting that without them disambiguation would require lookahead. Upon disagreement, we always take the diacritic predicted by the network trained on the entire sentence to be the final assignment.
It has been claimed (Marcus 1978) that to handle ambiguities as in garden-path sentences, which make understandability during reading more difficult, it is necessary to parse natural language either nondeterministically or by a deterministic parser with lookahead (LR(k)) capabilities. Our approach to diacritization is similar in using lookahead to determine what requires disambiguation.
We propose a new, dual neural-network architecture, designed to mimic human linear reading, on the one hand, and to model the impact of lookahead, on the other. By comparing the annotations of the two, one can determine what actually requires lookahead and what depends only on preceding text. We apply the dual-network approach to the specific problem of inferring sufficient optional diacritics to facilitate comprehension by human readers. The same approach can be applied to other reading aids and can be tuned to the language level of the reader and the amount of lookahead presumed.
2 Related Work
Comparing reading processes in languages of different orthographic depths (Liberman et al. 1980; Frost, Katz, and Bentin 1987; Katz and Frost 1992) is still an active area of research. In particular, the contribution of short vowels to the reading of Arabic has been studied. Whereas several studies report a positive contribution (Abu-Rabia 1995, 1996, 1997a, 1997b, 1998a, 1998b, 1999; Abu-Hamour, Al-Hmouz, and Kenana 2013; Taha 2016), a number of others (Ibrahim 2013; Asadi 2017) have shown a decrease in reading fluency (measured as the time to correctly read a text) and accuracy (the percentage of words correctly pronounced), due to the visual load and complexity of short-vowel diacritics in Arabic. A recent review (Abu-Rabia 2019) summarizes the conflicting results.
Bouamor et al. (2015) conducted a study of human annotation for minimal Arabic diacritization that showed a low inter-annotator agreement and demonstrated how subjective this task can be.
Various works apply deep neural networks to restoration of diacritics. Examples include Náplava et al. (2018) for Polish; Nuţu, Lőrincz, and Stan (2019) for Romanian; Hucko and Lacko (2018) for Slovak; Uzun (2018) for Turkish; and Nguyen et al. (2020); Hung (2018); Nga et al. (2019); Alqahtani, Mishra, and Diab (2019) for Vietnamese. A recent work (Stankevičius et al. 2022) employs a transformer-based (Vaswani et al. 2017) ByT5 network (Xue et al. 2022) for 12 European languages plus Vietnamese. Other transformer-based diacritization networks include Laki and Yang (2020); Dang and Nguyen (2020). For state-of-the-art vowelization of Hebrew, see Shmidman et al. (2020).
There is a large body of work on full Arabic diacritization. Early work took a more traditional machine-learning approach (Zitouni and Sarikaya 2009; Darwish, Mubarak, and Abdelali 2017); recent efforts are usually based on deep neural setups (Abandah et al. 2015; Belinkov and Glass 2015; Alqahtani, Mishra, and Diab 2019; Fadel et al. 2019a,2019b; Mijlad and Younoussi 2019; Mubarak et al. 2019; Abbad and Xiong 2020; AlKhamissi, ElNokrashy, and Gabr 2020). A few studies (Zalmout and Habash 2020; Alqahtani, Mishra, and Diab 2020) show the contribution of morphological data to diacritization. An encoder-decoder network using a Tacotron CBHG module (Wang et al. 2017) as part of the encoder has been introduced (Madhfar and Qamar 2020).
As may be expected, diacritization can help in morphological analysis (Habash, Shahrour, and Al-Khalil 2016) and with other natural language processing tasks (Alqahtani, Mishra, and Diab 2020). Recently, Alqahtani, Aldarmaki, and Diab (2019) evaluated the contribution of incomplete restoration of Arabic diacritics to a number of such downstream tasks. Estimating the errors introduced by a full diacritization algorithm, their approach is to restore the diacritics only for ambiguous words, which is what they refer to as selective diacritic restoration. Fadel et al. (2019a) developed a deep recurrent neural network for diacritization, which was reported to positively contribute to neural machine translation, by encoding the diacritics on a parallel layer to the input characters. We will use their architecture as a downstream task in the evaluation of our model for partial diacritic restoration. A decade earlier, Diab, Ghoneim, and Habash (2007) measured the contribution of different partial diacritization schemes to a statistical Arabic-to-English translation system. They found that translation quality is not improved when the input is partially diacritized. At the same time, they showed that the translation quality significantly deteriorates when the input is provided fully diacritized.
All the same, we focus more on improving understandability during reading, as opposed to improving the accuracy of a downstream-task algorithm. Our goal is to generate diacritics only for letters (not necessarily all letters of a word) that resolve ambiguities encountered during continuous reading of a running text. We intentionally do not resolve ambiguities that can be handled with information already available while reading a sentence normally from the beginning forward. To the best of our knowledge, this is the first time that this goal is being addressed algorithmically. The closest related work is by Alnefaie and Azmi (2017), who developed an algorithm for partial diacritization of a sentence by filtering out inconsistent morphological analyses of the words given the sentence as context. In their case, each word is diacritized to distinguish the intended reading from the otherwise most likely sense, whereas we aim to provide only enough diacritics to disambiguate the word considering the preceding context—as is commonly done in actual texts. Their morphological analyses were retrieved using a lexicon that contains all potential analyses for the words in a given sentence. Naturally, diacritics can be assigned only to words that exist in the lexicon, a limitation that does not exist in our approach. Their evaluation was done manually on a set of sentences, which have not been made publicly available.
We are, unfortunately, unaware of the existence of any relevant resources that could help train a supervised machine-learning algorithm for partial diacritic generation.
3 Arabic Morphology
Arabic, like most Semitic languages, enjoys a rich morphology. This includes verbal inflection (binyan, tense, mood, etc.), nominal cases, construct forms, prefixes for conjunctions, prepositions, the definite article, and more, as well as suffixes indicating gender, number (singular, dual, plural), and pronominal possessives. Altogether, these result in a high degree of ambiguity for Arabic words, with about 12 potential morphological analyses per word (without extra diacritics) on average (Habash 2010). An example of the morphological complexity of words is given in Figure 2. Arabic is written right to left. Letters often change form depending on their position (initial, medial, final) within the word.
4 Methodology
4.1 Morphological Ambiguity
Partial diacritization is the process of inferring a minimal subset of diacritics that is essential for disambiguating the context. This task is not well defined, however, and there are no conventions or explicit rules for how to accomplish it. We distinguish between ambiguous words that may be resolved using previously seen context and ambiguous words that need some of the context that follows in order to be resolved while reading. The former are easier to resolve when reading; therefore, we try to restore diacritics only for letters of words for which a reader would usually need to look ahead, so as to ease understanding.
To imitate human readers and the hurdles they face, we train two distinct neural models for the task of full restoration of diacritics. One encodes only information available in reading order, up to and including the word being predicted, ignoring what comes after. The second scans the entire sentence before diacritizing it in full; it is therefore assumed that this model has a better chance of predicting the correct diacritic. The idea is to provide pronunciation hints to the reader by assigning diacritics to letters only when they cannot be trivially decoded from the content already taken into account by the first, unidirectional network. Therefore, we train both networks to restore diacritics in full, and at inference time we assign the diacritics predicted by the second model only to letters for which the two models made different predictions. We describe both models in greater detail below.
Generally speaking, the input is composed of a sequence of Arabic characters c1,c2,…,cn, and the target is to predict a single label di for each character ci, representing its diacritic. Like previous work, we use the following set of labels to account for most of the diacritic types:
- (a) Three short vowels (harakat): fatḥah /a/; kasrah /i/; and ḍammah /u/.
- (b) Three nunations (tanwins) to indicate case endings: /an/, /in/, and /un/.
- (c) Gemination: shaddah.
- (d) Sukūn (a circle above a consonant) to indicate vowel absence.
- (e) Six more labels capturing the various combinations of gemination with a vowel or a nunation.
- (f) A final label, no diacritic, indicating the absence of any diacritic on the letter.
All told, there are 15 labels.
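For concreteness, the label inventory can be written out as a fixed index mapping. The following sketch is illustrative only: the label names, ordering, and index values are our own choices, not something prescribed above; only the count and the categories follow the description.

```python
# Illustrative 15-label inventory for per-character diacritic prediction.
# Names, ordering, and indices are assumptions; the count and the categories
# follow the description in the text.
DIACRITIC_LABELS = [
    "fatha", "kasra", "damma",                                 # (a) short vowels
    "fathatan", "kasratan", "dammatan",                        # (b) nunations
    "shadda",                                                  # (c) gemination alone
    "sukun",                                                   # (d) vowel absence
    "shadda+fatha", "shadda+kasra", "shadda+damma",            # (e) gemination + vowel
    "shadda+fathatan", "shadda+kasratan", "shadda+dammatan",   # (e) gemination + nunation
    "no_diacritic",                                            # (f) nothing on the letter
]
LABEL_TO_ID = {label: i for i, label in enumerate(DIACRITIC_LABELS)}
assert len(DIACRITIC_LABELS) == 15
```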
4.2 Reading-Direction Model
For the first model, we use a four-layer unidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) architecture that works at the character level and predicts one label per input character. We choose an LSTM because, in this model, we are interested in mimicking the short-term memory characteristic of human reading. We are not interested only in restoring diacritics that can be deduced merely by looking ahead a few words; rather, we also aim to restore diacritics that are difficult to determine due to a high level of contextual ambiguity.
In addition to this LSTM architecture, we also experimented with an alternative, a character-based 6-layer transformer (Vaswani et al. 2017) encoder, each layer composed of 8 attention heads. To force the encoder to incorporate information only from characters that have been read so far, we masked all characters that follow the specific word that contains the processed character.
The LSTM architecture is shown in Figure 3(a). The alternate, transformer architecture is similar to the full-sentence model, which we describe next, and which is depicted in Figure 3(b).
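To make the masking scheme concrete, here is a minimal PyTorch sketch of how such a word-boundary attention mask could be built. The whitespace-based word segmentation, the function name, and the boolean convention (True marks a blocked position, as accepted by torch.nn.TransformerEncoder's mask argument) are implementation assumptions of ours, not details given in the text.

```python
import torch

def reading_direction_mask(text: str) -> torch.Tensor:
    """Attention mask for a character-level encoder: position i may attend to
    position j only if j belongs to the same word as i or to an earlier word.
    Returns a (len(text), len(text)) boolean tensor where True = blocked."""
    word_idx, w = [], 0
    for ch in text:
        word_idx.append(w)      # tag each character with its word index
        if ch == " ":           # a space closes the current word
            w += 1
    idx = torch.tensor(word_idx)
    # Block attention from character i to any character j of a later word.
    return idx.unsqueeze(0) > idx.unsqueeze(1)
```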
4.3 Full-Sentence Model
To encode a full sentence before classification, we first experimented with several different architectures. A 6-layer transformer (Vaswani et al. 2017) working on the character level, each layer composed of 8 attention heads, followed by a fully connected layer, delivered the best accuracy for full diacritic restoration. As for the reading-direction model, we encode the input characters using a nonpretrained embedding layer. See Figure 3(b).
For training the models, we use sentences of up to 100 characters, not counting diacritics. Longer sentences are handled during training and testing by using a 100-character overlapping window that moves forward one word at a time. During testing, each character gets a number of predictions, one per window. The final prediction is decided on by weighted maximum vote (weights are predicted probabilities).
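A rough sketch of this windowing and voting procedure is given below. The helper `predict_probs`, which returns one probability vector per character of its input, is hypothetical, and for brevity the length check counts every character, whereas the 100-character budget described above excludes diacritics.

```python
import numpy as np

def diacritize_long_text(words, predict_probs, max_chars=100, n_labels=15):
    """Slide a window of at most `max_chars` characters over `words`, one word
    at a time, and resolve each character by a weighted vote: the sum of the
    predicted probabilities over all windows covering that character."""
    text = " ".join(words)
    votes = np.zeros((len(text), n_labels))
    for start in range(len(words)):
        # Grow the window word by word while it fits in the character budget.
        end = start + 1
        while end < len(words) and len(" ".join(words[start:end + 1])) <= max_chars:
            end += 1
        window = " ".join(words[start:end])
        offset = len(" ".join(words[:start])) + (1 if start > 0 else 0)
        votes[offset:offset + len(window)] += predict_probs(window)
    return votes.argmax(axis=-1)   # one label id per character of `text`
```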
4.4 Training and Hyperparameter Settings
For the reading-direction model, our best results are obtained by using a hidden size of 16 for the word-character BiLSTM, resulting in a 32-dimensional word-embedding vector, which we concatenate with another 32-dimensional vector representing a character. This 64-dimensional vector is the input to the 4-layer unidirectional LSTM with a hidden size of 512, followed by a fully connected layer with 15 output labels. We use dropout between the layers with 20% drop probability and the Adam optimizer, configured with a learning rate of 10⁻⁴. Batch size is 512. The reading-direction model has 7,505,778 parameters in total.
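The following PyTorch sketch mirrors the dimensions just described (a character BiLSTM with hidden size 16 per direction producing a 32-dimensional word vector, concatenated with a 32-dimensional character embedding, fed to a 4-layer unidirectional LSTM of hidden size 512 and a 15-way classifier). The vocabulary size, the separate embedding table inside the word encoder, and the pooling of the BiLSTM's final states are assumptions on our part.

```python
import torch
import torch.nn as nn

class ReadingDirectionLSTM(nn.Module):
    """Sketch of the reading-direction diacritizer described above."""

    def __init__(self, vocab_size=60, n_labels=15):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, 32)          # per-character embedding
        self.word_char_emb = nn.Embedding(vocab_size, 16)     # characters inside a word
        self.word_bilstm = nn.LSTM(16, 16, bidirectional=True, batch_first=True)
        self.main_lstm = nn.LSTM(64, 512, num_layers=4, dropout=0.2, batch_first=True)
        self.classifier = nn.Linear(512, n_labels)

    def embed_word(self, word_chars):
        """word_chars: (batch, word_len) character ids of one word.
        Returns a (batch, 32) word vector from the BiLSTM's final states."""
        _, (h, _) = self.word_bilstm(self.word_char_emb(word_chars))
        return torch.cat([h[0], h[1]], dim=-1)

    def forward(self, char_ids, word_vectors):
        # char_ids:     (batch, seq_len) character ids of the running text
        # word_vectors: (batch, seq_len, 32) each character paired with the
        #               vector of the word it belongs to (from embed_word)
        x = torch.cat([self.char_emb(char_ids), word_vectors], dim=-1)  # (.., 64)
        out, _ = self.main_lstm(x)
        return self.classifier(out)   # (batch, seq_len, 15) logits

model = ReadingDirectionLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate as in the text
```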
For the full-sentence and reading-direction transformer models, we use 6 layers for both encoder and decoder, 8 attention heads, and a hidden size of 512. We use dropout with 10% drop probability and ReLU as the activation function. The full-sentence model has a total of 25,308,690 parameters.
We trained the reading-direction LSTM model for 10 epochs and reached optimal results on a validation set after 8; each epoch took about 12 hours on average. The transformer models reach optimal results after 6 epochs. (It takes about 55 minutes to run 10 epochs on a GeForce GTX 1080 Ti.)
5 Datasets
We make use of two datasets, one mostly consisting of classical Arabic texts and the other of tweets in modern standard Arabic.
5.1 Tashkeela
To train both models, we use the Tashkeela corpus (Zerrouki and Balla 2017), comprising more than 65M words, fully diacritized. It mostly contains classic works written in Classical Arabic (CA), the forerunner of Modern Standard Arabic (MSA), the main language used today in formal settings (in contradistinction to spoken Arabic, with its many mutually unintelligible regional varieties). Modern and Classical Arabic have much in common, but oftentimes they use different grammatical structures and vocabularies. Only 1% of the texts in Tashkeela are written in MSA.
We preprocess the corpus in a way similar to Fadel et al. (2019b); additionally, we replace a few rare letters with their natural equivalents (e.g., Farsi yeh with Arabic yeh and Farsi peh with Arabic beh). Furthermore, in Tashkeela, not all letters carry diacritics. Because we train our model one line at a time, we delete from the corpus lines that have a low rate of diacritics per letter, to maintain relatively high support for all labels. Fadel et al. (2019b) removed lines below a rate of 80%. Because it appears that Modern Arabic has a lower rate than Classical Arabic, we remove lines below 50%, so as to have more MSA text during training. This left us with 20K lines of MSA text (604K words) out of the original 50K lines (801K words). Additionally, we added the Holy Quran (HQ), fully diacritized, to the CA corpus. Table 1 summarizes the number of words and lines in our corpus.
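A minimal sketch of the line-filtering step follows; the set of diacritic code points and the reliance on Unicode categories to identify Arabic letters are our simplifications of the actual preprocessing.

```python
import unicodedata

# Arabic diacritic code points: tanwins, harakat, shadda, sukun.
ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def diacritic_rate(line: str) -> float:
    """Fraction of Arabic letters in `line` carrying at least one diacritic."""
    letters = sum(1 for ch in line
                  if unicodedata.category(ch) == "Lo" and "ARABIC" in unicodedata.name(ch, ""))
    if letters == 0:
        return 0.0
    diacritized, prev_is_letter = 0, False
    for ch in line:
        if ch in ARABIC_DIACRITICS and prev_is_letter:
            diacritized += 1
            prev_is_letter = False        # count each letter at most once
        else:
            prev_is_letter = unicodedata.category(ch) == "Lo"
    return diacritized / letters

def filter_corpus(lines, threshold=0.5):
    """Keep lines whose diacritics-per-letter rate meets the threshold
    (0.8 in Fadel et al. 2019b; 0.5 here, to retain more MSA)."""
    return [line for line in lines if diacritic_rate(line) >= threshold]
```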
5.2 Arabic Treebank: Part 3 v1.0
To compare our full-sentence model with two related studies, we train and evaluate it on the Arabic Treebank: Part 3 v1.0 (LDC catalog number LDC2004T11, ISBN 1-58563-298-8). The treebank contains approximately 300,000 Arabic word tokens with annotation that includes complete vocalization with case endings. Following previous work, we use the original train/test split for training and evaluation.
5.3 Tweets
Additionally, we collected 75 tweets written in MSA,2 taken from official accounts that cover news in different domains. As a first step, the tweets were fully diacritized by a professional native-speaking linguist, whom we hired specifically for this task. As a second step, the fully diacritized tweets were manually processed by three native MSA speakers, who kept 19%, 25%, and 26% of the diacritics, respectively, chosen to facilitate fluency. The speakers were instructed to read the non-diacritized version of each tweet and to keep the diacritics on those letters they had failed to read fluently. All annotators have mother-tongue command of MSA. Tables 2 and 3 show examples for two of the tweets. We will make this dataset publicly available.
6 Evaluation Results
Before considering the results of our method for partial diacritization, we check that the full-sentence network achieves reasonable accuracy. We also examine how the amount of lookahead affects fluency.
6.1 Full Restoration
To compare our full-sentence transformer model with state-of-the-art systems, we preprocess the Tashkeela+HQ dataset and use exactly the same train/test split as Fadel et al. (2019b). That means that, for the following experiment, we use the same 80% diacritics-per-character threshold as Fadel et al. (2019b). As is customary, we measure (1) diacritic error rate (DER), the percentage of misclassified letters, and (2) word error rate (WER), the percentage of words with at least one misclassified letter. Following others, we evaluate our models under several conditions. Generally speaking, case endings (only when marked by nunation; other case-ending forms exist) are deemed more difficult than basic vowels, since they are syntax related; therefore, we also report results when discounting such nunation mistakes. Similarly, we also report results ignoring mistakes in predicting no diacritic, since the Tashkeela dataset often omits legitimate diacritics, and the predicted diacritic may in fact be correct. We only account for Arabic letters; foreign-script characters and digits are ignored.
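For clarity, the two metrics can be computed as follows; the assumption that gold and predicted labels are already aligned letter by letter is ours.

```python
def der_wer(gold_words, pred_words):
    """DER: percentage of letters whose predicted diacritic label differs from
    the gold label. WER: percentage of words containing at least one such
    letter. Each argument is a list of words, a word being a list of
    per-letter label ids aligned between gold and prediction."""
    letter_errors = letters = word_errors = 0
    for gold, pred in zip(gold_words, pred_words):
        mismatches = sum(g != p for g, p in zip(gold, pred))
        letter_errors += mismatches
        letters += len(gold)
        word_errors += int(mismatches > 0)
    der = 100.0 * letter_errors / letters
    wer = 100.0 * word_errors / len(gold_words)
    return der, wer
```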
Tables 4 and 5 compare the results of our transformer model for full diacritization with the best known models on the Tashkeela+HQ corpus, under the different conditions we mentioned above, and on the Arabic Treebank Part 3 corpus, respectively. For the latter, we use the original train/test split for training and evaluating our full-sentence model.
Table 4. DER / WER (%) for full diacritization on the Tashkeela + Holy Quran test set, with and without case endings, including and excluding the no diacritic label.

| Model | Incl. no diacritic, with case | Incl. no diacritic, w/o case | Excl. no diacritic, with case | Excl. no diacritic, w/o case |
|---|---|---|---|---|
| Fadel et al. (2019a) | 2.60 / 7.69 | 2.11 / 4.57 | 3.00 / 7.39 | 2.42 / 4.44 |
| AlKhamissi, ElNokrashy, and Gabr (2020) | 1.83 / 5.34 | 1.48 / 3.11 | 2.09 / 5.08 | 1.69 / 3.00 |
| Our full-sentence model | 3.57 / 8.52 | 2.32 / 5.44 | 3.42 / 8.26 | 2.23 / 5.32 |
Table 5. DER and WER (%) for full diacritization on the Arabic Treebank: Part 3 test set.

| Model | With case ending: DER | With case ending: WER | Without case ending: DER | Without case ending: WER |
|---|---|---|---|---|
| Zitouni, Sorensen, and Sarikaya (2006) | 5.5 | 18.0 | 2.5 | 7.9 |
| Habash and Rambow (2007) | 4.8 | 14.9 | 2.2 | 5.5 |
| Our full-sentence model | 3.7 | 12.2 | 2.0 | 5.7 |
Even though our goals did not include delivering a system for full diacritization, our transformer model is almost on par with the state of the art.
Because we care more about MSA than CA, we use a lower cleaning threshold (50%) and generate a new 85%/15% train/validation split for the simple reading-direction model. The first row of Table 6 shows the results of the simple reading-direction model trained and evaluated on this split.
Table 6. DER / WER (%) of the reading-direction and full-sentence models on their validation sets and on the tweet dataset.

| Validation Dataset | Model | Incl. no diacritic, with case | Incl. no diacritic, w/o case | Excl. no diacritic, with case | Excl. no diacritic, w/o case |
|---|---|---|---|---|---|
| 50%-filtered validation | Reading direction (LSTM) | 13.0 / 34.0 | 9.3 / 26.1 | 8.3 / 25.4 | 5.3 / 16.9 |
| MSA validation | Full sentence | 6.9 / 26.6 | 7.0 / 23.0 | 6.1 / 16.0 | 5.5 / 11.5 |
| Tweets | Reading direction (LSTM) | 16.2 / 47.5 | 11.1 / 33.1 | 14.0 / 44.9 | 9.5 / 30.3 |
| Tweets | Reading direction (Transformer) | 15.4 / 36.7 | 11.3 / 26.9 | 12.4 / 33.1 | 9.0 / 23.6 |
| Tweets | Full sentence | 9.2 / 27.7 | 6.3 / 18.1 | 7.2 / 23.6 | 5.0 / 15.0 |
Our ultimate goal is to generate partial diacritics by using the predictions of the full-sentence model only when the two models disagree. Therefore, we would like to make our full-sentence transformer model more capable than the simple reading-direction model. To improve performance of our full-sentence model on MSA, we train the model in two phases: (1) pre-training with CA+HQ texts (with 85%/15% train/validation split), and then (2) fine-tuning with MSA texts (with 90%/10% train/validation split), to end with weights that handle MSA better than CA. This fine-tuning training style gave us an improvement of 1.5% in word error over the same model that was trained on the entire training set in one phase. The second row in Table 6 shows the final results of the two-phase fine-tuned model on the MSA-only validation set. Finally, we evaluate all three models—the two simple reading-direction models trained on the 50%-filtered training set as well as the full-sentence two-phase model—on our MSA-only fully-diacritized tweets (last three rows). The full-sentence model clearly performs better across all metrics, as it should. The third row in Table 2 shows the diacritics predicted by our full-sentence model. As can be seen, the accuracy for this tweet is nearly perfect, including nunations and geminations.
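A schematic version of this two-phase schedule is shown below; the epoch counts per phase, the loss function, and the reuse of a single optimizer are assumptions, since only the overall strategy (pre-train on CA+HQ, then fine-tune on MSA) is specified above.

```python
import torch
import torch.nn as nn

def two_phase_training(model, ca_hq_loader, msa_loader,
                       epochs_pretrain=5, epochs_finetune=2, lr=1e-4):
    """Pre-train on Classical Arabic + Quran, then fine-tune on the smaller
    MSA portion so that the final weights favor MSA."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for loader, epochs in [(ca_hq_loader, epochs_pretrain), (msa_loader, epochs_finetune)]:
        for _ in range(epochs):
            for chars, labels in loader:          # (batch, seq_len) each
                optimizer.zero_grad()
                logits = model(chars)             # (batch, seq_len, 15)
                loss = loss_fn(logits.transpose(1, 2), labels)
                loss.backward()
                optimizer.step()
    return model
```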
The transformer-based reading-direction model performs slightly better than the LSTM one. However, as we show in the following section, the LSTM model actually works better for the partial restoration task at hand.
We performed some error analysis on the tweet collection. Recall and precision per type are reported in Table 7. For easier reading, in Table 8 we aggregate the diacritic types into four main categories: vowels (including fatḥah, kasrah, ḍammah, and their corresponding geminated versions), nunations (including their geminated versions), sukun, and no diacritic. (We ignore the two instances of shaddah.) Both precision and recall are calculated by comparing the models' predictions with the gold-standard tweets. Overall, we see that the full-sentence transformer model performs better than the reading-direction model across all categories. Restoring nunation diacritics seems to be the most challenging task, especially for the reading-direction model. The geminated types are more difficult to predict than the non-geminated ones.
Table 7. Precision and recall (%) per diacritic type on the tweet dataset, with support counts (#).

| Type | Full Sentence: Precision | Full Sentence: Recall | Reading Direction: Precision | Reading Direction: Recall | # |
|---|---|---|---|---|---|
| fatḥah | 92 | 92 | 85 | 86 | 1,493 |
| kasrah | 92 | 92 | 85 | 82 | 875 |
| ḍammah | 89 | 86 | 77 | 70 | 427 |
| Geminated fatḥah | 65 | 91 | 57 | 82 | 113 |
| Geminated kasrah | 69 | 79 | 63 | 60 | 53 |
| Geminated ḍammah | 67 | 85 | 50 | 54 | 26 |
| fatḥatān | 86 | 76 | 62 | 60 | 25 |
| kasratān | 70 | 76 | 39 | 49 | 49 |
| ḍammatān | 68 | 70 | 52 | 48 | 27 |
| Geminated fatḥatān | 0 | 0 | 0 | 0 | 0 |
| Geminated kasratān | 56 | 100 | 50 | 60 | 5 |
| Geminated ḍammatān | 0 | 0 | 0 | 0 | 1 |
| shaddah | 0 | 0 | 0 | 0 | 2 |
| sukun | 91 | 91 | 85 | 88 | 642 |
| no diacritic | 99 | 96 | 97 | 95 | 2,290 |
Table 8. Precision and recall (%) on the tweet dataset, aggregated into four main categories.

| Type | Full Sentence: Precision | Full Sentence: Recall | Reading Direction: Precision | Reading Direction: Recall | # |
|---|---|---|---|---|---|
| Vowel | 96 | 98 | 94 | 94 | 2,987 |
| Nunation | 72 | 79 | 47 | 62 | 107 |
| Sukun | 85 | 88 | 77 | 70 | 642 |
| no diacritic | 99 | 96 | 97 | 95 | 2,290 |
Additionally, in Table 9 we compared the prediction performance of the reading-direction model with the predictions of the full-sentence model, used as the gold standard. Following our partial restoration design, we keep those diacritics on which the two models disagree; therefore, this comparison shows that our partial-restoration approach keeps a relatively large percentage of the nunations and only a relatively small percentage of the vowels.
6.2 Partial Restoration
Both models are combined and used for generating partial diacritics as part of our model-difference approach. As mentioned above, diacritics are assigned only to letters for which the two networks disagree with regard to their predicted labels; in such cases, we output the predicted label of the full-sentence model, completely ignoring the label of the simplistic reading-direction model.
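The combination rule itself is simple enough to state in a few lines of code; the sketch below assumes per-character label ids and the illustrative index of no diacritic from the label inventory sketched in Section 4.1.

```python
def partial_diacritics(reading_dir_labels, full_sentence_labels, no_diacritic_id=14):
    """Model-difference rule: wherever the two models agree, output no
    diacritic; wherever they disagree, output the full-sentence model's
    prediction, ignoring the reading-direction model's label."""
    return [fs if rd != fs else no_diacritic_id
            for rd, fs in zip(reading_dir_labels, full_sentence_labels)]
```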
To evaluate, we match the predicted partial diacritics with those manually assigned to the tweets by native speakers. Our system decided to keep about 12% of the restored diacritics, while the native speakers kept 19%–26%. For a baseline, we randomly select 12% of the diacritics that were manually assigned by Annotator 1, who has the highest level of agreement with the model. Table 10 shows the improvement achieved by our model-difference method, using both the LSTM and transformer architectures, with the kappa coefficient for the 15-label partial diacritic restoration task (recall that one of the labels represents no diacritic), and F1 for a binary-classification task of diacritic assignment (with positive label being diacritic assignment). We can clearly see that our partial diacritization algorithm performs better than a baseline random selection of diacritics. It adds fewer diacritics than a human might, so recall is somewhat low, while precision is relatively high (because full-sentence diacritization is good), for an F1 score of 0.80–0.84. Subjectively, the results are quite pleasing for a native reader.
Table 10. Kappa / F1 agreement between each partial diacritization and each annotator's manual selection.

| Diacritization | Annotator 1 | Annotator 2 | Annotator 3 |
|---|---|---|---|
| Random partial restoration | 0.11 / 0.77 | 0.10 / 0.80 | 0.11 / 0.77 |
| Our partial restoration (LSTM) | 0.26 / 0.82 | 0.22 / 0.84 | 0.22 / 0.81 |
| Our partial restoration (Transformer) | 0.15 / 0.80 | 0.13 / 0.83 | 0.16 / 0.80 |
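The two agreement figures reported per annotator in Table 10 can be computed as sketched below; the use of scikit-learn and the binary encoding of "diacritic assigned" are our choices.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

def agreement_with_annotator(model_labels, annotator_labels, no_diacritic_id=14):
    """Per-character agreement between the partially restored diacritics and
    an annotator's manual selection: Cohen's kappa over the 15-label task and
    F1 for the binary task 'was a diacritic assigned to this letter?'."""
    kappa = cohen_kappa_score(annotator_labels, model_labels)
    gold_binary = [lab != no_diacritic_id for lab in annotator_labels]
    pred_binary = [lab != no_diacritic_id for lab in model_labels]
    return kappa, f1_score(gold_binary, pred_binary)
```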
Although the transformer-based reading-direction model does a better job at diacritization than does the LSTM model, it has the opposite effect when it comes to partial restoration. The transformer reading-direction model agrees with the full-sentence model more often than the LSTM model does, resulting in fewer desirable diacritics being restored, perhaps because it is taking distant contextual information into account more than a typical reader does.
The last row of Table 2 shows the partial diacritics predicted by our model-disagreement algorithm. One can see that there are 11 predicted diacritics, which are mostly aligned with the gold-standard version provided in the row above it, containing 20 diacritics. The correctly predicted fatḥah over the final ya in the fourth word is a nice example of the subjunctive mood, which requires a final fatḥah because of the word that comes before. This information was not memorized well by the simple reading-direction model, even though it was encoded during processing. Another example is the sukūn over the last letter of the eighth word. The simple reading-direction model predicted fatḥah, which is what usually comes before a noun. However, looking ahead one word, the full-sentence model was able to predict the correct diacritic, sukūn, which usually comes before verbs. See Table 3 for another example.
We have done some error analysis and learned that both of our models, reading-direction and full-sentence, make more mistakes on proper names than on any other word types, resulting in less accurate partial restoration. Additionally, case endings are usually more challenging to predict than in-word diacritic marks—for both models. We provide the aggregated confusion matrix of our partial restoration model in Figure 4. (Mistakes involving the no diacritic label are not included so as to focus on diacritic assignment mismatches.)
6.3 Impact on Translation
Fadel et al. (2019a) suggested evaluating diacritic restoration by measuring its contribution to a diacritic-sensitive Arabic-to-English neural-network translation system. Even though our goal in this study has nothing to do with improving a downstream natural-language-processing task, we decided to follow the same approach as another, extrinsic evaluation method for our partial diacritization algorithm. Therefore, we train the same translation system on one million sentence pairs, for which we restore diacritics in full with our full-sentence transformer model, and evaluate it on a test set under different conditions of diacritic assignment: partial diacritization using our model-combination approach, random partial diacritization with the same diacritic-per-character rate, no diacritics at all, and full diacritization provided by our full-sentence transformer model. The evaluation results, as BLEU scores (Papineni et al. 2002), are summarized in Table 11. They show a small improvement with our partial diacritic restoration, which kept about 10% of the diacritics, over the two baselines. Machine translation benefits more, of course, when the input comes with all diacritics, since it was trained under the same conditions.
Table 11. BLEU scores of the Arabic-to-English translation system under different diacritization conditions of the input.

| Diacritization | BLEU Score |
|---|---|
| No diacritics | 33.48 |
| Random partial diacritization | 33.34 |
| Our partial diacritization | 33.75 |
| Full-sentence full diacritization | 34.25 |
| Fadel et al. (2019a) | 34.34 |
6.4 The Contribution of Lookahead
To measure the contribution of forward-looking context to fluency of reading, we ran additional experiments with the full-sentence transformer model, placing successively larger limits on the number of words following the current word that the transformer may encode. Table 12 summarizes DER and WER on the tweet dataset for each instance; the limit on lookahead is indicated in the first column. Under all evaluation conditions, the transformer model benefits from more and more lookahead in order to fully diacritize the current word. One word lookahead has a dramatic impact. But after that, there are diminishing returns. See Figure 5.
Table 12. DER / WER (%) on the tweet dataset as a function of the lookahead limit (number of words).

| Lookahead (# words) | Incl. no diacritic, with case | Incl. no diacritic, w/o case | Excl. no diacritic, with case | Excl. no diacritic, w/o case |
|---|---|---|---|---|
| 0 | 13.70 / 43.81 | 7.87 / 23.62 | 11.63 / 39.96 | 6.49 / 20.40 |
| 1 | 9.06 / 27.47 | 6.33 / 18.63 | 7.07 / 23.31 | 5.01 / 15.50 |
| 2 | 8.89 / 27.06 | 6.22 / 18.42 | 6.88 / 23.00 | 4.89 / 15.30 |
| 3 | 8.83 / 26.53 | 6.24 / 18.31 | 6.80 / 22.48 | 4.91 / 15.19 |
| 4 | 8.77 / 26.43 | 6.18 / 18.21 | 6.74 / 22.37 | 4.85 / 15.09 |
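The lookahead-limited condition generalizes the reading-direction mask sketched in Section 4.2: a character may additionally attend to at most k following words. The sketch below makes the same implementation assumptions as before (whitespace word segmentation, True = blocked position).

```python
import torch

def lookahead_mask(text: str, k: int) -> torch.Tensor:
    """Attention mask allowing each character to attend to its own word, all
    earlier words, and at most `k` following words. k = 0 reproduces the
    reading-direction constraint; a large k approaches the full-sentence model."""
    word_idx, w = [], 0
    for ch in text:
        word_idx.append(w)
        if ch == " ":
            w += 1
    idx = torch.tensor(word_idx)
    return idx.unsqueeze(0) > idx.unsqueeze(1) + k   # True = blocked position
```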
7 Conclusions
We have proposed a novel criterion for partial diacritization of Arabic and have implemented it as the difference between two neural networks that restore diacritics in full. One network uses only context that has already been read, and the other benefits from seeing the entire sentence prior to prediction. For evaluation, we manually diacritized a set of tweets written in Modern Arabic and then selectively marked those diacritics that contribute most to disambiguation during reading. Using this dataset, as well as a diacritic-sensitive neural-network machine translation system, we found our model-difference approach to be superior to the baseline method. We proffer this dataset to the community.
It bears keeping in mind that downstream machine translation and BLEU-score evaluation are less than ideal for measuring inherent ambiguity, especially since neural machine translation systems still leave much to be desired for languages like Arabic, with or without diacritics.
We have also quantified the impact of lookahead-window size on disambiguating pronunciation—measured by correctness of diacritics. The density of automatic partial vowelization of our method could be adjusted to obviate only more distant lookahead. In future work, we hope to compare fluency with different levels of lookahead-based vowelization.
We have chosen Arabic here as a convenient case study. Partial vowelization is of practical value for that language. We would like to expand our model-difference method to additional languages and to additional aspects of disambiguation, such as optional punctuation. The same idea could also be used to suggest cases in English—or other languages—where sentence structure could be simplified to ease reading comprehension.
Acknowledgments
We thank the annotators for their assistance and the reviewers for their suggestions.
Notes
From www.twitter.com.
Author notes
Action Editor: Preslav Nakov
All authors contributed equally.