Multiattentive Recurrent Neural Network Architecture for Multilingual Readability Assessment

We present a multiattentive recurrent neural network architecture for automatic multilingual readability assessment. This architecture considers raw words as its main input, but internally captures text structure and informs its word attention process using other syntax- and morphology-related datapoints, known to be of great importance to readability. This is achieved by a multiattentive strategy that allows the neural network to focus on specific parts of a text for predicting its reading level. We conducted an exhaustive evaluation using data sets targeting multiple languages and prediction task types, to compare the proposed model with traditional, state-of-the-art, and other neural network strategies.


Introduction
For decades, readability assessment has been used by diverse stakeholders, from educators to public institutions, for determining the complexity of texts (Benjamin, 2012). Traditional formulas do so by focusing only on superficial linguistic features (e.g., average length of sentences or syllables per word). This leads to criticism, as these formulas do not explore deeper levels of text processing and thus yield rough estimates of complexity (i.e., difficulty) that often lack accuracy (Arfé et al., 2018). In fact, traditional formulas can label a text as "easy to read" even if its content is completely nonsensical (Davison and Kantor, 1982).
To improve the quality of automatic readability assessment, researchers turned to more sophisticated techniques that go beyond examining shallow features. These techniques, typically based on supervised machine learning, incorporate hundreds (even thousands) of features that describe a text from multiple perspectives: syntax, morphology, cohesion, discourse structure, and subject matter (Dell'Orletta et al., 2011; François and Fairon, 2012; Denning et al., 2016; Arfé et al., 2018). The dependency on these numerous features, however, has made readability assessment tools too complex to deploy and apply to languages beyond the one for which they were originally designed. Furthermore, feature and language dependency, along with lack of homogeneity in terms of readability scales, often prevent researchers from comparing new strategies with state-of-the-art counterparts, preventing community consensus on which features are the most beneficial for capturing text complexity (De Clercq and Hoste, 2016).
Existing literature reflects the fact that applications that leverage text complexity analysis, including book recommendation or categorization (Lexile, 2016; Pera and Ng, 2014), Web result summarization (Kanungo and Orr, 2009), and accessibility in the health domain (Bernstam et al., 2005; Fitzsimmons et al., 2010), still favor less precise but easier to implement alternatives, with Flesch as the most accepted choice (Ballesteros-Peña and Fernández-Aedo, 2013; Bea-Muñoz et al., 2015). We argue that this is caused by the uncertainty induced by the lack of uniformity of readability scales, adaptability among readability assessment tools, and benchmarks.
Areas of study that were historically heavily dependent on feature engineering, including sentiment analysis or image processing (Manjunath and Ma, 1996; Abbasi et al., 2008), have made their way towards alternatives that do not involve manually developing features, and instead favor deep learning (Wang et al., 2016). This resulted in more reproducible strategies that are easily portable to other domains or languages, as they only require implementing the structure of a specific neural network and just rely on core components of resources, such as words, signals, or pixels, rather than features specifically designed for a domain or language.
Issues pertaining to readability assessment are not limited to performance and adaptability. As stated by Benjamin (2012), a teacher should never use a readability score blindly when giving a text to a student, as specifics of the difficulties of the reader and the text should always be considered in this process. For this pairing to be successful, it is imperative for readability assessment tools to provide information beyond a single score. The explainability issue has been addressed in systems like Coh-Metrix (Graesser et al., 2011) by showing users the individual values of the features incorporated in the system. This strategy, however, has been criticized by the education community as most features presented are not straightforward to understand for people without background in both computation and linguistics (Elfenbein, 2011). More intuitive explanations could greatly ease the use of readability tools.
In this paper, we present a multilingual automatic readability assessment strategy based on deep learning: Vec2Read. We still follow the premise of words being the core components for a neural network that deals with text. However, in order to avoid the aforementioned domain dependency issue and adapt the architecture to the readability task, we inform our model with part of speech (POS) and morphological tags. This is done by a multiattentive structure that allows the network to filter important words that influence the final complexity level estimation of a text. Apart from informing the network, the multiattentive structure can also be used to offer users further insights on which parts of a text have the most influence for determining its reading level.
Our research contributions include the following:
• We propose a multiattentive recurrent deep learning architecture specifically oriented to the readability assessment task.
• The proposed strategy is, to the best of our knowledge, the first capable of estimating readability in more than two languages.
• We incorporate an attention structure that allows a model to use multiple focuses of attention (with different degrees of importance) to inform word selection.
• We conduct an exhaustive evaluation based on different languages, readability-measuring scales, and data sets of varied sizes, in order to compare the performance of Vec2Read with existing baselines, a comparison that is rarely done in this area due to lack of benchmarks.
• We present an initial analysis on the use of attention mechanisms as a potential alternative for providing explanations for readability.
Task Definition. Given a text t, use model M to predict its reading level. The functionality of M is directly dependent on the characteristics of a data set D used for training: language and readability scale. The scale can be discrete (binary or multilevel) or continuous. Any language is viable; given data set availability, we train M for Basque, Catalan, Dutch, English, French, Italian, and Spanish.

Method
In this section we introduce Vec2Read, a multiattentive recurrent neural network architecture for readability assessment.

General Architecture
The general architecture of Vec2Read (illustrated in Figure 1) is designed to emulate the structure of a text. A text is inherently recurrent, as it is composed of a series of words that depend on each other in order to produce a message. A text is also hierarchical, as it is composed of structural components such as sentences or paragraphs in order to group information. Vec2Read takes into account both characteristics to better capture text structure. Unlike existing hierarchical neural networks that take advantage of both word- and sentence-level recurrent layers (Yang et al., 2016), Vec2Read has a single recurrent layer at word level; hierarchical information is used to generate both word- and sentence-level attention scores for creating a text representation.

Input
Given a text $t$, let the input of Vec2Read be $x = \langle x^w, x^p, x^m \rangle$, where $x^w$, $x^p$, and $x^m$ represent data structures containing a sequence of tokens in $t$, their corresponding POS tags, and morphological tags, respectively. $x^w_i$ refers to the $i$th sentence in $t$ and $x^w_{ij}$ is the $j$th token in $x^w_i$. $x^p_i$ and $x^m_i$ refer to the POS and morphological tag sequences for $x^w_i$, and $x^p_{ij}$ and $x^m_{ij}$ represent the POS and the morphological tags for $x^w_{ij}$. Note that $x^m_{ij}$ contains a set of tags per word rather than a single token or POS label. For instance, given the word plays:
To ease further processing, $x^m_{ij}$ always contains all possible morphological tags considered for the language, assigning a Not applicable (NA) value when the label cannot be applied to the token; for example, tense would have a value of NA for all nouns. The number of tags used is language dependent. (See Section 3.1 for details on the tag set used in the experiments.)
Figure 1: Description of the general architecture of Vec2Read.
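As a rough illustration of the three parallel input structures, the following sketch builds the token, POS, and morphological inputs for a toy two-sentence text. The feature inventory `MORPH_FEATURES`, the tag values, and the helper `pad_morph` are hypothetical and only meant to show the NA-padding convention; they are not the tag set used in the experiments.

```python
from typing import Dict, List

# Hypothetical morphological feature inventory for one language (assumption).
MORPH_FEATURES: List[str] = ["Number", "Person", "Tense"]
NA = "NA"


def pad_morph(tags: Dict[str, str]) -> List[str]:
    """Return one value per feature in MORPH_FEATURES, using NA when a feature does not apply."""
    return [tags.get(feature, NA) for feature in MORPH_FEATURES]


# Toy text with two sentences; every annotation below is illustrative.
x_w = [["The", "cat", "sleeps"], ["Dogs", "bark"]]
x_p = [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]]
x_m = [
    [
        pad_morph({}),
        pad_morph({"Number": "Sing"}),
        pad_morph({"Number": "Sing", "Person": "3", "Tense": "Pres"}),
    ],
    [
        pad_morph({"Number": "Plur"}),
        pad_morph({"Number": "Plur", "Person": "3", "Tense": "Pres"}),
    ],
]
```

Because every token carries the same number of tag slots, the morphological input can later be embedded slot by slot and concatenated into a vector of fixed size.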

Dense Vector Representations
Dense vector representations or embeddings have been shown to be useful for representing discrete values, such as words, in applications dealing with text (Tang et al., 2014; Madrazo Azpiazu et al., 2018). Vec2Read converts all discrete values in $x$ into dense vector representations before feeding them to the model. This is achieved by using a lookup table $\Omega^w \in \mathbb{R}^{v \times d}$ where each row is an embedding for a specific word in the vocabulary, $v$ is the vocabulary size, and $d$ is the number of latent features used for representation. Similarly, lookup tables $\Omega^p$ and $\Omega^m$ are used for representing POS and morphological tags, respectively. $\omega^w_{ij}$ refers to the embedding of $x^w_{ij}$; $\omega^p_{ij}$ to the embedding of the POS tag of $x^w_{ij}$; and $\omega^m_{ij}$ to the embedding that captures the morphological information of $x^w_{ij}$, created by concatenating the representations of each morphological tag in $x^m_{ij}$. $\Omega^w$, $\Omega^p$, and $\Omega^m$ can either be initialized using random uniform distributions and then trained along with the other weights of our model, or be based on pretrained representations (see Section 3.1). Note that representations of each input type are maintained separately and can therefore be of different sizes.
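The lookup step can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the toy vocabularies, the embedding sizes, the uniform range of ±0.1, and the use of a single table for all morphological tag values are assumptions made for the example.

```python
import random
from typing import Dict, List


def make_table(vocab: List[str], dim: int, rng: random.Random) -> Dict[str, List[float]]:
    """Lookup table with one uniformly initialized row per symbol (range is an assumption)."""
    return {symbol: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for symbol in vocab}


rng = random.Random(0)
omega_w = make_table(["dogs", "bark"], dim=4, rng=rng)          # word embeddings
omega_p = make_table(["NOUN", "VERB"], dim=2, rng=rng)          # POS embeddings
omega_m = make_table(["Plur", "3", "Pres", "NA"], dim=2, rng=rng)  # morphological tag values


def embed_morph(tags: List[str]) -> List[float]:
    """Concatenate the embeddings of every morphological tag of one token."""
    vector: List[float] = []
    for tag in tags:
        vector.extend(omega_m[tag])
    return vector


w_vec = omega_w["dogs"]                    # size 4
p_vec = omega_p["NOUN"]                    # size 2
m_vec = embed_morph(["Plur", "NA", "NA"])  # size 3 tags x 2 = 6
```

Keeping the three tables separate is what allows each input type to use its own embedding size.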

Encoding Sentences and Words
A recurrent neural network (RNN) (Grossberg, 1988) is an extension of a traditional neural network where each node in a layer takes as input not only information from the previous layer but also from a node in the same layer located directly next to it. This creates a structure designed to handle sequences like words in a text. Unfortunately, traditional RNNs are prone to the vanishing gradient problem that makes them difficult to train, hindering final performance (Hochreiter, 1991). A long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) addresses a traditional RNN's vanishing gradient problem by using several gates on each RNN cell responsible for storing or forgetting information from the cell state.
Vec2Read uses a bidirectional LSTM network that considers the input sentences in forward and backward directions for creating representations of whole sentences and individual words. We refer to $h^w_i$ as the representation of $x^w_i$, obtained by concatenating the outputs of the final states of the LSTM network in the forward and backward passes; $h^w_{ij}$ is the representation generated by the LSTM network at time step $j$ of sentence $i$ (i.e., for word $x^w_{ij}$), concatenating the outputs of the forward and backward passes.

Textual Representation Layer
A final general representation of $t$, denoted $h_{out}$, is created by aggregating all the encoded word representations generated by the LSTM network (Equation 1). This is done using a weighted sum over $h^w_{ij}$, where the weights are defined by the attention mechanism described in Section 2.6.
In Equation 1, $a_i$ is the attention generated for sentence $i$, $a_{ij}$ is the attention for $x^w_{ij}$, $n_i$ reflects the number of tokens in sentence $i$, and $l$ is the number of sentences in $t$. The denominator is a normalization factor meant to remove the effect of length in texts. This normalization factor is especially important for readability prediction, given that the network could otherwise learn to discriminate texts based mostly on length, due to a strong bias in readability data sets for harder texts to be longer. Informing the model with the length distribution of texts in each reading level could lead to performance improvement in an experimental setting. However, doing so would not allow us to estimate model performance in a real scenario, where text length will rarely follow the distribution seen in training sets. Therefore, we favor a length-independent model.
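To make the role of the normalization factor concrete, the sketch below computes an attention-weighted sum of word states and divides it by the total weight. The exact denominator of Equation 1 is not reproduced in this text, so dividing by the sum of the combined sentence and word weights is an assumption of this example; the function name and toy values are likewise illustrative.

```python
from typing import List

Vector = List[float]


def text_representation(h: List[List[Vector]],
                        sent_att: List[float],
                        word_att: List[List[float]]) -> Vector:
    """Attention-weighted sum of word states, divided by the total weight (assumed normalizer)."""
    dim = len(h[0][0])
    total = [0.0] * dim
    weight_sum = 0.0
    for i, sentence in enumerate(h):
        for j, state in enumerate(sentence):
            weight = sent_att[i] * word_att[i][j]
            weight_sum += weight
            for k in range(dim):
                total[k] += weight * state[k]
    return [value / weight_sum for value in total]


# One sentence with two word states, then the same sentence repeated twice.
short = text_representation([[[1.0, 0.0], [0.0, 1.0]]], [1.0], [[0.5, 0.5]])
longer = text_representation([[[1.0, 0.0], [0.0, 1.0]]] * 2, [1.0, 1.0], [[0.5, 0.5]] * 2)
```

Under this assumption, repeating the same content twice leaves the representation unchanged, which is the length-independence property described above.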

Attention Mechanism
Vec2Read is designed to capture the general structure of $t$ in order to predict its reading level. Although one could argue that the reading level of a text is dependent on every one of its words, text simplification studies (Glavaš and Štajner, 2015; Paetzold and Specia, 2016) indicate that difficulty is generally introduced in a text by specific words and sentences: just a few hard sentences could significantly increase overall text difficulty. Following this intuition, Vec2Read uses an attention-generation mechanism (described in Figure 2) capable of predicting which parts of $t$ have the most influence on its overall difficulty. This way, our model can focus on the important parts of $t$ and provide a more accurate readability estimation.
The attention mechanism of Vec2Read works on two levels: sentence and word. It detects which sentences have the most influence towards determining the reading level of $t$ and also which words are most influential. Each of these two-level predictions is composed of three attentions, oriented to consider the influence of each part of $t$ from three linguistic perspectives: semantic, syntactic, and morphological.
We now describe how the multiattentive mechanism works at word level and then detail how to adapt this model for the sentence-level version.

Word-Level Attention
The word-level attention mechanism consists of three single attention mechanisms that are aggregated. Each individual attention network follows the same structure, a two-layer neural network, differing only in the size of the input and the number of hidden units. We set the number of hidden units proportional to the input length (see Section 3.1 for configuration details). Specifically, each attention score $a^{att}_{ij}$ is computed by applying the two layers in sequence, where $att \in \{w, p, m\}$ is an attention type, $W^{att}$ and $W^{att2}$ are the weights of the first and second network layers, $b^{att}$ and $b^{att2}$ are their respective biases, $s^{att}_{ij}$ is an intermediary representation, and $\sigma$ is a sigmoid activation function. Similar to the model in Figure 1, the inputs for generating semantic and syntactic attention scores are $\omega^w_{ij}$ and $\omega^p_{ij}$. For calculating morphological attention scores, the input is instead the concatenation of each of the morphological tag embeddings in $\omega^m_{ij}$.
After generating a score using each single attention mechanism, Vec2Read aggregates them into one value that will be the final attention score predicted for $x^w_{ij}$. Previous works in feature engineering for readability assessment indicate that not all features are of equal importance for predicting the readability of a text (Dell'Orletta et al., 2011; Gonzalez-Dios et al., 2014). We believe that this phenomenon also applies to attention generation, and therefore each single attention will not contribute equally to the final attention prediction.
To allow our model the flexibility of deciding which attention matters most, we use an attention aggregation strategy that assigns a different weight to each attention. $z = \langle z^w, z^p, z^m \rangle$ is a vector containing the weights corresponding to each attention mechanism, which are automatically estimated during the training phase to allow Vec2Read to learn which attention has the most influence. We constrain the weights to sum to 1 by applying a softmax function to $z$:
$$z_{norm} = \mathrm{softmax}(z)$$
The final attention $a_{ij}$ for $x^w_{ij}$ is calculated as:
$$a_{ij} = \sum_{att \in \{w, p, m\}} z^{att}_{norm} \, a^{att}_{ij}$$
Lastly, we constrain all word attentions in a sentence to sum to 1 using a softmax function.
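The word-level pipeline can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: the text defines only a sigmoid activation, so the sigmoid is used on both layers here, and all function names, weights, and inputs are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values: Vector) -> Vector:
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_attention(inputs: Vector, w1: Matrix, b1: Vector, w2: Vector, b2: float) -> float:
    """Two-layer network mapping one embedding to a scalar attention score."""
    # First layer: intermediary representation (sigmoid activation is an assumption).
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w1, b1)]
    # Second layer: scalar score squashed by a sigmoid.
    return sigmoid(sum(w * s for w, s in zip(w2, hidden)) + b2)


def aggregate(scores: Dict[str, float], z: Dict[str, float]) -> float:
    """Weighted sum of the semantic (w), syntactic (p), and morphological (m) scores."""
    keys = ["w", "p", "m"]
    z_norm = softmax([z[k] for k in keys])  # learned weights constrained to sum to 1
    return sum(weight * scores[k] for weight, k in zip(z_norm, keys))


def sentence_word_attention(aggregated: Vector) -> Vector:
    """Normalize the aggregated scores of one sentence so that they sum to 1."""
    return softmax(aggregated)
```

With equal entries in `z`, `aggregate` reduces to a plain average of the three scores; training is what moves the weights away from this uniform starting point.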

Sentence Level Attention
Sentence-level attention follows the same structure as the word-level attention described in Section 2.6.1, differing only in how the inputs of each single attention network are generated. In this case, for the semantic attention we use the $h^w_i$ vectors already defined in the general architecture (see Figure 1); for syntactic and morphological attentions we feed separate LSTM models using the sequence of syntactic and morphological embeddings in the sentence and use the output of the last recurrent step as input to the attention mechanism. We then normalize sentence-level attentions so that they sum to one using a softmax function.

Output Layer
The output layer of Vec2Read is responsible for mapping h out to a reading level prediction. Two different output layers are used depending on the type of prediction required in each task: discrete or continuous.

Discrete Prediction
To predict a discrete reading level for $t$, Vec2Read generates a probability distribution over the reading levels, $\hat{y} \in [0, 1]^{|c|}$, where $c$ represents the set of possible prediction classes, that is, reading levels. This is achieved by applying a fully connected layer with a softmax activation function to $h_{out}$, which ensures that the probabilities in $\hat{y}$ add up to one:
$$\hat{y} = \mathrm{softmax}(W_{out} h_{out}^{\top} + b_{out})$$
where $W_{out} \in \mathbb{R}^{|c| \times r}$ is the matrix of weights of the fully connected layer, $b_{out}$ is a vector of length $|c|$ containing the biases, $|c|$ is the number of possible reading levels to be predicted, $r$ is the number of latent features in $h_{out}$, and $\top$ refers to the transpose operation. The class that yields the highest probability is the one assigned to $t$.
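A minimal sketch of the discrete output layer follows: an affine map of the text representation, a softmax, and an argmax over classes. The function name `predict_level` and the toy weights are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(values: List[float]) -> List[float]:
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(h_out: List[float],
                  w_out: List[List[float]],
                  b_out: List[float]) -> Tuple[List[float], int]:
    """Return the class probabilities and the index of the most probable reading level."""
    # One logit per reading level: row of w_out (length r) times h_out, plus bias.
    logits = [sum(w * h for w, h in zip(row, h_out)) + b
              for row, b in zip(w_out, b_out)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Three reading levels (|c| = 3) and a representation with r = 2 latent features.
probs, best = predict_level([2.0, 0.0],
                            [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                            [0.0, 0.0, 0.0])
```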

Continuous Prediction
When the reading level of $t$ is defined as a continuous value, Vec2Read generates a real value $\hat{y} \in [y_{min}, y_{max}]$, where $y_{min}$ and $y_{max}$ refer to the minimum and maximum readability scores possible in the scale used. This is achieved by applying a fully connected layer with a min-max leaky rectified linear unit as activation function.
The leaky version of this function is favored given its benefits in terms of avoiding neuron death during training (Xu et al., 2015a).
In this layer, $W_{out}$ is the matrix of weights of the fully connected layer, $b_{out}$ is a bias, $r$ is the number of latent features in $h_{out}$, $\top$ refers to the transpose operation, and $\epsilon$ is a constant set to 0.001 during training and to 0 during prediction.
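One plausible reading of the min-max leaky rectified linear unit is sketched below. The functional form is an assumption, since the defining equation is not reproduced in this text: the value passes through unchanged inside the scale, and outside it the excess is scaled by the small constant. The function name is hypothetical.

```python
def min_max_leaky_relu(x: float, y_min: float, y_max: float, eps: float) -> float:
    """Identity inside [y_min, y_max]; outside, the excess is scaled by eps (assumed form)."""
    if x < y_min:
        return y_min + eps * (x - y_min)
    if x > y_max:
        return y_max + eps * (x - y_max)
    return x


# Training-time behavior: a small slope keeps a nonzero gradient outside the range.
train_value = min_max_leaky_relu(150.0, 0.0, 100.0, eps=0.001)
# Prediction-time behavior: eps = 0 turns the function into a hard clip to the scale.
predict_value = min_max_leaky_relu(150.0, 0.0, 100.0, eps=0.0)
```

Under this reading, setting the constant to 0 at prediction time guarantees that every output lies within the readability scale, while the nonzero value during training keeps out-of-range units from receiving a zero gradient.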

Fitting Parameters
For fitting the parameters of our model we use stochastic gradient descent. This strategy computes the prediction of our model given specific data, and compares it to the actual objective value using an error or loss function. The goal is to minimize this error; to that end, a gradient is backpropagated to each of the parameters in the model, which are then updated in a direction that reduces the overall prediction error. As the objective function for training the model, we consider two different loss functions, depending on how the reading level is estimated.
For continuous predictions, we instead use mean square error (MSE):
$$MSE = \frac{1}{|D|} \sum_{d \in D} (\hat{y}_d - y_d)^2$$
where $D$ is a collection of texts in a given data set, $|D|$ is the number of documents in $D$, and $\hat{y}_d$ and $y_d$ are the prediction generated by our model for document $d$ and its ground truth, respectively.
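For reference, the continuous loss and the related root mean square error used later for evaluation can be computed as in this short sketch. The function names are hypothetical, and the values in the usage comment are toy numbers rather than results.

```python
import math
from typing import Sequence


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean of the squared differences between predictions and ground-truth values."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(predictions)


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Square root of the MSE, expressed in the units of the readability scale."""
    return math.sqrt(mse(predictions, targets))


# Toy usage: errors of 10 and 20 points give mse([50, 70], [40, 90]) == 250.0
```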

Experiments and Discussion
In this section, we first describe model configuration. We then outline data sets and baselines considered for evaluation purposes. Lastly, we discuss the results of the analysis conducted to verify the overall performance of Vec2Read and showcase the validity of its attention mechanism.

Model Setup
We describe Vec2Read's configuration; parameters were empirically determined using a hold-out set as described in Section 3.4.
Optimization. For fitting the parameters of our model, we used Adaptive Moment Estimation (Adam) (Kingma and Ba, 2014) with a learning rate of 0.001.

Initializations. For $\Omega^w$ we used a pretrained version of word embeddings, which were trained using a skip-gram algorithm on Wikipedia documents, as described in Bojanowski et al. (2017). All the remaining weights and biases of our model, as well as initial states of LSTM layers, were initialized using a random uniform distribution.
Dimensions. The numbers of hidden units in the semantic, syntactic, and morphological LSTM networks were empirically set to 128, 32, and 64, respectively. The dimensions of the embedding representations were set to 300, 16, and 16. Given that the input of the morphological attention combines multiple embeddings corresponding to the morphological labels used, the final dimension of $\omega^m_{ij}$ is $u \times 16$, where $u$ is the number of tags used.

Data Sets
For assessment and analysis purposes, we use several data sets based on both expert-labeled [...] (2016), generated using crowd-sourcing techniques. It consists of 105 documents in both English and Dutch, each labeled with a score in the 0-100 range that indicates its complexity.
Newsela. Newsela is an instructional content platform that provides reading materials for classroom use. As part of their research program, Newsela makes available a sample of their labeled corpora, which we use for evaluation. The data set consists of 10,786 documents distributed among grade levels 2-12 (around 1,200 per level for English and 120 for Spanish). We excluded from our experiments grade levels 2, 10, and 11, as the number of documents for those levels is significantly lower than for the other classes (284, 11, and 2, respectively, for English).
Ikasbil. Ikasbil (2018) is an online resource for learning Basque containing articles leveled following the Common European Framework of Reference for Languages. Using this source, we created a data set consisting of 5 reading levels (A2, B1, B2, C1, and C2), with 200 documents per level. Level A1 was omitted due to insufficient documents.
Wizenoze. Data set provided by Wizenoze (2018), an online platform dedicated to easing the retrieval of (curated) resources suitable for the classroom setting. The data set consists of 2,000 documents in English and Dutch, equally distributed and labeled using a 5-level readability scale (1-5).

Compared Strategies
We now describe the strategies considered in our assessment, including traditional formulas, state-of-the-art tools based on extensive feature engineering, and neural network structures intended for an ablation study on major components of Vec2Read.

Traditional Strategies
Flesch. Even if simple, Flesch (1948) remains one of the most used readability formulas and is therefore treated as a baseline by authors of publications pertaining to readability estimation. In addition to the traditional version for English texts, we consider language-specific adaptations (Kandel and Moles, 1958; Fernández Huerta, 1959; Douma, 1960; Lucisano and Piemontese, 1988). We followed the framework used in Madrazo Azpiazu (2017), which maps the Flesch score of a given text $t$ into a binary value (simple or complex) based on its distance to the average Flesch score computed using the training documents for the respective classes.
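The baseline can be illustrated with the English Flesch Reading Ease formula and a nearest-class-average rule. This is a sketch of the idea rather than the framework's actual code: word, sentence, and syllable counts are taken as given (syllable counting is language specific), the language-specific adaptations use different constants that are not shown, and the class averages in the usage line are made-up values.

```python
from typing import Dict


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """English Flesch Reading Ease: higher scores indicate easier texts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def nearest_class(score: float, class_means: Dict[str, float]) -> str:
    """Assign the class whose average training score is closest to the given score."""
    return min(class_means, key=lambda label: abs(score - class_means[label]))


# 100 words, 5 sentences, 150 syllables: 206.835 - 20.3 - 126.9 = 59.635
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
label = nearest_class(score, {"simple": 70.0, "complex": 40.0})  # hypothetical averages
```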

State-of-the-Art Strategies
S1. The system proposed by De Clercq and Hoste (2016) is the only one designed for readability assessment for more than one language: Dutch and English. Its design consists of a support vector machine that uses ad hoc features to capture varied linguistic characteristics of texts (e.g., syntax or semantics). Given that the algorithm implementation is not publicly available, comparisons against this strategy are based on results reported in De Clercq and Hoste (2016).

S2. A multilevel Basque readability assessment strategy that relies on a random forest and linguistic features with a major emphasis on morphology and syntax (Madrazo, 2014). The authors provided their data set (including cross-validation folds) for comparison purposes. Because of lack of implementation availability, comparisons against S2 are limited to the Basque language.
S3. Similar to S2, the strategy introduced in Madrazo Azpiazu (2017) also relies on a random forest and linguistic features. Given implementation availability, we adapted it to run on all discrete and continuous prediction tasks by changing its linguistic annotation tools. For fairness in the comparison, we used the same linguistic annotation tools used by Vec2Read (described in Section 3.1). S1, S2, and S3 are treated as examples of feature-engineered state-of-the-art strategies.

Ablation Study Strategies
To determine the utility of each feature incorporated in the architecture of Vec2Read, we consider several variations of Vec2Read in the assessment.

FC. A two-layer fully connected neural network with 256 hidden units, taking as input the average of the word embeddings of all words in a text.
¬Attention. Basic architecture of Vec2Read. It maintains Vec2Read's hierarchical and recurrent structure, but overrides the output of its attention generation mechanism by assigning each word and sentence a uniformly distributed attention score.
¬Word, ¬Sent, ¬Sem, ¬Syn, and ¬Morph. Vec2Read architecture without word-level, sentence-level, semantic, syntactic, and morphological attention, respectively.

Experimental Setup
We followed a 10-fold cross-validation framework for measuring the performance of each strategy considered. A disjoint stratified 10% of the data in SimpleWiki (including both simple and complex texts) was excluded from the experiments and used for development and hyper-parameter tuning purposes. Note that to abide by the adaptability premise intended for our model, we only tuned hyper-parameters for English. Doing so allows us to understand to what extent the model can directly transfer to other languages without language-specific tuning, thus simulating a real-world scenario for tool adaptation.
To conduct fair comparisons, we used the same cross-validation folds across experiments (when possible, we used the folds made publicly available; otherwise we reran strategies using our data and folds). The only exceptions are the experiments related to S1, for which we could only access the original data set. Consequently, we compare our results with respect to those published in De Clercq and Hoste (2016).
Table 2: Performance comparison among traditional, state-of-the-art, ablation strategies, and Vec2Read on different data sets. '*' denotes statistically significant improvement over counterparts (Flesch, S1, S2, S3, Vec2Read). Accuracy (higher is better) is reported for all data sets except for MTDE, where RMSE (lower is better) is used in order to be able to compare with S1. Cells marked with '-' denote that the strategy is not applicable to the data set.

Overall Performance
As mentioned by De Clercq and Hoste (2016), each work in the readability area interprets the readability estimation task in a different manner, using different languages and data sets, often making the community unable to compare proposed tools with each other. In order to best contextualize the performance of Vec2Read, we consider a broad set of tasks using data sets of varied (i) size, ranging from 105 to 262,918 documents, (ii) language, considering seven languages, and (iii) prediction type, namely, binary, multilevel, and continuous predictions.
To quantify the performance of different readability estimation alternatives, we use accuracy for classification tasks and Root Mean Square Error (RMSE) for regression tasks. Table 2 summarizes the results obtained by Vec2Read and its counterparts on the aforementioned data sets. As we followed a 10-fold cross-validation framework, scores in Table 2 correspond to the averages over the 10 folds. Statistical significance was tested using a paired t-test with a significance level of p < 0.05.
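The significance test over per-fold scores can be sketched as below. The function name is hypothetical and no real fold scores are shown. With 10 folds there are 9 degrees of freedom, for which the two-sided critical value at the 0.05 level is approximately 2.262.

```python
import math
from typing import Sequence

# Two-sided critical value of Student's t for 9 degrees of freedom at alpha = 0.05.
T_CRITICAL_DF9 = 2.262


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic of the per-fold differences between two strategies."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length with at least two folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


def significant_at_10_folds(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True when the difference over exactly 10 paired folds is significant at p < 0.05."""
    if len(scores_a) != 10:
        raise ValueError("the critical value used here only applies to 10 folds")
    return abs(paired_t_statistic(scores_a, scores_b)) > T_CRITICAL_DF9
```

The pairing matters here: because both strategies are evaluated on the same folds, the test is run on the fold-wise differences rather than on two independent samples.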
General Discussion. As anticipated, we observe that traditional formulas (Flesch) yield the lowest performance, followed by the general-purpose neural network approach (FC). This validates our hypothesis that a neural network that simply considers words without considering text structure or other linguistic features is not enough for readability assessment. Further, models that consider richer traits of text, such as Vec2Read and its attention-less version (¬Attention), are consistently comparable to or outperform state-of-the-art strategies (S1, S2, S3), demonstrating the validity of the proposed architecture. Vec2Read achieved statistically significantly lower performance for only 1 out of 14 tasks (defined as a data set-language pair) in our evaluation. We attribute this to the size of the data set, which only includes 105 texts. It is to be expected that a strategy based on feature engineering such as S1, which has been specifically designed for Dutch, outperforms a neural-network-based counterpart (such as Vec2Read), as the latter is known to need large amounts of data for best performance.
Data set size. The number of instances used for training has a strong effect on the overall performance of Vec2Read. All the analyzed strategies generate lower scores for smaller data sets; the performance drop is more prominent among the strategies based on deep learning (Vec2Read and all the ablation strategies). We attribute this behavior to the higher variance of deep learning models, which need more data than feature-engineered models to achieve good generalization. In addition, we also note that the attention mechanism becomes more useful the larger the data set, and its effect is negligible in small data sets such as MTDE.
Language and task type. We observe no emerging patterns in terms of performance induced by the language or the type of task. One could argue that results for English are in general higher, although we attribute these differences to data set size (English data sets are in general larger) rather than to the language itself. Accuracy scores for multilevel estimation are lower than for binary, which is expected, as it is harder for a model to learn readability predictions for scales that go beyond just simple or complex.
Ablation study. By comparing Vec2Read with its attention-less counterpart (¬Attention) we can conclude that the proposed multiattentive mechanism has indeed a positive effect for readability prediction. In 11 out of 14 tasks the multiattentive mechanism achieved statistically significant improvements over ¬Attention; for the remaining 3 tasks (VikiWiki-EU, MTDE-EN, MTDE-DU) there was no statistically significant difference. The usefulness of the attention mechanism is influenced by the size of the data set: the larger the data set, the more prominent the improvement obtained by the model using the attention mechanism. We also notice that the effect of using the morphological attention is insignificant for certain languages such as English, whereas it is more prominent in other languages, a fact we attribute to the low morphological diversity of English.

Attention Mechanism
As outlined in Section 3.5, the attention mechanism of Vec2Read leads to improvement in prediction performance. In this section, we aim to shed light on what the attention mechanism is actually learning to do and whether this information could be used for explaining the estimated reading levels from a more qualitative standpoint. Even if attention mechanisms are used in manifold applications, there exists no defined framework for evaluating their behavior. Instead, researchers focus on finding explanations of what the mechanism is learning (Hermann et al., 2015;Xu et al., 2015b). For this reason, the following discussion is not intended to be conclusive but instead to provide initial results meant to be inspirational for future work on readability prediction explainability.
In order to illustrate the parts of a text that receive the most attention from Vec2Read, we show in Table 3 the top-3 words, POS, and morphological tags that score the highest attention level for each individual task. We observe that the words that receive the most attention are in general words that are not frequently used by an average speaker, and therefore can present a challenge for the reader. We also observe that conjunctions (used for making sentences longer) are consistently among the most influential POS tags and that subjunctive mood, passive voice, and specific verb forms, such as infinitive or participle, are considered important by our model. Both the use of conjunctions and passive voice align with features already found positive in the readability literature (François and Fairon, 2012; Gonzalez-Dios et al., 2014), leading us to infer that the attention mechanism is learning valid assumptions for detecting which parts of a text are most influential for readability prediction.
One of the benefits of using a multiattentive mechanism rather than a traditional attention mechanism that considers all features at once is that the model can adapt and give more importance to specific datapoints depending on the task. In order to illustrate how Vec2Read takes advantage of this functionality, we show in Table 4 the weights assigned by the attention mechanism for each task (i.e., z_norm). We observe that higher weight is assigned to semantics when the data set is large, whereas syntax is more relevant for smaller data sets. This behavior depicts the adaptability of our model, which uses more generalizable information, such as POS tags, when data is scarce and takes advantage of more fine-grained information, such as words, when data is abundant. Weights for morphological and syntactic attention are similar for most of the tasks with the exception of English, where morphology receives a lower weight compared with other languages. We attribute this phenomenon to English being a morphologically poor language.
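To make the role of the per-perspective weights concrete, the following minimal sketch combines three per-token attention vectors into a single score. It is not the Vec2Read formulation. We assume, purely for illustration, that the normalized weights are obtained with a softmax over raw weights and that the combined score is a weighted sum; the names softmax and combine_attention and all numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_attention(per_view_scores, raw_view_weights):
    """Combine per-token attention from several linguistic perspectives.

    per_view_scores: dict mapping a perspective name (e.g. "word", "pos",
        "morph") to a list with one attention score per token.
    raw_view_weights: dict mapping the same names to unnormalized weights.

    Returns (normalized weights, combined per-token scores). Softmax
    normalization and the weighted sum are assumptions of this sketch.
    """
    views = sorted(per_view_scores)
    lengths = {len(per_view_scores[v]) for v in views}
    if len(lengths) != 1:
        raise ValueError("all perspectives must score the same tokens")

    normalized = dict(zip(views, softmax([raw_view_weights[v] for v in views])))
    n_tokens = lengths.pop()
    combined = [
        sum(normalized[v] * per_view_scores[v][i] for v in views)
        for i in range(n_tokens)
    ]
    return normalized, combined


# Toy example with two tokens; all values are invented.
scores = {"word": [0.6, 0.4], "pos": [0.3, 0.7], "morph": [0.0, 1.0]}
weights = {"word": 2.0, "pos": 0.5, "morph": 0.5}
z_norm, combined = combine_attention(scores, weights)
```

Under these assumptions, a task whose learned word weight is high yields combined scores dominated by the word-level attention, which mirrors the behavior described for large data sets.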
Consider Figures 3 and 4, two examples of attentions generated by Vec2Read. Figure 3 showcases the combined attention scores a_ij predicted by Vec2Read for a text snippet extracted from the English Wikipedia document about Qatna. The model used for predicting the attentions was trained using the SimpleWiki data set. In this example, we see that Vec2Read mostly focuses on complex nouns and adjectives, and tends to ignore less informative words, such as determiners.
Figure 4 shows the attentions generated for a sentence in Spanish by Vec2Read trained using Spanish VikiWiki. This example is meant to illustrate the ''extra'' information that can be obtained from a multiattentive mechanism, not only by showing which of the words are important for estimating text difficulty, but also by hinting at why they influence the process. As captured in Figure 4, the connector Consequentemente (Consequently) is most important from a syntactic perspective, whereas the sequence fue cerrado (was closed) is more important from a morphological standpoint.
Manual analysis of the attention scores led us to identify which parts of a text the model is focusing on. This initial examination reveals that the model is indeed learning linguistic patterns known to be important for defining the difficulty of a text, as opposed to stylistic biases caused by how the data sets were generated. This also serves as an indication of the validity of using crowd-sourced data sets, such as SimpleWiki and VikiWiki, for training purposes.
We found many examples where the multiattentive mechanism yielded interesting outputs; however, we also found some deficiencies that we would like to highlight. Although connectors like Consequentemente were detected correctly by Vec2Read, other commonly used connectors, such as sin embargo (nevertheless) or a pesar de ello (nonetheless), were not detected correctly given their multi-word structure. This indicates that word-level attentions might not be enough for some languages, and suggests the need to consider more sophisticated structures, such as dependency trees, as well as other syntactic and morphological features of the text, in the future.

Related Work
Literature on automatic readability assessment is rich, not only in the languages to which existing strategies can be applied, but also in the diversity of linguistic perspectives that have been explored (Benjamin, 2012;Arfé et al., 2018).
Feature engineering has been the main focus in the readability assessment area. Techniques that exploit shallow features (e.g., number of syllables per word and average sentence length) remain a prominent strategy for estimating complexity levels of texts in diverse languages (Flesch, 1948;Spaulding, 1956;Al-Ajlan et al., 2008) and show better prediction capabilities than more sophisticated features when considered individually (Feng et al., 2010). Language models have also proved useful for determining the reading level of a text (Schwarm and Ostendorf, 2005). Features capturing the syntax of a text have been demonstrated to be of great importance, as illustrated by Karpov et al. (2014), who built a system that relies heavily on features based on POS tags and the syntactic dependency tree of a text. Structural features may not influence text complexity estimation for languages like Chinese, which is why some researchers favor analyzing lexical representations (i.e., term frequencies; Chen et al., 2011). Even if not for most languages, morphological features have also been shown to be of great importance in terms of influencing the complexity level of texts written in languages known to be morphologically rich, such as Basque (Gonzalez-Dios et al., 2014). For considering semantic information in a text, existing works incorporate features related to true or false cognates, as a manner to better capture text difficulty for non-native readers (François and Fairon, 2012), or measure the coherence of the text based on graphical models (Mesgar and Strube, 2015, 2016). Unlike the aforementioned techniques, which rely on engineering features for specific languages and tasks, Vec2Read uses a deep learning strategy that automatically detects patterns related to readability. Historically, readability assessment tools have been designed and evaluated in one language. To the best of our knowledge, only De Clercq and Hoste (2016) evaluate readability assessment performance in more than one language (i.e., Dutch and English) with the purpose of comparing the importance of features in each language. As presented in this paper, we go beyond two languages and instead quantify the performance of Vec2Read in seven different languages.
[Figure caption fragment] Magnitudes are provided by the attention mechanism and the polarities are determined by the readability prediction generated when using each word as input to Vec2Read.
Attention mechanisms have been used with great success in several domains, including image classification (Xu et al., 2015b), question answering (Hermann et al., 2015), and automatic text translation (Bahdanau et al., 2014). The attention mechanism proposed for Vec2Read differs from the counterparts applied to the aforementioned tasks in that it provides a composed attention score that can be decoupled to further analyze the influence individual words have on the overall complexity of a text from different linguistic perspectives.

Conclusion
We introduced Vec2Read, a multiattentive recurrent neural network architecture designed for automatic multilingual readability assessment. Vec2Read takes advantage of deep learning techniques by incorporating a multiattentive mechanism that allows the system to consider the words and sentences that most influence the reading level of a text. We demonstrated the validity of our proposed architecture by conducting an exhaustive analysis using data sets in seven different languages and comparing Vec2Read to traditional, state-of-the-art, and other neural network architectures. Moreover, we outlined the benefits of this type of architecture for readability assessment, including the interpretability of the predictions using the attention scores. This research work sets the foundations for language-agnostic readability assessment, demonstrating that it is indeed possible to design a readability assessment strategy that works regardless of the language. This is achieved by disregarding hand-engineered features, historically known to be tedious to create and test, in favor of using simple tokens as input. We anticipate that, given the magnitude and the diversity of the evaluation conducted, we have set a new baseline in the readability area, considerably harder to beat than the popularly used Flesch. This is supported by (i) the use of data sets in multiple languages that can, for the most part, be easily obtained and (ii) the release of our algorithm, so that other researchers can run it for comparison purposes. We expect this will help an area that is currently crowded with hard-to-compare systems finally progress towards more precise, usable, and comparable tools.
In the future, our research will focus on generating more valuable explanations of what influences the readability of a text, as well as on enhancing our model so that it can be trained jointly for multiple languages or can benefit from cross-lingual data in order to improve performance in languages with small corpora. We also plan to experiment with character-based models, which could potentially take advantage of the morphological information of texts without the need for a morphological tagger.