Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages

Most combinations of NLP tasks and language varieties lack in-domain examples for supervised training because of the paucity of annotated data. How can neural models make sample-efficient generalizations from task-language combinations with available data to low-resource ones? In this work, we propose a Bayesian generative model for the space of neural parameters. We assume that this space can be factorized into latent variables for each language and each task. We infer the posteriors over such latent variables based on data from seen task-language combinations through variational inference. This enables zero-shot classification on unseen combinations at prediction time. For instance, given training data for named entity recognition (NER) in Vietnamese and for part-of-speech (POS) tagging in Wolof, our model can perform accurate predictions for NER in Wolof. In particular, we experiment with a typologically diverse sample of 33 languages from 4 continents and 11 families, and show that our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods; it increases performance by 4.49 points for POS tagging and 7.73 points for NER on average compared to the strongest baseline.


Introduction
Transfer learning is a toolbox to extract knowledge from a source domain and perform sample-efficient generalizations in a target domain (Talmor and Berant, 2019). In practice, this approach holds promise to mitigate the data scarcity issue which is inherent to a large spectrum of NLP applications (Täckström et al., 2012; Agić et al., 2016; Ammar et al., 2016; Ziser and Reichart, 2018, inter alia).
In the most extreme scenario, zero-shot learning, no annotated examples are available for the target domain. For instance, zero-shot transfer across languages leverages information from resource-rich languages to tackle the same task in a previously unseen target language (Lin et al., 2019; Artetxe and Schwenk, 2019; Ponti et al., 2019a, inter alia). Zero-shot transfer across tasks within the same language, on the other hand, verges on the impossible, because no information is provided for the target classes or scalars (in the case of regression). Hence, cross-task transfer has mostly focused on the few-shot or joint learning settings.
In this work, we show how the neural parameters for a particular unseen task-language combination can be approximated by zero-shot transferring knowledge both from related tasks and from related languages, i.e., from seen combinations. For instance, the availability of a model for part-of-speech (POS) tagging in Wolof and of a model for named-entity recognition (NER) in Vietnamese supplies plenty of information which should be effectively harnessed to estimate the parameters of a Wolof NER model.
As our main contribution, we introduce a generative model of a neural parameter space, which is factorized into latent variables for each language and each task. All possible task-language combinations give rise to a task × language × parameter tensor. While some entries could be populated through supervised learning, completing the empty portion of such a tensor is less straightforward than standard matrix completion methods for collaborative filtering (Dziugaite and Roy, 2015), as the parameters are never observed. Rather, in our approach the interaction of the latent variables determines the parameters, which in turn determine the data likelihood.
We adopt a Bayesian perspective towards inference on the proposed generative model. The posterior distribution over latent variables is approximated through stochastic variational inference (VI), based on the mean-field assumption (Hoffman et al., 2013). We choose multivariate Gaussians as a variational family for the latent variables. Given the enormous number of parameters, however, a full co-variance matrix cannot be stored in memory. Hence, we explore a factor co-variance structure that is expressive while remaining tractable.
We evaluate the model on two related sequence tagging tasks, POS tagging and NER, relying on a typologically representative sample of 33 languages from 4 continents and 11 families. The results indicate that the proposed neural parameter space factorization method surpasses standard baselines based on cross-lingual transfer 1) from the (typologically) nearest source language; and 2) from the source language with the most abundant in-domain data (usually English). The average gains over the strongest baseline are 4.49 points for POS tagging and 7.73 points for NER. The code is available at: github.com/cambridgeltl/parameter-factorization.

A Bayesian Generative Model of Parameter Space
The annotation efforts in NLP have achieved impressive feats, such as the Universal Dependencies (UD) project (Nivre et al., 2019), which now covers over 70 languages. Yet, these account for only a meagre subset of the world's 8,506 languages according to Glottolog (Hammarström et al., 2016). At the same time, the ACL Wiki lists 24 separate language-related tasks. The lack of costly and labor-intensive labelled data for many languages in such tasks hinders the development of computational models for the majority of the world's languages (Snyder and Barzilay, 2010; Ponti et al., 2019a).
In this work, we propose a Bayesian generative model of multi-task, multi-lingual NLP. We train one Bayesian neural network for several tasks and languages jointly. The core modeling assumption is that the parameter space of the neural network is structured, that is, that certain parameters correspond to certain tasks and others correspond to certain languages. This structure allows us to generalize to unseen task-language pairs. The model, which is reminiscent of matrix factorization for collaborative filtering (Dziugaite and Roy, 2015), is presented in Figure 1.
Formally, we consider a set of $n$ tasks $T = \{t_1, \dots, t_n\}$ and a set of $m$ languages $L = \{l_1, \dots, l_m\}$. The variational family of each task variable and language variable is a multivariate Gaussian with mean $\mu \in \mathbb{R}^h$ and diagonal co-variance $\sigma^2 I \in \mathbb{R}^{h \times h}$, where $h$ is the dimensionality of the Gaussian. Consequently, $t_i \sim \mathcal{N}(\mu_{t_i}, \sigma^2_{t_i} I)$ and $l_j \sim \mathcal{N}(\mu_{l_j}, \sigma^2_{l_j} I)$, respectively. The space of parameters for all tasks and languages forms a tensor $\Theta \in \mathbb{R}^{n \times m \times d}$, where $d$ is the number of parameters of the largest model. We denote with $\theta_{ij} \in \mathbb{R}^d$ the parameters of the model for the $i$-th task and the $j$-th language. These parameters are also considered to be sampled from a multivariate Gaussian, whose mean $\mu_{\theta_{ij}} \in \mathbb{R}^d$ and diagonal co-variance $\sigma^2_{\theta_{ij}} I \in \mathbb{R}^{d \times d}$ are functions of the corresponding task and language latent variables, i.e. $f_\psi(t_i, l_j)$ and $f_\phi(t_i, l_j)$, respectively.
The likelihood of the classes $y_k$ for the $k$-th sentence $x_k$ of a task-language pair $(t_i, l_j)$ is given by $p(\cdot \mid x_k, \theta_{ij})$. In general, we will only possess data for a subset $S$ of the Cartesian product of all tasks and languages, $T \times L = S \cup U$, and not for the unseen pairs $U$. However, as we estimate all task-language parameter vectors $\theta_{ij}$ jointly, our model allows us to perform inference over the parameters for combinations in $U$ as well. Intuitively, if data for NER in Basque and POS tagging in Kazakh is provided, our model assigns posterior distributions over 2 language variables (Basque and Kazakh) and 2 task variables (NER and POS) based on such data. Afterwards, they can be recombined to make predictions for NER in Kazakh and POS tagging in Basque. A summary of the generative story of how we hypothesize the data 'came into being' is offered in Algorithm 1.
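Algorithm 1: Generative model of the data.
  for $t_i \in T$: sample $t_i \sim \mathcal{N}(\mu_{t_i}, \sigma^2_{t_i} I)$
  for $l_j \in L$: sample $l_j \sim \mathcal{N}(\mu_{l_j}, \sigma^2_{l_j} I)$
  for $t_i \in T$ and $l_j \in L$:
    $\mu_{\theta_{ij}} = f_\psi(t_i, l_j)$ and $\sigma^2_{\theta_{ij}} = f_\phi(t_i, l_j)$
    sample $\theta_{ij} \sim \mathcal{N}(\mu_{\theta_{ij}}, \sigma^2_{\theta_{ij}} I)$
    for each sentence $x_k$ of the pair: sample $y_k \sim p(\cdot \mid x_k, \theta_{ij})$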
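The generative story can be made concrete with a small simulation. The sketch below is purely illustrative and uses only the Python standard library: the function names, the random linear map standing in for $f_\psi$, the fixed standard deviation of 0.1, the random vectors standing in for BERT encodings, and the shared number of classes `c` are all assumptions made for the example, not part of the model described above.

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from N(mean, diag(std^2))."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def generate(tasks, languages, h, e, c, n_tokens, rng):
    """Toy generative story: latent variables -> parameters -> labels."""
    d = e * c + c  # weights plus biases of one affine classifier
    # Hypothetical stand-in for f_psi: a fixed random linear map R^{2h} -> R^d.
    proj = [[rng.gauss(0.0, 0.1) for _ in range(2 * h)] for _ in range(d)]
    t_lat = {t: sample_gaussian([0.0] * h, [1.0] * h, rng) for t in tasks}
    l_lat = {lang: sample_gaussian([0.0] * h, [1.0] * h, rng) for lang in languages}
    data = {}
    for t in tasks:
        for lang in languages:
            feats = t_lat[t] + l_lat[lang]
            mu = [sum(w * f for w, f in zip(row, feats)) for row in proj]
            theta = sample_gaussian(mu, [0.1] * d, rng)  # theta_ij
            weights = [theta[k * e:(k + 1) * e] for k in range(c)]
            bias = theta[c * e:]
            labels = []
            for _ in range(n_tokens):
                # Random vector standing in for the encoding of one token.
                x = [rng.gauss(0.0, 1.0) for _ in range(e)]
                scores = [sum(w * v for w, v in zip(weights[k], x)) + bias[k]
                          for k in range(c)]
                labels.append(rng.choices(range(c), weights=softmax(scores))[0])
            data[(t, lang)] = labels
    return data
```

Calling `generate(["POS", "NER"], ["wo", "vi"], h=4, e=3, c=5, n_tokens=6, rng=random.Random(0))` yields one label sequence per task-language pair, including pairs that would be unseen during training.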

Inference and Prediction
In order to perform inference on the generative model outlined in Section 2, we take a Bayesian perspective. This enables smooth estimates of the posteriors for typically under-specified models such as neural networks (Garipov et al., 2018). In particular, we resort to stochastic Variational Inference based on batch-level optimization (Hoffman et al., 2013). In Section 3.1 we derive the Evidence Lower Bound (ELBO) for VI on the generative model of Figure 1. Then, in Section 3.2, we detail how we implement such a model through a neural network. Finally, in Section 3.3, we explore several co-variance structures for the Gaussian distributions of the latent variables, including a diagonal matrix and a low-rank factor matrix.

ELBO Derivation
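In order to perform inference through the hierarchical Bayesian model described in Algorithm 1, we need to estimate the joint posterior over the latent variable sets $\theta$, $t$, and $l$. The posterior given the observed data $x$ is shown in Equation (1), which factorizes according to the independence assumption $Y \perp\!\!\!\perp (T, L) \mid \Theta$ ingrained in the graphical model of Figure 1:

$$p(\theta, t, l \mid x) = \frac{p(x \mid \theta)\, p(\theta \mid t, l)\, p(t)\, p(l)}{p(x)} \tag{1}$$

Unfortunately, the term $p(x)$ in the denominator of Equation (1) is intractable. Therefore, we introduce an approximate posterior $q(\theta, t, l) = q(\theta \mid t, l)\, q(t)\, q(l)$ and, by integrating out the latent variables, we derive the lower bound for the log-probability of $x$ in Equation (2):

$$\log p(x) \geq \mathbb{E}_{q(\theta, t, l)}\big[\log p(x \mid \theta)\big] - \mathrm{KL}\big(q(t) \,\|\, p(t)\big) - \mathrm{KL}\big(q(l) \,\|\, p(l)\big) - \mathbb{E}_{q(t)\, q(l)}\Big[\mathrm{KL}\big(q(\theta \mid t, l) \,\|\, p(\theta \mid t, l)\big)\Big] \tag{2}$$

A way to interpret the ELBO is that maximizing it minimizes the variational gap, which is the KL-divergence between the approximate joint posterior and the true joint posterior. To see this, note that the difference between the two sides of Equation (2) is exactly this divergence:

$$\log p(x) - \mathrm{ELBO} = \mathrm{KL}\big(q(\theta, t, l) \,\|\, p(\theta, t, l \mid x)\big)$$

Note that Equation (2) contains an expected log-likelihood term that has to be estimated by sampling and optimized through gradient descent, and 3 KL-divergence terms that have analytical solutions, provided that the prior is a standard multivariate Gaussian $p(\cdot) = \mathcal{N}(0, I)$. For an approximate posterior $\mathcal{N}(\mu, \sigma^2 I)$ with diagonal co-variance, each of them takes the form:

$$\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{i} \big(\mu_i^2 + \sigma_i^2 - \ln \sigma_i^2 - 1\big)$$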
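As a sanity check on the analytical KL terms, the following sketch computes the closed-form divergence between a diagonal Gaussian and the standard Gaussian prior and compares it against a Monte Carlo estimate. It relies only on the Python standard library; the function names are hypothetical and the numbers in the usage note are arbitrary.

```python
import math
import random


def kl_diag_to_standard(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def kl_monte_carlo(mu, sigma, n_samples, rng):
    """Estimate the same KL as the average of log q(z) - log p(z), z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        for m, s in zip(mu, sigma):
            z = m + s * rng.gauss(0.0, 1.0)
            log_q = -math.log(s) - (z - m) ** 2 / (2.0 * s * s)
            log_p = -z * z / 2.0
            total += log_q - log_p  # the 0.5 * ln(2 pi) constants cancel
    return total / n_samples
```

For instance, with `mu = [0.5, -1.0]` and `sigma = [0.8, 1.5]` the two functions should agree up to sampling noise, and the divergence vanishes when the approximate posterior coincides with the prior.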

Neural Model
Given the recent success of approaches for zero-shot cross-lingual transfer such as multilingual BERT (i.e., M-BERT; Pires et al., 2019), MULTIFIT (Eisenschlos et al., 2019) and XLM (Lample and Conneau, 2019), we adopt a similar neural network architecture with a classifier stacked on top of an encoder. In particular, the encoder consists of a multi-layer Transformer (Vaswani et al., 2017) whose parameters are initialized with a model pretrained for masked language modeling and next sentence prediction on multiple languages. In this work, we treat such an encoder as a black-box function $x_{\mathrm{BERT}} = \mathrm{BERT}(x) \in \mathbb{R}^e$ that encodes tokens into multi-lingual contextualized representations. The classifier, on the other hand, is an affine layer. In other words, the probability over classes $y$ is computed as $y = \mathrm{softmax}(W x_{\mathrm{BERT}} + b)$. Crucially, parameter factorization takes place on the space of parameters for task-language-specific classifiers $\theta_{ij} = \{W_{ij}, b_{ij}\}$ only, whereas the parameters of the encoder $\theta_{\mathrm{BERT}}$ are shared across all task-language combinations and fine-tuned through maximum-likelihood optimization during training. Throughout this paper, by parameters we refer to those of the classifier, unless otherwise indicated.
In each training iteration, we randomly sample a task $t_i \in T$ and language $l_j \in L$ among seen combinations, and randomly select a batch of examples from the dataset of such a combination. Based on the generative model of Figure 1, the variational family of each task and language is a multivariate Gaussian. In order to allow the gradient to flow through, we generate samples for the latent variables through the re-parametrization trick (Blundell et al., 2015): $t_i = \mu_{t_i} + \sigma_{t_i} \odot \epsilon$ and $l_j = \mu_{l_j} + \sigma_{l_j} \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is drawn independently for each variable and $\odot$ means element-wise multiplication. The variances on the diagonal of the co-variance matrix must be non-negative. Hence, we obtain the diagonal standard deviation $\sigma$ as $\ln(1 + \exp(\rho))$. We place a prior $\mathcal{N}(0, S_t)$ on each task and a prior $\mathcal{N}(0, S_l)$ on each language, where $S_t$ and $S_l$ are hyper-parameters.
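The sampling step can be sketched in a few lines. The example below is a minimal illustration in plain Python with hypothetical function names; in an actual implementation the same arithmetic would be carried out inside an automatic differentiation framework, so that gradients reach $\mu$ and $\rho$ through the sample.

```python
import math
import random


def softplus(rho):
    """Numerically stable ln(1 + exp(rho)); the result is always positive."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))


def reparametrize(mu, rho, rng):
    """Return mu + softplus(rho) * eps element-wise, with eps ~ N(0, I)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]
```

Because the noise is drawn separately from the variational parameters, the sample is a deterministic function of $\mu$ and $\rho$ once $\epsilon$ is fixed, which is what makes the gradient well defined.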
The mean $\mu_{\theta_{ij}}$ and (diagonal) log-variance $\rho_{\theta_{ij}}$ for the parameters $\theta_{ij}$ for $t_i$ and $l_j$ are generated through a pair of deep feed-forward neural networks $f_\psi : \mathbb{R}^h \to \mathbb{R}^d$ and $f_\phi : \mathbb{R}^h \to \mathbb{R}^d$ parametrized by $\psi$ and $\phi$, respectively, similarly to Kingma and Welling (2014). The networks $f_\psi$ and $f_\phi$ take as input features based on the task and language samples, namely $t \oplus l \oplus (t - l) \oplus (t \odot l)$, where $\oplus$ stands for concatenation. As mentioned before, $\theta_{ij}$ is a concatenation of a weight $W_{ij} \in \mathbb{R}^{e \times c_i}$ and a bias $b_{ij} \in \mathbb{R}^{c_i}$. Hence the number of parameters $d = e \times c_i + c_i$ depends on the dimensionality of the contextualized token embeddings $e$ and the number of task-specific classes $c_i$. We place a Gaussian prior $\mathcal{N}(0, S_{\theta_{ij}})$ on the parameters. $S_\theta$, as well as the number of hidden layers and the hidden size of $f_\psi$ and $f_\phi$, are hyper-parameters. We tie the parameters $\psi$ and $\phi$ for all layers save the last for faster training.
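The input features and the layout of the generated parameters can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, and the weight matrix is stored as $c$ rows of length $e$ (the transpose of $W \in \mathbb{R}^{e \times c}$) purely for convenience.

```python
def pair_features(t, l):
    """Concatenate t, l, t - l and the element-wise product of t and l."""
    diff = [a - b for a, b in zip(t, l)]
    prod = [a * b for a, b in zip(t, l)]
    return list(t) + list(l) + diff + prod


def num_params(e, c):
    """Size d of one affine classifier: e * c weights plus c biases."""
    return e * c + c


def unpack_classifier(theta, e, c):
    """Split a flat theta of length e * c + c into weight rows and a bias."""
    assert len(theta) == num_params(e, c)
    weights = [theta[k * e:(k + 1) * e] for k in range(c)]  # c rows of length e
    bias = theta[e * c:]
    return weights, bias
```

As an illustration, with $e = 768$ and the 17 universal POS tags of UD, `num_params(768, 17)` gives 13,073 parameters for a single task-language classifier.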
Finally, the parameters $\theta_{ij}$ are sampled from $\mathcal{N}(\mu_{\theta_{ij}}, \ln(1 + \exp(\rho_{\theta_{ij}})))$, again through the re-parametrization trick. These are used as parameters for the affine classifier layer to generate a distribution over classes $y$ for every token representation $x_{\mathrm{BERT}}$. During training, we optimize the following parameters of the network: $\{\mu_t, \rho_t, \mu_l, \rho_l, \psi, \phi\}$. We perform zero-shot predictions on examples from unseen task-language pairs by plugging in the mean of the latent variable estimates.

Low-rank Co-variance
The co-variance matrix of latent variables is often taken to be diagonal in variational approximations (Kingma and Welling, 2014; Blundell et al., 2015), in order to make computations in the model feasible. In fact, a full-rank co-variance matrix $\Sigma$ would: i) be too massive to store in memory; and ii) require $O(h^2)$ time to sample from the distribution it defines, where $h$ is the dimensionality of the Gaussian. In contrast, a diagonal co-variance matrix makes computation feasible with a complexity of $O(h)$; this, however, comes at the cost of not letting parameters influence each other, and thus failing to capture their complex interactions.
To avoid these sub-optimal solutions, we turn to a factored co-variance matrix, in the spirit of Miller et al. (2017) and Ong et al. (2018). Such a factored structure offers a balanced solution, being parsimonious in memory and time complexity while having non-zero off-diagonal entries. In particular, we factorize our co-variance matrices $\Sigma_{t_i}$ and $\Sigma_{l_j}$ as:

$$\Sigma = \sigma^2 I + B B^\top$$

which is the product of a matrix $B \in \mathbb{R}^{h \times k}$ of rank $k$ with its own transpose, plus the diagonal entries $\sigma^2$. The complexity of sampling from a multivariate normal distribution with such a co-variance matrix is $O(kh)$, which is tractable for suitably low $k$ as it does not require calculating the full matrix explicitly. In particular, through the re-parametrization trick, a sample takes the form:

$$\mu + \sigma \odot \epsilon + B \zeta$$

where $\epsilon \in \mathbb{R}^h$, $\zeta \in \mathbb{R}^k$, and both are sampled from $\mathcal{N}(0, I)$. Moreover, the KL divergence terms of the ELBO in Equation (2) can be computed analytically without explicitly calculating the low-rank co-variance matrix, provided that $p(\cdot) = \mathcal{N}(0, I)$ for the prior $p(\cdot)$:

$$\mathrm{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \Big[ \sum_{i=1}^{h} \Big( \mu_i^2 + \sigma_i^2 + \sum_{j=1}^{k} B_{ij}^2 \Big) - h - \ln \det(\Sigma) \Big]$$

The last term can be computed without building the full matrix explicitly thanks to the generalization of the matrix determinant lemma,⁷ which, applied to the factored co-variance structure, yields:

$$\ln \det(\sigma^2 I + B B^\top) = \ln \det\big(I_k + B^\top (\sigma^2 I)^{-1} B\big) + \sum_{i=1}^{h} \ln \sigma_i^2$$

where $I_k \in \mathbb{R}^{k \times k}$ is the identity matrix.
⁷ $\det(A + UV) = \det(I + V A^{-1} U) \cdot \det(A)$. Note that the lemma assumes that $A$ is invertible.
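The low-rank computations can be checked numerically on a small example. The sketch below, written in plain Python with hypothetical function names, draws a sample in $O(kh)$ time and evaluates the log-determinant through the $k \times k$ matrix given by the matrix determinant lemma; the generic `det` routine is included only so that the result can be compared against the explicit $h \times h$ co-variance in a test.

```python
import math
import random


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def sample_low_rank(mu, sigma, B, rng):
    """Draw mu + sigma * eps + B zeta without building the h x h matrix."""
    h, k = len(mu), len(B[0])
    eps = [rng.gauss(0.0, 1.0) for _ in range(h)]
    zeta = [rng.gauss(0.0, 1.0) for _ in range(k)]
    return [mu[i] + sigma[i] * eps[i] + sum(B[i][j] * zeta[j] for j in range(k))
            for i in range(h)]


def logdet_low_rank(sigma, B):
    """ln det(diag(sigma^2) + B B^T) via the matrix determinant lemma."""
    h, k = len(sigma), len(B[0])
    # Small k x k matrix I_k + B^T diag(sigma^-2) B.
    small = [[(1.0 if p == q else 0.0)
              + sum(B[i][p] * B[i][q] / sigma[i] ** 2 for i in range(h))
              for q in range(k)] for p in range(k)]
    return math.log(det(small)) + sum(2.0 * math.log(s) for s in sigma)


def kl_low_rank_to_standard(mu, sigma, B):
    """KL( N(mu, diag(sigma^2) + B B^T) || N(0, I) )."""
    trace = sum(s * s for s in sigma) + sum(b * b for row in B for b in row)
    return 0.5 * (trace + sum(m * m for m in mu) - len(mu)
                  - logdet_low_rank(sigma, B))
```

When $B$ is the zero matrix, `kl_low_rank_to_standard` reduces to the diagonal case, which offers a quick consistency check between the two inference schemes.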

Experimental Setup
Data. We select NER and POS tagging as our experimental tasks because their datasets encompass an ample and diverse sample of languages. In particular, we opt for WikiANN (Pan et al., 2017) for the NER task and Universal Dependencies 2.4 (UD; Nivre et al., 2019) for POS tagging. Our sample of languages results from the intersection of those available in WikiANN and UD. However, this sample is heavily biased towards the Indo-European family (Gerz et al., 2018). Instead, the selection should be: i) typologically diverse, to ensure that our model generalizes well; ii) focused on low-resource languages, to recreate a realistic setting. Hence, we further filter the languages in order to make the sample more balanced. In particular, we exclude all resource-rich Indo-European languages.
In order to simulate a zero-shot setting, we hold out half of all possible task-language combinations in 2 distinct runs and regard them as unseen, while treating the others as seen combinations. The partition is performed in such a way that a held-out combination has data available for the same task in a different language, and for the same language in a different task. We randomly split the WikiANN datasets into training, development, and test portions with a proportion of 80-10-10. We use the provided splits for Universal Dependencies; if the training set for a language is missing, we treat the test set as such when the language is held out, and as a training set when it is among the seen combinations.
Hyper-parameters. The multilingual M-BERT encoder is initialized with parameters pre-trained on masked language modeling and next sentence prediction on 104 languages (Devlin et al., 2019).¹⁰ We opt for the cased BERT-BASE architecture, which consists of 12 layers with 12 attention heads and a hidden size of 768. As a consequence, this is also the dimension $e$ of each encoded WordPiece unit.¹¹ The dimension $h$ of the multivariate Gaussian for task and language latent variables is set to 100. The deep feed-forward networks $f_\psi$ and $f_\phi$ have 6 layers, with a hidden size of 400 for the first layer and 768 for the internal layers.
The expectations in Equation (2) are approximated through Monte Carlo estimation with 3 samples per batch during training. The KL terms are weighted with $1/|B|$ uniformly across training, where $|B|$ is the number of mini-batches.¹² All the $\mu$ parameters are initialized with a random sample from $\mathcal{N}(0, 0.1)$, whereas $\rho$ and $B$ are initialized with a random sample from $\mathcal{U}(0, 0.5)$, similarly to Stolee and Patterson (2019). We place a prior of $\mathcal{N}(0, I)$ over $t$, $l$, and $\theta$. The factor co-variance matrix has a rank $k = 10$ to fit in memory.
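The resulting per-batch objective can be summarized with a short sketch. The function name and argument layout are assumptions made for illustration; the log-likelihood values would come from forward passes with different posterior samples, and the KL values from the analytical expressions.

```python
def minibatch_objective(log_lik_samples, kl_terms, num_batches):
    """Monte Carlo ELBO estimate for one mini-batch (to be maximized).

    log_lik_samples: batch log-likelihoods, one per posterior sample
    kl_terms: analytical KL values for the task, language and parameter terms
    num_batches: total number of mini-batches |B|, used to down-weight the KL
    """
    expected_log_lik = sum(log_lik_samples) / len(log_lik_samples)
    return expected_log_lik - sum(kl_terms) / num_batches
```

Dividing the KL terms by the number of mini-batches means that, summed over one pass through the data, each KL term contributes exactly once to the objective.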
The maximum sequence length for inputs is limited to 250. The batch size is set to 8, and the best setting for the Adam optimizer (Kingma and Ba, 2015) was found to be an initial learning rate of $5 \cdot 10^{-6}$ and an $\epsilon = 10^{-8}$ based on grid search. In order to avoid over-fitting, we perform early stopping with a patience of 10 and a validation frequency of 2.5K steps.
Baselines. In this work, we propose a Bayesian generative model over a structured parameter space for neural models. In particular, we explore an approximate inference scheme with diagonal co-variance, hereafter called Parameter Factorization (PF), and another with Factor Co-variance (FC). As baselines, we consider two widespread approaches for cross-lingual transfer. Both of them are implemented by sharing the BERT encoder across all languages while dedicating a private affine classifier to each task-language combination.
Transfer from the Nearest Source (NS) language selects the most compatible source to a target language in terms of similarity. In particular, the selection can be based on family membership (Zeman and Resnik, 2008; Cotterell and Heigold, 2017; Kann et al., 2017), typological features (Deri and Knight, 2016), KL-divergence between part-of-speech trigrams (Rosa and Zabokrtsky, 2015; Agić, 2017), tree edit distance of delexicalized dependency parses, or a combination of the above (Lin et al., 2019). In our work, for prediction on each held-out language, we use the classifier associated with the observed language with the highest cosine similarity in terms of typological features. These features are sourced from URIEL and contain information about family, area, syntax, and phonology.
¹⁰ Available at github.com/google-research/bert/blob/master/multilingual.md
¹¹ A WordPiece is a sub-word unit obtained through BPE (Wu et al., 2016).
¹² We found this weighting strategy to work better than annealing as proposed by Blundell et al. (2015).
The second baseline is transfer from the Largest Source (LS) language, i.e. the language with most training examples, which is usually English. This approach has been adopted by several recent works on cross-lingual transfer (Conneau et al., 2018; Artetxe et al., 2019, inter alia). In our implementation, we always select the English classifier for prediction. In order to make this baseline comparable to our model, we adjust the number of English NER training examples to the sum of the examples available for all seen languages S. It must be noted that, for both baselines, the number of parameters of each task-language-specific classifier is lower than that of our proposed model. However, increasing the depth of such a network is detrimental if the BERT encoder parameters are kept trainable, which we also verified in our experiments.
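The source selection of the NS baseline amounts to a nearest-neighbour search under cosine similarity. The sketch below is illustrative: the function names are hypothetical, and the feature vectors in the usage note are made-up binary vectors rather than actual URIEL entries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def nearest_source(target, sources, features):
    """Pick the seen language whose features are closest to the target's."""
    return max(sources, key=lambda s: cosine(features[target], features[s]))
```

For example, given `features = {"tgt": [1, 0, 1, 1], "a": [1, 0, 1, 0], "b": [0, 1, 0, 0]}`, the call `nearest_source("tgt", ["a", "b"], features)` selects `"a"`, whose classifier would then be used for the held-out language.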

Zero-shot Transfer
Firstly, we present the results for zero-shot prediction of the generative model of the parameter space and the two approximate inference schemes (with diagonal co-variance, PF, and with factor co-variance, FC). Table 1 summarizes the results on the two tasks of POS tagging and NER averaged across all languages. Our model outperforms both baselines by a large margin on both tasks. In particular, our model gains +4.49 in accuracy (+6.93%) for POS tagging and +7.73 in F1 score (+9.80%) for NER on average compared to transfer from the largest source (LS), the strongest baseline.
More details about the individual results on each task-language combination are depicted in Figure 2, which reports the results over 3 separate runs. Overall, we obtain improvements in 31/33 languages for NER and on 36/45 treebanks for POS tagging, which further supports the benefits of transferring both from tasks and languages. Selecting the best and worst performances of FC compared to LS, the strongest baseline, we report +30.08 in F1 score (+51%) for Kurmanji and -5.72 (-10.37%) for Amharic on NER; +29.91 in accuracy (+119.71%) for Uyghur and -1.72 (-12.07%) for Guaraní on POS tagging. Considering the baselines, the relative performance of LS versus NS is an interesting finding per se. LS largely outperforms NS on both POS tagging and NER. This suggests that having more data is more informative than taking into account similarity based on linguistic properties. This finding contradicts the hypothesis formulated by Rosa and Zabokrtsky (2015), Cotterell and Heigold (2017), and Lin et al. (2019), inter alia, that related languages tend to be the most reliable source. We conjecture that this is due to the pre-trained multi-lingual BERT encoder, which effectively bridges the gap between unrelated languages.
Secondly, comparing the two approximate inference schemes, FC obtains a small but statistically significant improvement over PF in NER, whereas they achieve the same performance on POS tagging. This suggests that, for POS tagging, the posterior is modeled well enough by a Gaussian with diagonal co-variance, such that a richer co-variance structure is not needed.
Finally, we note that even for the best model (FC) there is a wide variation in the scores for the same task across languages. POS tagging accuracy ranges from 12.56 ± 4.07 in Guaraní to 86.71 ± 0.67 in Galician, and NER F1 scores range from 49.44 ± 0.69 in Amharic to 96.20 ± 0.11 in Upper Sorbian. Part of this variation is explained by the fact that the multilingual BERT encoder is not pre-trained on a subset of these languages (which includes Amharic, Guaraní, Uyghur, and Assyrian Neo-Aramaic). Another cause of variance is more straightforward: the scores are expected to be lower in languages for which we have fewer training examples in the seen task-language combinations (e.g., Yoruba, Wolof, Armenian).

Visualization of the Learned Posteriors
The approximate posteriors of the latent variables can be visualized in order to study the learned representations for languages. Previous works (Johnson et al., 2017; Östling and Tiedemann, 2017; Malaviya et al., 2017; Bjerva and Augenstein, 2018) induced point estimates of language representations from artificial tokens concatenated to every input sentence, or from the aggregated values of the hidden state of a neural encoder. The information contained in such representations depends on the task (Bjerva and Augenstein, 2018), but mainly reflects the structural properties of each language (Bjerva et al., 2019). In our work, due to the estimation procedure, languages are represented by full distributions rather than point estimates. In the learned representations, language similarities appear to follow higher-level properties of languages, rather than structural properties. This is most likely due to the fact that parameter factorization takes place after the multi-lingual BERT encoding, which blends the structural differences across languages. A fair comparison with previous works without such an encoder is left for future investigation.
Figure 4 (panels: POS, NER): Entropy of the predicted distributions over classes for each test example. The higher the entropy, the more uncertain the prediction.
As an example, consider two pairs of languages from two distinct families: Yoruba and Wolof are Niger-Congo from the Atlantic-Congo branch, Tamil and Telugu are Dravidian. We take 1,000 samples from the approximate posterior over the latent variables for each of these languages. In particular, we focus on the variational scheme with a Gaussian with low-rank co-variance structure. Subsequently, we reduce the dimensionality of each sample to 4 through PCA,¹⁴ and we plot the density along each resulting dimension in Figure 3. As is evident, the density areas of each dimension do not necessarily overlap between members of the same family. Hence, the learned representations do not appear to depend on genealogical properties. We leave it to future work to probe which information they contain instead.

Entropy of the Predictions
A notable problem of point estimate methods is their tendency to assign most of the probability mass to a single class even in scenarios with high uncertainty. Zero-shot transfer is one such scenario, because it involves drastic distribution shifts in the data (Rabanser et al., 2019). A key advantage of Bayesian inference, instead, is marginalization over parameters, which yields smoother predictions (Kendall and Gal, 2017; Wilson, 2019).
¹⁴ Note that the dimensionality-reduced samples are also Gaussian, since PCA is a linear method.
We run an analysis on predictions based on (approximate) Bayesian model averaging. First, we randomly sample 800 examples from each test set of a task-language combination. For each example, we predict a distribution over classes $Y$ through model averaging based on 10 samples from the posteriors. We then measure the prediction entropy of each example, i.e. $H(p) = -\sum_{y=1}^{|Y|} p(Y = y) \ln p(Y = y)$, whose plot is shown in Figure 4.
Entropy is a measure of uncertainty. Intuitively, the uniform categorical distribution (maximum uncertainty) has the highest entropy, whereas if the whole probability mass falls into a single class (maximum confidence), then the entropy $H = 0$. As emerges from Figure 4, predictions in certain languages tend to have higher entropy on average, such as in Amharic, Guaraní, Uyghur, or Assyrian Neo-Aramaic. This aligns well with the performance metrics in Figure 2. In practice, languages with low performances tend to display high entropy in the predictive distribution, as expected.
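The entropy analysis can be reproduced in miniature as follows. The sketch assumes that each posterior sample has already produced a distribution over classes for a given example; the function names are hypothetical.

```python
import math


def average_predictions(sampled_probs):
    """Average the class distributions obtained from several posterior samples."""
    n = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    return [sum(p[k] for p in sampled_probs) / n for k in range(num_classes)]


def entropy(probs):
    """H(p) = -sum_y p(y) ln p(y), with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Two posterior samples that each put all their mass on a different class yield an averaged distribution with entropy $\ln 2$, whereas either sample on its own has entropy 0: model averaging exposes the disagreement that a single point estimate would hide.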

Related Work
Data Matrix Factorization. Although we are the first to propose a factorization of the parameter space for unseen combinations of tasks and languages, the factorization of data for collaborative filtering and social recommendation is an established research area. In particular, the missing values in sparse data structures such as user-movie review matrices can be filled via probabilistic matrix factorization (PMF) through a linear combination of user and movie matrices (Ma et al., 2008; Shan and Banerjee, 2010, inter alia) or through neural networks (Dziugaite and Roy, 2015). Inference for PMF can be carried out through MAP inference (Dziugaite and Roy, 2015), Markov Chain Monte Carlo (MCMC), or stochastic VI (Stolee and Patterson, 2019). Contrary to prior work, we perform factorization on a latent variable (task-language parameters) rather than an observed one (data), which requires a different model.

Contextual Parameter Generation. Our model is reminiscent of the idea that parameters can be conditioned on language representations, as proposed by Platanios et al. (2018). However, since this approach is limited to a single task and a joint learning setting, it is not suitable for cross-task transfer and zero-shot predictions.
Neural Bayesian Methods are especially suited for cross-lingual transfer learning, but so far they have found only limited application in this research area. Firstly, they incorporate priors over parameters: Ponti et al. (2019b) constructed a prior imbued with universal linguistic knowledge for zero- and few-shot character-level language modeling. Secondly, they avoid the risk of over-fitting and take into account parameter uncertainty through model averaging. For instance, Shareghi et al. (2019) and Doitch et al. (2019) use a perturbation model to sample high-quality and diverse solutions for structured prediction in cross-lingual parsing.

Conclusion and Future Work
The main contribution of our work is a Bayesian generative model of the space of neural parameters, which can be factorized according to the combination of languages and tasks. We performed inference through stochastic Variational Inference and evaluated our model on zero-shot prediction for unseen task-language combinations in two tasks, named entity recognition (NER) and part-of-speech (POS) tagging, across a typologically diverse set of 33 languages. Based on these results, we conclude that leveraging the information from tasks and languages simultaneously is superior to model transfer from English (relying on more abundant in-task data in the source language) or from the most typologically similar language (relying on prior information on language similarity). On average, we report improvements of 4.49 in POS tagging accuracy and 7.73 in NER F1 score over the strongest baseline. As a consequence, our approach holds promise for alleviating data paucity issues for a wide spectrum of languages and tasks.
In the future, we will port a similar approach to multilingual tasks beyond sequence labelling, such as Natural Language Inference (Conneau et al., 2018) and Question Answering (Lewis et al., 2019). Moreover, one exciting research path is extending the framework of parameter space factorization to also take into account multiple modalities (e.g., speech, text, vision).