## Abstract

Orthographic similarities across languages provide a strong signal for unsupervised probabilistic transduction (decipherment) for closely related language pairs. The existing decipherment models, however, are not well suited for exploiting these orthographic similarities. We propose a log-linear model with latent variables that incorporates orthographic similarity features. Maximum likelihood training is computationally expensive for the proposed log-linear model. To address this challenge, we perform approximate inference via Markov chain Monte Carlo sampling and contrastive divergence. Our results show that the proposed log-linear model with contrastive divergence outperforms the existing generative decipherment models by exploiting the orthographic features. The model both scales to large vocabularies and preserves accuracy in low- and no-resource contexts.

## 1. Introduction

Word-level translation models are typically learned by applying statistical word alignment algorithms on large-scale bilingual parallel corpora (Brown et al. 1993). Building a parallel corpus, however, is expensive and time-consuming. As a result, parallel data are limited or even unavailable for many language pairs. In the absence of a sufficient amount of parallel data, the accuracy of standard word alignment algorithms drops significantly. This is also true of supervised neural methods: Even with hundreds of thousands of parallel training sentences, neural methods only achieve modest results (Zoph et al. 2016). Low- and no-resource languages generally do not have parallel corpora, and even their monolingual corpora tend to be small. However, these monolingual corpora can often be downloaded from the Internet, and are much easier to obtain or produce than parallel corpora. Leveraging useful information from monolingual corpora can be extremely helpful for learning translation models for low- and no-resource language pairs.

Decipherment algorithms (so-called because of the assumption that one language is a cipher for the other) aim to exploit such monolingual corpora in order to learn translation model parameters when parallel data are limited or unavailable (Koehn and Knight 2000; Ravi and Knight 2011; Dou, Vaswani, and Knight 2014). The key intuition is that similar words and *n*-grams tend to have similar distributional properties across languages. For example, if a bigram appears frequently in the monolingual source corpus, its translation is likely to appear frequently in the monolingual target corpus, and vice versa. This is especially true when the corpora share similar topics and context. Furthermore, for many such language pairs, we observe similar monotonic word ordering; that is, the translation of a bigram is often the same as the concatenation of the translations of individual unigrams (consider the shared use of postnominal adjectives in the French *maison bleue* and Spanish *casa azul*). Although this certainly is not always true, we assume that it is common enough to provide a useful signal. The goal of decipherment algorithms is to leverage such statistical similarities across languages, and effectively learn word-level translation probabilities from monolingual data.

Existing decipherment methods are predominantly based on probabilistic generative models (Koehn and Knight 2000; Ravi and Knight 2011; Dou and Knight 2012; Nuhn and Ney 2014). These models primarily focus on the statistical similarities between the *n*-gram frequencies in the source and the target language, and rely on the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) or its faster approximations. However, there can be many other types of statistical and linguistic similarities across languages beyond *n*-gram frequencies (similarities in spelling, word-length distribution, syntactic structure, etc.). Unfortunately, existing generative models do not allow incorporating such a wide range of linguistically motivated features. Previous research has shown the effectiveness of incorporating linguistically motivated features for many different unsupervised learning tasks, such as unsupervised part-of-speech induction (Haghighi and Klein 2006; Berg-Kirkpatrick et al. 2010), word alignment (Dyer et al. 2011; Ammar, Dyer, and Smith 2014), and grammar induction (Berg-Kirkpatrick et al. 2010).

Many pairs of related languages share vocabulary or grammatical structure due to borrowing or inheritance: the English *aquatic* and Spanish *agua* share the Latin root *aqua*, and the English *beige* was borrowed from French. As a result, orthographic features provide crucial information for determining word-level translations for closely related language pairs. Church (1993) leveraged orthographic similarity for character alignment. Haghighi, Berg-Kirkpatrick, and Klein (2008) proposed a generative model for inducing a bilingual lexicon from monolingual text by exploiting orthographic and contextual similarities among the words in two different languages. The model proposed by Haghighi et al. learns a one-to-one mapping between the words in two languages by analyzing type-level features only, while ignoring the token-level *n*-gram frequencies. We propose a decipherment model that unifies the type-level feature-based approach of Haghighi et al. with token-level EM-based approaches such as Koehn and Knight (2000) and Ravi and Knight (2011).

In addition to orthographic similarity, we also often observe similarity in the distribution of word lengths across different languages. Linguists have long noted the relationship between word frequency and length (Zipf 1949), so the tendency of words and their translations to have similar frequencies (Rapp 1995) may apply to length as well. Our feature-rich log-linear model can easily incorporate such length-based similarity features.

One of the key challenges with the proposed latent variable log-linear model is the high computational complexity of training, as it requires normalizing globally via summing over all possible observations and latent variables. As a result, an exact implementation is impractical even for the moderate vocabulary size of most low-resource languages. To address this challenge, we perform approximate inference using Markov chain Monte Carlo (MCMC) sampling for scalable training of the log-linear decipherment models. We present a series of increasingly scalable approximations, each most suitable for a different amount of available data. They are applicable in contexts ranging from no-resource languages (such as “lost” languages, a context considered by Snyder, Barzilay, and Knight [2010]) to languages with a modest amount of data that is still insufficient for state-of-the-art unsupervised methods based on word embeddings.

The main contributions of this article are as follows.

- We propose a feature-based decipherment model for low- and no-resource languages that combines both type-level orthographic features and token-level distributional similarities. Our proposed model outperforms the existing EM-based decipherment models.
- We apply three different MCMC sampling strategies for scalable training and compare them in terms of running time and accuracy. Our results show that contrastive divergence (Hinton 2002)–based MCMC sampling can dramatically improve the speed of the training, while achieving comparable accuracy.
- We extend the contrastive divergence method to sample entire sentences, rather than bigram pairs, allowing more context to be used in reconstructing latent translations.
- Finally, we extend the model to exploit parallel as well as monolingual data, for situations in which limited amounts of parallel data may be available.

The remainder of the article is organized as follows. In Section 2, we introduce the general problem formulation for monolingual decipherment, and present our notations. We discuss the background literature on different decipherment models for machine translation in Section 3. Section 4 describes the proposed feature-based decipherment model. A detailed discussion of MCMC sampling–based approximations follows in Section 5. We extend the fully monolingual model to exploit parallel data in Section 6. Our orthographic features are described in Section 7. Finally, we present our detailed results in Section 8 and conclude with our findings and discuss our future work in Section 9.

## 2. Problem Formulation

Given a source text $\mathcal{F}$ and an independent target corpus $\mathcal{E}$, our goal is to translate the source text by learning the mapping between the words in the source and the target language. Although the sentences in the source and target corpus are independent of each other, there exist distributional and lexical similarities among the words of the two languages. We aim to automatically learn the translation probabilities $p(f \mid e)$ for all source words $f$ and target words $e$ by exploiting the similarities between the bigrams in $\mathcal{F}$ and $\mathcal{E}$.

As a simplification step, we break down the sentences in the source and target corpus into collections of bigrams. Let $\mathcal{F}$ contain a collection of source bigrams $f_1 f_2$, and $\mathcal{E}$ contain a collection of target bigrams $e_1 e_2$. Let the source and target vocabulary be $V_F$ and $V_E$, respectively. Let $N_F$ and $N_E$ be the number of unique bigrams in $\mathcal{F}$ and $\mathcal{E}$, respectively. We assume that the corpus $\mathcal{F}$ is an encrypted version of a plaintext in the target language. Each source word $f \in V_F$ is obtained by substituting one of the words $e \in V_E$ in the plaintext. However, the mappings between the words in the two languages are unknown, and are learned as latent variables. Table 1 summarizes the notations and symbols used in this article.

| Symbol | Meaning |
|---|---|
| $N_F$ | Number of unique source bigrams |
| $N_E$ | Number of unique target bigrams |
| $V_F$ | Source vocabulary |
| $V_E$ | Target vocabulary |
| $V$ | $\max(\lvert V_F\rvert, \lvert V_E\rvert)$ |
| $N$ | Number of samples |
| $K$ | Beam size for precomputed lists |
| $\phi$ | Unigram-level feature function |
| $\mathbf{\Phi}$ | Bigram-level feature function: $\mathbf{\Phi} = \phi_1 + \phi_2$ |


## 3. Background Research

Each source bigram $f_1 f_2$ in $\mathcal{F}$ is generated by first generating a target bigram $e_1 e_2$ according to the target language model, and then substituting $e_1$ and $e_2$ with $f_1$ and $f_2$, respectively. The generative process is typically modeled via a hidden Markov model (HMM), as shown in Figure 1(a). The target bigram language model $p(e_1 e_2)$ is trained from the given monolingual target corpus $\mathcal{E}$. The unknown translation probabilities $p(f \mid e)$ are learned by maximizing the likelihood of the observed source corpus $\mathcal{F}$:

$$\prod_{f_1 f_2 \in \mathcal{F}} p(f_1 f_2) = \prod_{f_1 f_2 \in \mathcal{F}} \sum_{e_1 e_2} p(e_1 e_2)\, p(f_1 \mid e_1)\, p(f_2 \mid e_2),$$

where $e_1$ and $e_2$ are the latent variables, indicating the target words in $V_E$ corresponding to $f_1$ and $f_2$, respectively. The log-likelihood function with latent variables is non-convex, and several methods have been proposed for maximizing it. In this work, we seek to combine a number of them for improved performance.

### 3.1 Expectation Maximization (EM)

The expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) has been widely applied for solving the decipherment problem (Knight and Graehl 1998; Knight and Yamada 1999; Koehn and Knight 2000). In the E-step, for each source bigram $f_1 f_2$, we estimate the expected counts of the latent variables $e_1$ and $e_2$ over all the target words in $V_E$. In the M-step, the expected counts are normalized to obtain the translation probabilities $p(f \mid e)$. The computational complexity of the EM algorithm is $O(N_F V^2)$ and the memory complexity is $O(V^2)$, where $N_F$ is the number of unique bigrams in $\mathcal{F}$ and $V = \max(\lvert V_F\rvert, \lvert V_E\rvert)$. As a result, the regular EM algorithm does not scale well to large vocabulary sizes, both in terms of running time and memory.
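The E-step and M-step described above can be illustrated with a minimal sketch on a toy problem. The function name `em_decipher`, the dictionary-based data layout, and the uniform initialization are assumptions made for illustration; they are not taken from the cited systems.

```python
from collections import defaultdict
from itertools import product


def em_decipher(source_bigrams, lm, v_f, v_e, iterations=20):
    """Estimate p(f|e) from bigram counts {(f1, f2): n} and a target LM {(e1, e2): p}."""
    # Start from a uniform translation table p(f|e).
    t = {(f, e): 1.0 / len(v_f) for f in v_f for e in v_e}
    for _ in range(iterations):
        counts = defaultdict(float)
        for (f1, f2), n in source_bigrams.items():
            # E-step: posterior over all |V_E|^2 latent target bigrams.
            post = {
                (e1, e2): lm.get((e1, e2), 0.0) * t[f1, e1] * t[f2, e2]
                for e1, e2 in product(v_e, repeat=2)
            }
            z = sum(post.values())
            if z == 0.0:
                continue  # bigram cannot be explained under the current table
            for (e1, e2), p in post.items():
                counts[f1, e1] += n * p / z
                counts[f2, e2] += n * p / z
        # M-step: renormalize the expected counts for each target word e.
        totals = defaultdict(float)
        for (f, e), c in counts.items():
            totals[e] += c
        t = {
            (f, e): counts[f, e] / totals[e] if totals[e] > 0 else 1.0 / len(v_f)
            for f in v_f for e in v_e
        }
    return t


# Toy usage: one observed source bigram and a two-word target language model.
toy_table = em_decipher(
    {("la", "casa"): 10},
    {("the", "house"): 0.9, ("house", "the"): 0.1},
    ["la", "casa"],
    ["the", "house"],
)
```

The inner loop over `product(v_e, repeat=2)` runs for every unique source bigram, which is where the $O(N_F V^2)$ running time comes from, and the dense table `t` accounts for the $O(V^2)$ memory.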

To address this challenge, Ravi and Knight (2011) proposed the iterative EM algorithm, which starts with the $K$ most frequent words from $\mathcal{F}$ and $\mathcal{E}$ and performs EM-based decipherment. Next, the source and target vocabularies are iteratively extended by $K$ new words, while pruning low-probability entries from the probability table. The computational complexity of each iteration becomes $O(N_F K^2)$.

### 3.2 Bayesian Decipherment Using Gibbs Sampling

Ravi and Knight (2011) proposed a Gibbs sampling–based Bayesian decipherment strategy. For each observed source bigram *f*_{1}*f*_{2}, the Gibbs sampling approach starts with an initial target bigram *e*_{1}*e*_{2}, and alternately fixes one of the target words and replaces the other with a randomly chosen sample. When *e*_{1} is fixed, a new sample *e*_{2}^{new} is drawn with probability proportional to *p*(*e*_{1}*e*_{2}^{new})*p*(*f*_{2}|*e*_{2}^{new}). Next, we fix *e*_{2} and sample *e*_{1}^{new}, and continue alternating until *n* samples are collected. Bayesian decipherment reduces memory consumption via Gibbs sampling. The probability table remains sparse, because only a small number of word pairs (*f*, *e*) will be observed together in the samples.

### 3.3 Slice Sampling

Each Gibbs sampling operation requires estimating the probability of choosing every target word *e* ∈ *V*_{E}, which requires *O*(*V*) operations. To address this issue, Dou and Knight (2012) proposed a slice sampling approach with precomputed top-*K* lists for *p*(*e*|*f*) and *p*(*e*_{1}*e*_{2}). Slice sampling involves selecting a threshold *T* between 0 and the probability of the current sample, and then uniformly picking a random new sample from all candidates with probability greater than *T*. Using sorted top-*K* lists makes this faster than Gibbs sampling on average, although sometimes the top-*K* lists fail to provide all the candidates, in which case the method has to fall back to sampling from the entire vocabulary, which requires *O*(*V*) operations.

### 3.4 Beam Search

Nuhn and colleagues (Nuhn, Schamper, and Ney 2013; Nuhn and Ney 2014; Nuhn, Schamper, and Ney 2015) showed that beam search can significantly improve the speed of EM-based decipherment, while providing comparable or even slightly better accuracy. Beam search prunes less-promising latent states by maintaining two constant-sized beams, one for the translation probabilities *p*(*f*|*e*) and one for the target bigram probabilities *p*(*e*_{1}*e*_{2}), which reduces the computational complexity to *O*(*N*_{F}). Furthermore, it saves memory, as many of the word pairs (*f*, *e*) fall outside the beam and are never considered.

### 3.5 Feature-Based Generative Models

Haghighi, Berg-Kirkpatrick, and Klein (2008) proposed a canonical correlation analysis–based model for automatically learning the mapping between the words in two languages from monolingual corpora only. They used orthographic information (character substring features) and context information (co-occurrence statistics within a window) for their features; we use edit distance as our orthographic information, and we operate on bigrams for our context information. Although their model uses an EM-style algorithm, it does not iterate over the corpus data.

Ravi (2013) proposed a Bayesian decipherment model based on hash sampling, which takes advantage of feature-based similarities between source and target words. However, the feature representation was not integrated with their decipherment model, and was only used for efficiently sampling candidate target translations for each source word. Furthermore, the feature-based hash sampling included only contextual features (in the form of *n*-gram co-occurrence information), and did not consider orthographic features. In contrast, our log-linear model integrates both type-level orthographic features and token-level bigram frequencies.

### 3.6 Embedding-Based Models

Recent work has explored the possibility of finding a mapping between word embedding spaces using monolingual data. Artetxe, Labaka, and Agirre (2017) use a small set of seed translations to learn this mapping. Zhang et al. (2017) do not use seed translations, but do use document-aligned Wikipedia data, and only consider words appearing at least 1,000 times. Both methods train word embeddings using data sets with millions of words, limiting their applicability to low-resource languages, and even more so to languages with the small amount of data that we experiment with in this work.

## 4. Feature-Based Decipherment

Our feature-based decipherment model is a log-linear Markov random field (MRF) that jointly models an observed source bigram $f_1 f_2$ and corresponding latent target bigram $e_1 e_2$. For each source word $f \in V_F$, we have a latent variable $e \in V_E$ indicating the corresponding target word. The joint probability distribution is:

$$p(f_1 f_2, e_1 e_2) = \frac{1}{Z_w} \exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1 f_2, e_1 e_2)\right) p(e_1 e_2),$$

where $\mathbf{\Phi}(f_1 f_2, e_1 e_2)$ is the feature function for the given source and target bigrams, $\mathbf{w}$ is the vector of model parameters, and $Z_w$ is the normalization term. We assume that the feature function decomposes into features of aligned word pairs (motivated by the observation in Section 1 that word order is generally preserved across bigram translations):

$$\mathbf{\Phi}(f_1 f_2, e_1 e_2) = \phi(f_1, e_1) + \phi(f_2, e_2).$$

The features $\phi$, which will be described in more detail in Section 7, include features for orthographic similarity as well as indicator features $\phi_{f,e}$ for each word pair. The normalization term is defined as:

$$Z_w = \sum_{f_1 f_2 \in V_F^2} \; \sum_{e_1 e_2 \in V_E^2} \exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1 f_2, e_1 e_2)\right) p(e_1 e_2).$$

This gives our model the conditional random field (CRF)–like dependency structure shown in Figure 1. In our model, however, the term $p(e_1 e_2)$ is estimated from a monolingual target corpus, and is held constant when training the weights $\mathbf{w}$.

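### 4.1 Estimating Forced Expectation

The forced expectation is the expected value of the features under the posterior distribution over latent target bigrams given an observed source bigram:

$$p(e_1 e_2 \mid f_1 f_2) = \frac{\exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1 f_2, e_1 e_2)\right) p(e_1 e_2)}{Z(f_1 f_2)},$$

where $Z(f_1 f_2)$ is the normalization term given $f_1 f_2$:

$$Z(f_1 f_2) = \sum_{e_1 e_2 \in V_E^2} \exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1 f_2, e_1 e_2)\right) p(e_1 e_2).$$

For each observed $f_1 f_2 \in \mathcal{F}$, we sum over all possible $e_1 e_2 \in V_E^2$, which requires $O(N_F V^2)$ computations.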
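As an illustration of the enumeration over all target bigrams, the following minimal sketch computes the exact forced expectation for a single observed source bigram. The toy feature function `phi`, the feature names, and the dictionary representation of the language model are assumptions made for this example and are not the feature set of Section 7.

```python
import math
from itertools import product


def phi(f, e):
    """Toy unigram features: a word-pair indicator and an exact-match indicator."""
    return {("pair", f, e): 1.0, "same": 1.0 if f == e else 0.0}


def score(w, f1, f2, e1, e2):
    """w . Phi(f1 f2, e1 e2), with Phi = phi(f1, e1) + phi(f2, e2)."""
    total = 0.0
    for f, e in ((f1, e1), (f2, e2)):
        for name, value in phi(f, e).items():
            total += w.get(name, 0.0) * value
    return total


def forced_expectation(w, f1, f2, lm, v_e):
    """Exact expected features given f1 f2, enumerating all |V_E|^2 target bigrams."""
    weights = {
        (e1, e2): math.exp(score(w, f1, f2, e1, e2)) * lm.get((e1, e2), 0.0)
        for e1, e2 in product(v_e, repeat=2)
    }
    z = sum(weights.values())  # normalization term Z(f1 f2)
    expect = {}
    for (e1, e2), wt in weights.items():
        for f, e in ((f1, e1), (f2, e2)):
            for name, value in phi(f, e).items():
                expect[name] = expect.get(name, 0.0) + (wt / z) * value
    return expect


toy_lm = {("the", "house"): 0.6, ("the", "the"): 0.15,
          ("house", "house"): 0.15, ("house", "the"): 0.1}
toy_expect = forced_expectation({}, "la", "casa", toy_lm, ["the", "house"])
```

With all weights at zero, the posterior reduces to the language model, so the expected value of the `("pair", "la", "the")` feature equals the marginal probability that the first target word is "the".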

### 4.2 Estimating Full Expectation

## 5. MCMC Sampling for Faster Training

The overall computational complexity of estimating the exact gradient is *O*(*N*_{F}*V*^{2} + *V*^{4}), which is impractical even for a modest-sized vocabulary. We therefore apply several MCMC sampling methods to approximate the forced and full expectations. We first use Gibbs sampling for both the forced and full expectation terms. We then present a faster approximation that uses independent Metropolis Hastings sampling for just the forced expectation term, followed by an even faster approximation that uses contrastive divergence to estimate both terms. Finally, we extend contrastive divergence to sample at the sentence level rather than at the bigram level, with the goal of increasing accuracy.

Computation times for the methods presented are summarized in Table 2.

| Method | Complexity |
|---|---|
| EM | $O(N_F V^2)$ |
| Feature HMM | $O(N_F V^2)$ |
| Log-linear/MRF Exact | $O(N_F V^2 + V^4)$ |
| Log-linear + Gibbs | $O(N_F V N + V N^2)$ |
| Log-linear + IMH | $O(N_F N + V N^2)$ |
| Log-linear + CD | $O(N_F N + V N^2)$ |
| Log-linear + CD, Sentence | $O(\lvert\mathcal{F}\rvert N)$ |


### 5.1 Gibbs Sampling

#### 5.1.1 Gibbs Sampling for Forced Expectation.

Rather than summing over all target bigrams $e_1 e_2$, we approximate the forced expectation by taking $N$ samples of $e_1 e_2$ for each observed $f_1 f_2$ and averaging the features of these samples. For each observed $f_1 f_2$, the following steps are taken:

- Start with an initial target bigram $e_1 e_2$.
- Fix $e_2$ and draw a new sample $e_1$ according to

  $$P(e_1 \mid e_2, f_1 f_2) = \frac{1}{Z_{gibbs}} \exp\left(\mathbf{w}^{\top} \phi(f_1, e_1)\right) p(e_1 e_2).$$

- Next, fix $e_1$ and draw a new sample $e_2$ similarly according to $P(e_2 \mid e_1, f_1 f_2)$, and continue sampling $e_1$ and $e_2$ alternately until $N$ samples are drawn.

Each sampling step requires $O(V)$ operations, as we need to estimate the normalization term $Z_{gibbs}$. The computational complexity of estimating the forced expectation becomes $O(N_F V N)$, which is expensive as $V$ can be large (and $N_F$ generally scales with $V$).
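The alternating procedure can be sketched as follows for one observed source bigram. The helper names, the toy feature function, and the absence of a burn-in period are simplifying assumptions for illustration; the conditional used for each draw is the joint model restricted to the word being resampled.

```python
import math
import random


def phi(f, e):
    """Toy unigram features: a word-pair indicator and an exact-match indicator."""
    return {("pair", f, e): 1.0, "same": 1.0 if f == e else 0.0}


def unigram_score(w, f, e):
    """w . phi(f, e) for a sparse weight dictionary w."""
    return sum(w.get(name, 0.0) * value for name, value in phi(f, e).items())


def draw(weights, rng):
    """Sample a key with probability proportional to its (unnormalized) weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys])[0]


def gibbs_forced_expectation(w, f1, f2, lm, v_e, n_samples, rng):
    """Average the features of n_samples target bigrams drawn by alternating Gibbs steps."""
    e1, e2 = rng.choice(v_e), rng.choice(v_e)
    expect = {}
    for i in range(n_samples):
        if i % 2 == 0:
            # Fix e2; each candidate e1 is scored, so one draw costs O(V).
            e1 = draw({e: math.exp(unigram_score(w, f1, e)) * lm.get((e, e2), 0.0)
                       for e in v_e}, rng)
        else:
            # Fix e1 and resample e2 in the same way.
            e2 = draw({e: math.exp(unigram_score(w, f2, e)) * lm.get((e1, e), 0.0)
                       for e in v_e}, rng)
        for f, e in ((f1, e1), (f2, e2)):
            for name, value in phi(f, e).items():
                expect[name] = expect.get(name, 0.0) + value / n_samples
    return expect


toy_lm = {("the", "house"): 0.6, ("the", "the"): 0.15,
          ("house", "house"): 0.15, ("house", "the"): 0.1}
toy_estimate = gibbs_forced_expectation(
    {}, "la", "casa", toy_lm, ["the", "house"], 2000, random.Random(0))
```

This sketch assumes a language model that gives every target bigram non-zero probability; with zero entries, a conditional can have no valid candidate and a smoothed model would be needed.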

#### 5.1.2 Gibbs Sampling for Full Expectation.

To estimate the full expectation, we sample $N$ source bigrams $f_1 f_2$ from our model. The Gibbs sampling procedure is:

- Start with an initial random $f_1 f_2$.
- Fix $f_2$ and sample $f_1$ according to $P(f_1 \mid f_2)$.
- Next, fix $f_1$ and sample $f_2$ according to $P(f_2 \mid f_1)$. Continue alternating until $N$ samples are drawn.

Computing $p(f_1 \mid f_2)$ exactly is $O(V^3)$, resulting in the computational complexity $O(V^3 N)$, which is impractical for all but the smallest vocabularies. However, rather than summing over all possible $e_1 e_2$, we can approximate via sampling. For each $f_1 f_2$, we first sample $N$ samples $e_1 e_2$ according to $p(e_1 e_2)$. Let $S$ be the set of $N$ samples of target bigrams. Next, we approximate $p(f_1 \mid f_2)$ as

$$p(f_1 \mid f_2) \approx \frac{1}{Z_S} \sum_{e_1 e_2 \in S} \exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1 f_2, e_1 e_2)\right),$$

where $Z_S = \sum_{f_1' \in V_F} \sum_{e_1 e_2 \in S} \exp\left(\mathbf{w}^{\top} \mathbf{\Phi}(f_1' f_2, e_1 e_2)\right)$. This reduces the computational complexity to $O(V N^2)$.
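One way to read the sampled approximation of $p(f_1 \mid f_2)$ is sketched below: because the set $S$ is drawn from $p(e_1 e_2)$, the language model term is replaced by a sum over the samples. The normalization over all candidate $f_1$, the restriction to word-pair indicator features, and all function names are assumptions made for this example.

```python
import math
import random


def score(w, f1, f2, e1, e2):
    """w . Phi(f1 f2, e1 e2) using only word-pair indicator features (an assumption)."""
    return w.get((f1, e1), 0.0) + w.get((f2, e2), 0.0)


def sample_target_bigrams(lm, n, rng):
    """Draw the sample set S of n target bigrams from the language model p(e1 e2)."""
    bigrams = list(lm)
    return rng.choices(bigrams, weights=[lm[b] for b in bigrams], k=n)


def approx_p_f1_given_f2(w, f2, v_f, samples):
    """Monte Carlo estimate of p(f1 | f2); costs O(V N) for one conditional."""
    unnorm = {
        f1: sum(math.exp(score(w, f1, f2, e1, e2)) for e1, e2 in samples)
        for f1 in v_f
    }
    z_s = sum(unnorm.values())  # normalizer over all candidate f1
    return {f1: u / z_s for f1, u in unnorm.items()}


toy_lm = {("the", "house"): 0.6, ("the", "the"): 0.15,
          ("house", "house"): 0.15, ("house", "the"): 0.1}
toy_samples = sample_target_bigrams(toy_lm, 50, random.Random(0))
toy_cond = approx_p_f1_given_f2({("la", "the"): 2.0}, "casa", ["la", "casa"], toy_samples)
```

Each of the $N$ Gibbs steps evaluates one such conditional at a cost of $O(VN)$, which is consistent with the $O(VN^2)$ total stated above.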

### 5.2 Independent Metropolis Hastings (IMH)

In our experiments, Gibbs sampling for our log-linear model was still somewhat slow and would not scale well to larger experimental settings. To address this challenge, we apply IMH sampling, which relies on a proposal distribution and does not require normalization. However, finding an appropriate proposal distribution can sometimes be challenging, as it needs to be close to the true distribution for faster mixing and must be easy to sample from.

For the forced expectation, one possibility is to use the bigram language model *p*(*e*_{1}*e*_{2}) as a proposal distribution. However, the bigram language model did not work well in practice. Because *p*(*e*_{1}*e*_{2}) does not depend on *f*_{1}*f*_{2}, it resulted in slow mixing and exhibited a bias toward highly frequent target words.

Instead, we use an approximation of $p(e_1 e_2 \mid f_1 f_2)$ as our proposal distribution. To simplify sampling, we assume $e_1$ and $e_2$ to be independent of each other for any given $f_1 f_2$. Therefore, the proposal distribution is $q(e_1 e_2 \mid f_1 f_2) = q_u(e_1 \mid f_1)\, q_u(e_2 \mid f_2)$, where $q_u(e \mid f)$ is a probability distribution over target unigrams for a given source unigram. We define $q_u(e \mid f)$ as follows:

$$q_u(e \mid f) = p_b \, \frac{1}{\lvert V_E\rvert} + (1 - p_b)\, q_s(e \mid f),$$

where $p_b$ is a small back-off probability with which we fall back to the uniform distribution over target unigrams. The other term $q_s(e \mid f)$ is a distribution over the target words $e$ for which the weight $w_{f,e}$ of the word pair feature $\phi_{f,e}$ is non-zero:

$$q_s(e \mid f) = \begin{cases} \exp(w_{f,e}) / Z_{imh} & \text{if } w_{f,e} \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$

Here, $Z_{imh}$ is a normalization term over all the $e$ such that $w_{f,e} \neq 0$. The weight vector $\mathbf{w}$ is sparse, as only a small number of translation features $(f, e)$ (Section 7) are observed during sampling. Furthermore, we update $q_s$ only once every five iterations of gradient descent.

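For each $f_1 f_2 \in \mathcal{F}$, we take the following steps during sampling:

- Start with an initial English bigram $\langle e_1 e_2 \rangle^{0}$.
- Let the current sample be $\langle e_1 e_2 \rangle^{i}$. Next, draw a candidate for $\langle e_1 e_2 \rangle^{i+1}$ from the proposal distribution $q(e_1 e_2 \mid f_1 f_2)$.
- Accept the candidate with probability

  $$\min\left(1, \frac{p(\langle e_1 e_2 \rangle^{i+1} \mid f_1 f_2)\; q(\langle e_1 e_2 \rangle^{i} \mid f_1 f_2)}{p(\langle e_1 e_2 \rangle^{i} \mid f_1 f_2)\; q(\langle e_1 e_2 \rangle^{i+1} \mid f_1 f_2)}\right);$$

  otherwise, keep $\langle e_1 e_2 \rangle^{i}$ as the next sample.

The computational complexity of estimating the forced expectation with IMH is $O(N_F N)$,^{1} which is significantly less than the complexity of $O(N_F V N)$ in the case of Gibbs sampling. However, we could not apply IMH while estimating the full expectation, as finding a suitable proposal distribution is more complicated. Therefore, the overall complexity remains $O(N_F N + V N^2)$.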
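A minimal sketch of IMH sampling for one source bigram is shown below. It assumes that the proposal mixes a uniform back-off (probability `p_b`) with a distribution over target words that have a non-zero pair weight, and it uses the standard Metropolis–Hastings acceptance ratio for an independence proposal; the function names and the restriction to word-pair features are illustrative choices.

```python
import math
import random


def make_q_s(w, f, v_e):
    """q_s(e|f): normalized exp(w_{f,e}) over target words with non-zero pair weight."""
    support = {e: math.exp(w[(f, e)]) for e in v_e if w.get((f, e), 0.0) != 0.0}
    z_imh = sum(support.values())
    return {e: v / z_imh for e, v in support.items()}


def q_u(e, q_s, v_e, p_b):
    """Mixture of a uniform back-off and q_s; purely uniform if q_s is empty."""
    if not q_s:
        return 1.0 / len(v_e)
    return p_b / len(v_e) + (1.0 - p_b) * q_s.get(e, 0.0)


def propose(q_s, v_e, p_b, rng):
    """Draw one target word from q_u without touching the whole vocabulary."""
    if not q_s or rng.random() < p_b:
        return rng.choice(v_e)
    words = list(q_s)
    return rng.choices(words, weights=[q_s[e] for e in words])[0]


def imh_samples(f1, f2, w, lm, v_e, n, rng, p_b=0.1):
    qs1, qs2 = make_q_s(w, f1, v_e), make_q_s(w, f2, v_e)

    def target(e1, e2):  # unnormalized p(e1 e2 | f1 f2)
        return math.exp(w.get((f1, e1), 0.0) + w.get((f2, e2), 0.0)) * lm.get((e1, e2), 0.0)

    def proposal(e1, e2):
        return q_u(e1, qs1, v_e, p_b) * q_u(e2, qs2, v_e, p_b)

    cur = (propose(qs1, v_e, p_b, rng), propose(qs2, v_e, p_b, rng))
    samples = []
    for _ in range(n):
        new = (propose(qs1, v_e, p_b, rng), propose(qs2, v_e, p_b, rng))
        num = target(*new) * proposal(*cur)
        den = target(*cur) * proposal(*new)
        # Accept with probability min(1, num / den); always leave a zero-probability state.
        if den == 0.0 or rng.random() < num / den:
            cur = new
        samples.append(cur)
    return samples


toy_w = {("la", "the"): 2.0, ("casa", "house"): 2.0}
toy_lm = {("the", "house"): 0.6, ("the", "the"): 0.15,
          ("house", "house"): 0.15, ("house", "the"): 0.1}
toy_chain = imh_samples("la", "casa", toy_w, toy_lm, ["the", "house"], 1000, random.Random(0))
```

Because only the ratio of unnormalized target probabilities is needed, no step enumerates the target vocabulary, which is what removes the factor of $V$ relative to Gibbs sampling.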

### 5.3 Contrastive Divergence-Based Sampling

The main reason for the slow training of the proposed log-linear MRF model is the high computational cost of estimating the partition function *Z*_{g} when estimating the full expectation. A similar problem arises while training deep neural networks. An increasingly popular technique to address this issue is to perform contrastive divergence (Hinton 2002), which allows us to avoid estimating the partition function.

For each observed source bigram $f_1 f_2 \in \mathcal{F}$, contrastive divergence sampling works as follows:

- Sample a target bigram $e_1 e_2$ according to the distribution $p(e_1 e_2 \mid f_1 f_2)$. We perform this step using IMH, as discussed in the previous section.
- Sample a reconstructed source bigram $\langle f_1 f_2 \rangle^{recon}$ by sampling from the distribution $p(f_1 f_2 \mid e_1 e_2)$, again via IMH.
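The two sampling steps can be combined into a weight update as in the following sketch. Two simplifications are assumptions of this example rather than statements about the system described here: both samples are drawn by exact enumeration over a toy vocabulary instead of IMH, and the update is the standard one-step contrastive divergence rule (features of the observed bigram minus features of the reconstruction), restricted to word-pair indicator features.

```python
import math
import random
from itertools import product


def cd_step(w, f1, f2, lm, v_f, v_e, rng, lr=0.1):
    """One contrastive-divergence update of pair weights w[(f, e)] for an observed bigram."""
    # Step 1: sample e1 e2 ~ p(e1 e2 | f1 f2). Exact enumeration stands in for IMH here.
    pairs = list(product(v_e, repeat=2))
    post = [
        math.exp(w.get((f1, e1), 0.0) + w.get((f2, e2), 0.0)) * lm.get((e1, e2), 0.0)
        for e1, e2 in pairs
    ]
    e1, e2 = rng.choices(pairs, weights=post)[0]
    # Step 2: sample the reconstruction ~ p(f1 f2 | e1 e2); given e1 e2 it factorizes
    # into one independent draw per position.
    r1 = rng.choices(v_f, weights=[math.exp(w.get((f, e1), 0.0)) for f in v_f])[0]
    r2 = rng.choices(v_f, weights=[math.exp(w.get((f, e2), 0.0)) for f in v_f])[0]
    # Step 3 (assumed update rule): raise the features of the observed data and
    # lower the features of the reconstruction.
    for f_obs, f_rec, e in ((f1, r1, e1), (f2, r2, e2)):
        w[(f_obs, e)] = w.get((f_obs, e), 0.0) + lr
        w[(f_rec, e)] = w.get((f_rec, e), 0.0) - lr
    return w


toy_lm = {("the", "house"): 0.6, ("the", "the"): 0.15,
          ("house", "house"): 0.15, ("house", "the"): 0.1}
toy_rng = random.Random(0)
toy_w = {}
for _ in range(200):
    cd_step(toy_w, "la", "casa", toy_lm, ["la", "casa"], ["the", "house"], toy_rng)
```

No term in the update depends on the global partition function: only the sampled target bigram and the reconstructed source bigram are touched, so each update costs a constant number of feature evaluations once the samples are available.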

### 5.4 Sentence-Level Sampling

Up to this point, we have considered parallel source/target bigram pairs in isolation, but it may be helpful to take larger contexts into account in decipherment. In this section, we extend the sampling procedures to resample an entire source/target sentence pair at each iteration. Although our features are functions of individual bigrams, sentence-level sampling gives us the benefit of looking at an individual word’s left and right context when considering alternative translations. More generally, the HMM-like feature structure also allows information to flow through the entire sentence from beginning to end.

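Recall that each unigram-level feature vector is a function $\phi(f, e)$ of a pair of French and English words. We use the notation

$$\mathbf{\Phi}(\mathbf{f}, \mathbf{e}) = \sum_{i} \phi(f_i, e_i)$$

to denote the feature vector for an entire sentence pair; we will assume that the French and English sequences have the same length. Analogously to Equation (4), the gradient of the log-likelihood can be written as the difference between a forced expectation and full expectation, now at the level of sentences rather than bigrams:

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbb{E}_{\mathbf{e} \mid \mathbf{f}}\left[\mathbf{\Phi}(\mathbf{f}, \mathbf{e})\right] - \mathbb{E}_{\mathbf{f}, \mathbf{e}}\left[\mathbf{\Phi}(\mathbf{f}, \mathbf{e})\right].$$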

We estimate these two terms with a sentence-level sampling algorithm based on contrastive divergence. At a high level, given an observed French sentence, it samples a hidden English sequence according to *p*(**e**|**f**) in order to estimate the forced expectation term of the update, and then samples a French sentence according to *p*(**f**|**e**) to estimate the full expectation, as shown in Algorithm 1. However, because the individual English words are not independent, due to the bigram language model, the sampling of *p*(**e**|**f**) is itself broken down into a sequence of Gibbs sampling steps, sampling one word at a time while holding the others fixed, as shown in Algorithm 2. This process is iterated to produce a total of *N* samples of the English sequence, with each sample initialized with the previous sample (line 4 of Algorithm 1). The entire process is initialized with a Viterbi decoding of the best English sequence under the current parameters (line 2 of Algorithm 1). Empirically, we found that this initialization sped up training by reducing the number of samples necessary.
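The word-by-word resampling of the English sequence can be sketched as a single Gibbs sweep. Since Algorithms 1 and 2 are not reproduced in this section, the details below are assumptions: each word is scored by its pair weight together with the bigram language model terms for its left and right neighbors, `lm[(a, b)]` is taken to hold $p(b \mid a)$, and unseen bigrams receive a small floor probability.

```python
import math
import random

FLOOR = 1e-6  # assumed probability for target bigrams missing from the language model


def gibbs_sweep(e_seq, f_seq, w, lm, v_e, rng):
    """Resample each e_i in turn, holding the rest of the sentence fixed."""
    e_seq = list(e_seq)
    for i, f in enumerate(f_seq):
        weights = []
        for cand in v_e:
            p = math.exp(w.get((f, cand), 0.0))  # word-pair feature weight
            if i > 0:  # left context: p(cand | previous word)
                p *= lm.get((e_seq[i - 1], cand), FLOOR)
            if i + 1 < len(e_seq):  # right context: p(next word | cand)
                p *= lm.get((cand, e_seq[i + 1]), FLOOR)
            weights.append(p)
        e_seq[i] = rng.choices(v_e, weights=weights)[0]
    return e_seq


toy_w = {("la", "the"): 50.0, ("casa", "house"): 50.0, ("azul", "blue"): 50.0}
toy_sentence = gibbs_sweep(
    ["blue", "blue", "blue"], ["la", "casa", "azul"],
    toy_w, {}, ["the", "house", "blue"], random.Random(0))
```

Repeating such sweeps, with each new sample initialized from the previous one, yields the sequence of samples used to estimate the forced expectation; a Viterbi decoding could replace the arbitrary starting sequence used in this toy call.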

## 6. Exploiting Parallel Data

We now turn to consider the setting in which a small amount of parallel data may be available for the two languages in question, along with a larger amount of monolingual data for each of the languages. Our hope is that even a small amount of parallel data may allow the model to learn the correspondence between very frequent words, such as function words. For many language pairs, including French–English, function words do not exhibit orthographic similarity, despite the high proportion of orthographically similar content words. Reducing errors among function words that are observed in the parallel data may help prevent the decipherment model from aligning words that are spuriously similar “false friends” when analyzing the monolingual data.

Mathematically, we wish to define a single probability model that can apply to both parallel and monolingual data, and choose feature weights **w** that optimize the total likelihood of the parallel and monolingual data together. Probability models for word alignment of parallel data are one of the original problems studied in statistical machine translation (Brown et al. 1993). We wish to apply our log-linear feature-based model to parallel data, making the problem setting similar to that of Dyer et al. (2011). For simplicity, we assume a bag of words model that does not take into account the order of the words in the English sentence, resulting in a log-linear feature-based version of IBM Model 1.

We implement training for this model by modifying our sentence-level contrastive divergence method described in Section 5.4. We constrain the sampling of the English words **e**_{forced} used to approximate the forced expectation term by allowing only English words that appear in the English side of the parallel sentence pair. We sample a separate sequence of English words **e** for the full expectation term as before. The algorithm for parallel data is shown in Algorithm 3, where the English side of the parallel sentence pair is provided as an additional argument. This set of words is used to constrain the choices of the Viterbi initialization of **e**_{forced} (line 2). The observed English sentence is also used to constrain the choice of sample in Algorithm 4; an indicator function on whether *e*_{i} occurs in the observed English sentence ensures that any English words not present in that sentence have zero probability of being sampled.

Although our algorithm does not take into account the order of the observed English sentence, we note that, unlike the training procedure for IBM Model 1, our algorithm does take advantage of the English bigram language model in constructing the alignment between the English and French sentences. Thus, whereas an English word pair is not more likely to align to adjacent French words if it is adjacent in the English sentence, it is more likely to align to adjacent French words if the English words are frequently adjacent in general; the motivation is that, as mentioned in Section 1, *n*-gram frequencies between the two languages are assumed to be similar. This is beneficial both because it provides the model with more information than is available to IBM Model 1, and because it allows us to use a unified probability model for parallel and monolingual data.

## 7. Feature Design

We included the following unigram-level features:

- *Translation Features:* Each $(f, e)$ word pair, where $f \in V_F$ and $e \in V_E$, is a potential feature in our model. Although there are $O(V^2)$ such possible features, we only include the ones that are observed during sampling. Therefore, our feature weights vector $\mathbf{w}$ is sparse, with most of the entries equal to zero.
- *Orthographic Features:* We incorporated an orthographic feature based on the normalized edit distance between two words. The normalized edit distance between a word pair $(f, e)$ is defined as follows:

  $$NED(f, e) = \frac{ED(f, e)}{\max(\lvert f\rvert, \lvert e\rvert)},$$

  where $ED(f, e)$ is their string edit distance (minimum total number of required insertions, deletions, and substitutions) and $\lvert e\rvert$ and $\lvert f\rvert$ represent their lengths. When the normalized edit distance between two words is larger than a threshold, it usually indicates that the words are orthographically dissimilar, and the exact value of the normalized edit distance does not carry much information. Based on this intuition, we chose our orthographic features to be Boolean-valued features. For a word pair $(f, e)$, the orthographic feature is triggered if the normalized edit distance $NED(f, e)$ is less than a threshold (set to 0.3 in our experiments).
- *Length Difference:* Because source words and their target translations often tend to have similar lengths, we added the absolute value of their length difference as a feature.
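The three feature types can be sketched as a single sparse feature function. The sketch assumes that the edit distance is normalized by the length of the longer word; the feature names and the dictionary representation are likewise illustrative choices rather than details taken from the implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def ned(f, e):
    """Edit distance normalized by the longer word's length (assumed normalizer)."""
    longest = max(len(f), len(e))
    return edit_distance(f, e) / longest if longest else 0.0


def phi(f, e, threshold=0.3):
    """Sparse unigram-level features for a source/target word pair."""
    feats = {
        ("pair", f, e): 1.0,                          # translation feature
        "length_diff": float(abs(len(f) - len(e))),   # length difference feature
    }
    if ned(f, e) < threshold:
        feats["ortho"] = 1.0                          # Boolean orthographic feature
    return feats


cognate_feats = phi("parlement", "parliament")
unrelated_feats = phi("maison", "house")
```

For the pair *parlement*/*parliament*, the edit distance is 2 and the longer word has 10 characters, so the normalized distance of 0.2 falls below the 0.3 threshold and the orthographic feature fires; for *maison*/*house* it does not.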

The set of features can further be extended by including context window–based features (Haghighi, Berg-Kirkpatrick, and Klein 2008; Ravi 2013) and topic model and word embedding features. Character rewriting features could be used to model when the two languages use different characters for the same sound; these could be coupled with the edit distance feature to approximate phonetic distance. Additionally, in this work we did not perform any character normalization; a simple extension of this system could treat similar characters (é, e, è) as identical for edit distance calculations.

## 8. Experiments and Results

### 8.1 Data Sets

We experimented with two closely related language pairs: (1) Spanish and English and (2) French and English. For Spanish–English, we experimented with a subset of the OPUS Subtitle corpus (Tiedemann 2009). For French–English, we used the Hansard corpus (Brown, Lai, and Mercer 1991), containing parallel French and English text from the proceedings of the Canadian Parliament. In order to have a non-parallel set-up, we extracted monolingual text from different sections of the French and English text. A detailed description of the two data sets is now provided.

**OPUS Subtitle Data set:** The OPUS data set is a smaller pre-processed subset of the original larger OPUS Spanish–English parallel corpora. The data set consists of short sentences in Spanish and English, each of which is a movie subtitle. The same data set has been used in several previous decipherment experiments (Ravi and Knight 2011; Nuhn and Ney 2014). We use the first 9,885 Spanish sentences and the second 9,885 English sentences.

**Hansard Data set:** The Hansard data set contains parallel text from the Canadian Parliament Proceedings. We experimented with two data sets:

- **Hansard-100:** The French text consists of the first 100 sentences and the English text consists of the second 100 sentences.
- **Hansard-1000:** The French text consists of the first 1,000 sentences and the English text consists of the second 1,000 sentences.

Table 3 provides some statistics on the three data sets used in our experiments. The OPUS and Hansard-100 data sets have relatively smaller vocabularies, whereas the Hansard-1000 data set has a significantly larger vocabulary.

| Data set | Num. Sentences | \|V_{E}\| | \|V_{F}\| |
|---|---|---|---|
| OPUS | 9.89K (997 unique) | 375 | 530 |
| Hansard-100 | 100 | 358 | 371 |
| Hansard-1000 | 1,000 | 2,957 | 3,082 |


For each data set, we draw parallel data from a section that is disjoint from the monolingual sections. These data are used only in the “X% Parallel” settings, for which X% of the total data is drawn from the parallel section instead of the monolingual sections; for example, the “10% Parallel” setting for Hansard-1000 consists of 900 monolingual sentences in each language and 100 parallel sentence pairs.
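The construction of these splits can be illustrated with a short Python sketch. It assumes a sentence-aligned corpus given as two lists of equal length; the function name `build_split` and the choice to take the parallel section immediately after the two monolingual blocks are hypothetical, since the text only requires that the sections be disjoint.

```python
def build_split(src_sents, tgt_sents, n, parallel_fraction=0.0):
    """Build non-parallel monolingual sets plus an optional parallel set.

    src_sents[i] and tgt_sents[i] are assumed to be translations of each other.
    Returns (monolingual source, monolingual target, list of parallel pairs).
    """
    n_par = round(n * parallel_fraction)  # sentences drawn from the parallel section
    n_mono = n - n_par                    # monolingual sentences per language
    needed = 2 * n_mono + n_par
    if min(len(src_sents), len(tgt_sents)) < needed:
        raise ValueError(f"need at least {needed} aligned sentences")

    mono_src = src_sents[:n_mono]            # first block: source side only
    mono_tgt = tgt_sents[n_mono:2 * n_mono]  # second block: target side only
    start = 2 * n_mono                       # assumption: parallel block comes next
    parallel = list(zip(src_sents[start:start + n_par],
                        tgt_sents[start:start + n_par]))
    return mono_src, mono_tgt, parallel
```

With `n=1000` and `parallel_fraction=0.1`, this yields 900 monolingual sentences per language and 100 parallel pairs, matching the “10% Parallel” example for Hansard-1000.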

### 8.2 Evaluation

We evaluate the accuracy of decipherment by the percentage of source words that are mapped to the correct target translation. We find the maximum-probability mapping for all source words; precision could be increased at the expense of recall by imposing some threshold, below which no mapping would be made for a given source word. The correct translation for each source word was determined automatically using the Google Translation API. Although the Google Translation API did a fair job of translating the French and Spanish words to English, it returned only a single target translation. We noticed occasional cases where the decipherment algorithm retrieved a correct translation but did not get credit because it did not match the single translation returned by the API.
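The accuracy metric, together with the optional probability threshold mentioned above, can be sketched as follows. The nested-dictionary layout of the translation table, the function name, and the `min_prob` parameter are illustrative assumptions, not part of the evaluated system.

```python
def decipherment_accuracy(trans_prob, gold, min_prob=0.0):
    """Score maximum-probability mappings against single reference translations.

    trans_prob: dict mapping source word f -> dict of target word e -> probability.
    gold: dict mapping source word f -> its one reference translation.
    Returns (accuracy %, precision %); accuracy is computed over all gold words,
    precision only over words for which a mapping was actually proposed.
    """
    correct = 0
    attempted = 0
    for f, reference in gold.items():
        candidates = trans_prob.get(f, {})
        if not candidates:
            continue
        best_e, best_p = max(candidates.items(), key=lambda kv: kv[1])
        if best_p < min_prob:
            continue  # abstain: no mapping is made below the threshold
        attempted += 1
        correct += int(best_e == reference)
    accuracy = 100.0 * correct / len(gold) if gold else 0.0
    precision = 100.0 * correct / attempted if attempted else 0.0
    return accuracy, precision
```

With `min_prob=0.0` every source word with candidates is mapped, which corresponds to the setting reported here; raising `min_prob` trades recall for precision.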

Additionally, we performed Viterbi decoding on the sentences in a small held-out test corpus from the OPUS data set, and compared the BLEU scores with the results previously published on the same test set by Ravi and Knight (2011) and Nuhn and Ney (2014). Our training set, however, was different: Their data were parallel, so we split the data set into two disjoint sections, one for each language. This reduced our model’s performance (as expected), but we still achieve a higher BLEU score than the baselines.

### 8.3 Results

We experimented with three versions of our log-linear MRF decipherment models: (1) Gibbs sampling, (2) IMH sampling, and (3) contrastive divergence (CD). We also tested the effect of exploiting parallel data under the CD model. To determine the impact of the orthographic and length features, the contrastive divergence–based log-linear model was tested both with and without these features. In addition to the proposed undirected MRF models, we also explored the directed Feature-HMM model (Berg-Kirkpatrick et al. 2010), which is trained via an EM-style algorithm and has the same computational complexity as EM. We compared the feature-based models with the exact EM algorithm (Koehn and Knight 2000; Ravi and Knight 2011). We used Kneser-Ney smoothing (Kneser and Ney 1995) for training bigram language models. The number of iterations was fixed to 15 for all five methods; we did not see improvement beyond roughly 10 iterations during development. For the sampling-based methods, we set the number of samples *N* = 50, which seemed to strike a good balance between accuracy and speed in small-scale development experiments.

For the log-linear model with no orthographic/length features, we initialized all the feature weights to zero. When we included the orthographic features, we initialized the weight of the orthographic match feature to 1.0 to encourage translation pairs with high orthographic similarity. Furthermore, for each word pair (*f*, *e*) with high orthographic similarity, we assigned a small positive weight (0.1) to its translation feature. This initialization allowed the proposal distribution to sample orthographically similar target words for each source word. The value 0.1 seemed to work well in initial small-scale experiments. For exact EM, we initialized the translation probabilities uniformly and stored the entire probability table.
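The initialization scheme can be sketched as follows. The feature keys and the `is_ortho_match` predicate (for instance, a test of whether the normalized edit distance is below 0.3) are hypothetical names, and reading the 0.1 weight as applying to the pair's translation feature is our assumption.

```python
from collections import defaultdict


def init_weights(src_vocab, tgt_vocab, is_ortho_match, use_ortho=True):
    """Sparse weight vector; any feature not stored has weight 0.0."""
    weights = defaultdict(float)
    if use_ortho:
        weights["ortho_match"] = 1.0  # favor orthographically similar pairs
        # Quadratic scan over the vocabularies; acceptable for a one-time setup.
        for f in src_vocab:
            for e in tgt_vocab:
                if is_ortho_match(f, e):
                    weights[("trans", f, e)] = 0.1  # small positive head start
    return weights


def init_em_table(src_vocab, tgt_vocab):
    """Dense uniform translation table, as used to initialize exact EM."""
    uniform = 1.0 / len(tgt_vocab)
    return {f: {e: uniform for e in tgt_vocab} for f in src_vocab}
```

The contrast between the two functions reflects the memory trade-off described above: the log-linear model stores only the features it touches, whereas exact EM stores the full table of source and target word pairs.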

Table 4 reports the accuracy and running time per iteration for exact EM, Feature HMM, and our log-linear models on the OPUS, Hansard-100, and Hansard-1000 data sets. However, on the Hansard-1000 data set, we only applied the contrastive divergence and IMH-based log-linear models because of its large vocabulary size. Table 2 summarizes the computational complexity of each method; recall that *N*_{F} scales with *V* (theoretically *N*_{F} ∈ *O*(*V*^{2}), although empirically we found *N*_{F} ∈ *O*(*V*)). From this, we can loosely estimate that, on Hansard-1000, Gibbs sampling would take roughly a week to execute 15 iterations, whereas the EM and Feature HMM methods would take roughly a month.

| Method | OPUS Time (sec) | OPUS Acc (%) | Hansard-100 Time (sec) | Hansard-100 Acc (%) | Hansard-1000 Time (sec) | Hansard-1000 Acc (%) |
|---|---|---|---|---|---|---|
| EM | 417.2 | 2.63 | 188.0 | 2.96 | – | – |
| Feature HMM | 379.6 | 7.71 | 189.9 | 14.17 | – | – |
| Log-linear + Gibbs | 738.1 | 6.77 | 357.9 | 14.01 | – | – |
| Log-linear + IMH | 75.7 | 6.77 | 53.0 | 13.10 | 605.5 | 12.45 |
| Log-linear + CD | 19.1 | 6.13 | 10.6 | 11.53 | 324.1 | 11.19 |
| Log-linear + CD, Sentence | 21.6 | 8.08 | 12.7 | 12.60 | 458.3 | 12.02 |
| Log-linear + CD, Sentence, No ortho/len | 22.4 | 0.56 | 11.9 | 1.88 | 492.2 | 0.36 |


Table 5 reports the accuracy for the methods that utilize parallel data on the three data sets. For comparison, the final two columns of Table 5 report the accuracy of IBM Model 1 and Model 4 (Brown et al. 1993) when trained on the parallel data used in the corresponding Hansard-1000 experiment; to allow for direct comparison, both were evaluated over the same vocabulary as in the Hansard-1000 experiment. Because our training procedure includes random sampling, the results of each run on a given data set can vary. We observe only very small variations between executions, but all reported results for sampling-based methods are the average of 10 separate executions of the system. A bigram language model was used for all the models.

| Accuracy (%) | OPUS | Hansard-100 | Hansard-1000 | IBM Model 1 | IBM Model 4 |
|---|---|---|---|---|---|
| Monolingual-Only | 8.08 | 12.60 | 12.02 | 0.00 | 0.00 |
| 10% Parallel | 9.45 | 16.81 | 13.58 | 6.74 | 7.29 |
| 20% Parallel | 10.81 | 17.62 | 13.41 | 9.67 | 10.75 |
| 50% Parallel | 15.04 | 18.83 | 18.55 | 20.03 | 20.40 |


The BLEU scores for translation on the OPUS data set are reported in Table 6. We outperform previous approaches on this data set that use no parallel data. Although we are not aware of any work on the OPUS data set using small amounts of parallel data, Zoph et al. (2016) describe one recent alternative approach to translation with very limited parallel data for Urdu–English. Their hybrid system using a string-to-tree statistical translation model combined with a neural model achieved a BLEU score of 19.1. This result utilized three times as much data as in our experiments; 100% of it was parallel, and the model was pre-trained with a much larger corpus of parallel French–English data.

| Method | BLEU (%) |
|---|---|
| EM (Ravi and Knight 2011) | 15.3 |
| EM + Beam (Nuhn and Ney 2014) | 15.7 |
| Feature HMM | 18.90 |
| Log-linear + Gibbs | 21.43 |
| Log-linear + IMH | 21.46 |
| Log-linear + CD | 21.36 |
| Log-linear + CD, Sentence | 20.71 |
| Log-linear + CD, Sentence, No ortho/len | 19.36 |
| Log-linear + CD, Sentence, 10% Parallel | 20.83 |
| Log-linear + CD, Sentence, 20% Parallel | 21.18 |
| Log-linear + CD, Sentence, 50% Parallel | 29.30 |


Table 7 shows a few examples for which the log-linear model performed better because of orthographic features.

| OPUS: Spanish | OPUS: English | Hansard-1000: French | Hansard-1000: English |
|---|---|---|---|
| excelente | excellent | criminel | criminal |
| minuto | minute | particulier | particular |
| silencio | silence | sociaux | social |
| perfecto | perfect | secteur | sector |


## 9. Discussion and Future Work

We notice that all the feature-based models (both the directed Feature-HMM and the undirected log-linear models) with orthographic and length features outperformed the EM-based decipherment approach. The only log-linear model that performed much worse was the one that lacked the orthographic and length features. This result emphasizes the importance of orthographic features for decipherment between closely related language pairs. The margin of improvement due to orthographic features was bigger for the Hansard data sets than for the OPUS data set. This is expected, as French has had a much larger historical influence on English than Spanish has, largely through the Norman Conquest; this is a major cause of the higher lexical similarity between French and English than between Spanish and English. Quantitatively, 42.72% of the pairs in our English–French gold dictionary were within the normalized edit distance threshold used for our corresponding feature, whereas only 20.97% of the English–Spanish pairs were. The contrastive divergence–based log-linear model achieved overall comparable accuracy to the two other sampling approaches (Gibbs and IMH + Gibbs) while being much faster: on OPUS, its time per iteration was roughly 39 times lower than that of Gibbs sampling and roughly 4 times lower than that of IMH (Table 4). Sentence-level sampling was slightly slower, but achieved higher accuracy than bigram-only sampling. Furthermore, the feature-based models resulted in better translations, as they obtained a higher BLEU score on the OPUS data set (Table 6).

Although the orthographic features provide huge improvements in decipherment accuracy, they also introduce new errors. For example, the Spanish word “madre” means “mother” in English, but our model gave the highest score to the English word “made” due to the high orthographic similarity. However, such error cases are rare compared with the improvement.

The contrastive divergence model that was modified to incorporate parallel data generally showed significant gains in accuracy as the proportion of parallel data was increased. This is expected: Parallel data provide a stronger signal for translation than monolingual data. We notice that, even when parallel data are provided, the model still learns additional information from the monolingual data. This is illustrated in Table 5, where we compare IBM Model 1 and Model 4 (which can only make use of parallel data) against our model and observe that more correct word translations are learned when additional monolingual data is provided. The exception to this trend is in the 50% Parallel setting, where the addition of monolingual data results in fewer correct translations. This may be because monolingual data is a noisy signal for translation, and incorporating too little with the parallel data actually confuses the model. Given that our technique is most applicable for low- and no-resource languages, for which having 50% parallel data is less realistic, we do not believe that this is a serious concern.

The accuracies reported here are significantly lower than those achieved by modern supervised methods (and unsupervised methods with large corpora). However, our results required no more than 1,000 lines of data from each language, and preserved accuracy with as little as 100 lines of data. Thus, this method in its current form is most applicable to languages with extremely limited available data. This can include “lost” languages and any of the numerous modern languages that do not have much data easily accessible online. Our model is also very scalable, and can be applied to settings with more data than we experiment with here but still insufficient data for modern embedding-based unsupervised methods.

For understudied languages, our system can also be used to *infer* the similarity of two languages. The final weight of the edit distance feature can be interpreted as the model’s estimate of similarity. In our experiments, the edit distance weight for the Hansard experiments was roughly four times that of the OPUS experiments, which matches our expectations, given the increased lexical similarity between French and English.

For future work, our feature-based models can be extended by allowing local reordering of neighboring words and considering word fertilities (Ravi 2013). We would also like to extend the features to handle languages with different alphabets or systematically different use of certain characters, perhaps using transliteration techniques such as in Knight and Graehl (1998). Finally, we would like to incorporate more flexible non-local features in the MRF, which may not be supported by the directed Feature-HMM model.

## 10. Conclusion

We presented a feature-based decipherment system using latent variable log-linear models. The proposed models take advantage of the orthographic similarities between closely related languages, and outperform the existing EM-based models. The contrastive divergence–based variant with sentence-level sampling provided the best trade-off between speed and accuracy. We also showed that it can be modified to incorporate parallel data when available, resulting in increased accuracy.

## Acknowledgments

We are grateful to the anonymous reviewers for suggesting useful additions. This research was supported by a Google Faculty award and NSF grant 1449828.

## Note

Ignoring the cost of estimating *q*_{s}(*e*|*f*), which occurs only once every five iterations.

## References

*EM*algorithm