Abstract
Data privacy is an important issue for “machine learning as a service” providers. We focus on the problem of membership inference attacks: Given a data sample and black-box access to a model’s API, determine whether the sample was part of the model’s training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications such as machine translation and video captioning. We define the membership inference problem for sequence generation, provide an open dataset based on state-of-the-art machine translation models, and report initial results on whether these models leak private information under several kinds of membership inference attacks.
1 Motivation
There are many situations in which private entities are worried about the privacy of their data. For example, many companies provide black-box training services where users are able to upload their data and have customized models built for them, without requiring machine learning expertise. A common concern in these “machine learning as a service” offerings is that the uploaded data be visible only to the client that owns it.
Currently, these entities are in the position of having to trust that service providers abide by the terms of their agreements. Although trust is an important component in relationships of all kinds, it has its limitations. In particular, it falls short of a well-known security maxim, originating in a Russian proverb that translates as, Trust, but verify.1 Ideally, customers would be able to verify that their private data was not being slurped up by the serving company, whether by design or accident.
This problem has been formalized as the membership inference problem, first introduced by Shokri et al. (2017) and defined as: “Given a machine learning model and a record, determine whether this record was used as part of the model’s training dataset or not.” The problem can be tackled in an adversarial framework: The attacker is interested in answering this question with high accuracy, whereas the defender would like this question to be unanswerable (see Figure 1). Since then, researchers have proposed many ways to attack and defend the privacy of various types of models. However, the work so far has only focused on standard classification problems, where the output space of the model is a fixed set of labels.
In this paper, we propose to investigate membership inference for sequence generation problems, where the output space can be viewed as a chained sequence of classifications. Prime examples of sequence generation include machine translation and text summarization: In these problems, the output is a sequence of words whose length is undetermined a priori. Other examples include speech synthesis and video caption generation. Sequence generation problems are more complex than classification problems, and it is unclear whether the methods and results developed for membership inference in classification problems will transfer. For example, one might imagine that whereas a flat classification model might leak private information when the output is a single label, a recurrent sequence generation model might obfuscate this leakage when labels are generated successively with complex dependencies.
We focus on machine translation (MT) as the example sequence generation problem. Recent advances in neural sequence-to-sequence models have improved the quality of MT systems significantly, and many commercial service providers are deploying these models via public APIs. We pose the main question in the following form:
Given black-box access to an MT model, is it possible to determine whether a particular sentence pair was in the training set for that model?
In the following, we define membership inference for sequence generation problems (§2) and contrast with prior work on classification (§3). Next we present a novel dataset (§4) based on state-of-the-art MT models.2 Finally, we propose several attack methods (§5) and present a series of experiments evaluating their ability to answer the membership inference question (§6). Our conclusion is that simple one-off attacks based on shadow models, which proved successful in classification problems, are not successful on sequence generation problems; this is a result that favors the defender. Nevertheless, we describe the specific conditions where sequence-to-sequence models still leak private information, and discuss the possibility of more powerful attacks (§7).
2 Problem Definition
We now define the membership inference attack problem for sequence-to-sequence models in detail. Following tradition in the security research literature, we introduce three characters:
Alice (the service provider)
builds a sequence-to-sequence model based on an undisclosed dataset and provides a public API. For MT, this API takes a foreign sentence f as input and returns an English translation ê.
Bob (the attacker)
is given a set of probe sentence pairs (f, e) and wants to determine, for each probe, whether it was included in the undisclosed dataset used to train Alice’s model. For each probe, Bob can send f to Alice’s API and obtain its translation ê (see the sketch following this list). We term in-probes to be those probes whose true class is in, and out-probes to be those whose true class is out. Importantly, note that Bob has access not only to f but also to e in the probe. Intuitively, if ê is equivalent to e, then Bob may believe that the probe was contained in Alice’s training data; however, it may also be possible that Alice’s model generalizes well to new samples and translates this probe correctly. The challenge for Bob is to make this distinction; the challenge for Alice is to prevent Bob from doing so.
Carol (the neutral third-party)
is in charge of setting up the experiment between Alice and Bob. She decides which data samples should be used as in-probes and out-probes and evaluates Bob’s classification accuracy. Carol is introduced only to clarify the exposition and to set up a fair experiment for research purposes. In practical scenarios, Carol does not exist: Bob decides his own probes, and Alice decides her own training data.
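To make these roles concrete, the following sketch (hypothetical names, not from the paper) captures the two interfaces involved: Alice exposes a black-box translation API, and Bob must implement a classifier over (f, e, ê) triples.

```python
from typing import Protocol

class AliceAPI(Protocol):
    """Black-box MT service: foreign sentence in, English translation out."""
    def translate(self, f: str) -> str: ...

class MembershipClassifier(Protocol):
    """Bob's attack: given a probe (f, e) and Alice's output e_hat,
    predict whether the pair was in Alice's training data."""
    def classify(self, f: str, e: str, e_hat: str) -> str:
        """Returns 'in' or 'out'."""
        ...
```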
2.1 Detailed Specification
In order to be precise about how Carol sets up the experiment, we will explain in terms of machine translation, but note that the problem definition applies to any sequence-to-sequence problem. A training set for MT consists of a set of sentence pairs (f, e). We use a label d ∈ {ℓ1, ℓ2, …} to indicate the domain (the subcorpus or the data source), and an index i ∈ {1, 2, …, I(d)} to indicate the sample id in the domain (subcorpus). For example, the pair with d = ℓ1 and i = 1 might refer to the first sentence in the Europarl subcorpus, while the pair with d = ℓ2 and i = 1 might refer to the first sentence in the CommonCrawl subcorpus. I(d) is the maximum number of sentences in the subcorpus with label d. The distinction among subcorpora is not necessary in the abstract problem definition, but it is important in practice, where differences in data distribution may reveal signals about membership.
Note that both the in-probe and out-probe sentence pairs come from the same subcorpus; the only difference is that the former are included in Alice’s training data whereas the latter are not.
Note that Bob’s dataset ℬall is like Alice’s training data but with two exceptions: all samples from subcorpus ℓ2 and all probe samples are discarded. One can view ℓ2 as Alice’s own in-house corpus, which Bob has no knowledge of or access to, and ℓ1 as the shared corpora where membership inference attacks are performed.
To summarize, Carol gives the training set to Alice, who uses it in whatever way she chooses to build a sequence-to-sequence model. The model is trained with hyperparameters Θ (e.g., neural network architecture) known only to Alice. In parallel, Carol gives ℬall to Bob, who uses it to design various attack strategies, resulting in a classifier g(⋅) (see Section 5). When it is time for evaluation, Carol provides both the in-probes and the out-probes to Bob in randomized order and asks Bob to classify each sample as in or out. For each probe (f, e), Bob is allowed to make one call to Alice’s API to obtain ê.
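A minimal sketch of Carol’s evaluation protocol, assuming the hypothetical interfaces above: probes are shuffled, Bob makes exactly one API call per probe, and Carol scores the binary predictions by accuracy.

```python
import random

def run_experiment(alice_api, bob_classifier, in_probes, out_probes, seed=0):
    """Carol's protocol: present labeled probes in randomized order, let Bob
    query Alice once per probe, and report Bob's classification accuracy."""
    probes = [(f, e, "in") for f, e in in_probes] + \
             [(f, e, "out") for f, e in out_probes]
    random.Random(seed).shuffle(probes)

    correct = 0
    for f, e, true_label in probes:
        e_hat = alice_api.translate(f)   # Bob's single allowed API call per probe
        correct += int(bob_classifier.classify(f, e, e_hat) == true_label)
    return correct / len(probes)
```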
Both the in-domain out-probes and the out-of-domain probes should be classified as out by Bob’s classifier. However, it is known that sequence-to-sequence models behave very differently on data from domains/genres that are significantly different from the training data (Koehn and Knowles, 2017). The goal of having two kinds of out-probes is to quantify the difficulty or ease of membership inference in different situations.
2.2 Summary and Alternative Definitions
Figure 2 summarizes the problem definition. The out-probes and the out-of-domain probes are by construction outside of Alice’s training data, while the in-probes are included. Bob’s goal is to produce a classifier that can make this distinction. He has at his disposal a smaller dataset ℬall, which he can use in whatever way he desires.
There are alternative definitions of this membership inference problem. For example, one can allow Bob to make multiple API calls to Alice’s model for each probe. This enlarges the repository of potential attack strategies for Bob. Or, one could evaluate Bob’s accuracy not on a per-sample basis, but at a coarser granularity where Bob can aggregate inferences over multiple samples. There is also a distinction between white-box and black-box attacks: We focus on the black-box case, where Bob has no access to the internal parameters of Alice’s model and can only guess at likely model architectures. In the white-box case, Bob would have access to Alice’s model internals, so different attacks would be possible (e.g., backpropagation of gradients). In these respects, our problem definition makes the problem more challenging for Bob the attacker.
Finally, note that Bob is not necessarily always the “bad guy”. Some examples of who Alice and Bob might be in MT are: (1) Organizations (Bob) that provide bitext data under license restrictions might be interested in determining whether their licenses are being complied with in published models (Alice). (2) The organizers (Bob) of an annual bakeoff (e.g., WMT) might wish to confirm that the participants (Alice) are following the rules of not training on test data. (3) “MT as a service” providers may support customized engines if users upload their own bitext training data. The provider promises that the user-supplied data will not be used in the customized engines of other users, and can play both Alice and Bob, attacking its own model to provide guarantees to the user. If it is possible to construct a successful membership inference mechanism, then many “good guys” would be able to provide the aforementioned fairness (1, 2) and privacy guarantees (3).
3 Related Work
Shokri et al. (2017) introduced the problem of membership inference attacks on machine learning models. They showed that with shadow models trained on either realistic or synthetic datasets, Bob can build classifiers that discriminate training-set members from non-members with high accuracy. They focus on classification problems such as CIFAR image recognition and demonstrate successful attacks on convolutional neural network models as well as on the models provided by Amazon ML.
Why do these attacks work? The main information exploited by Bob’s classifier is the output distribution of class labels returned by Alice’s API. The prediction uncertainty differs for data samples inside and outside the model training data, and this can be exploited. Shokri et al. (2017) propose defense strategies for Alice, such as restricting the prediction vector to the top-k classes, coarsening the values of the output probabilities, and increasing the entropy of the prediction vector. The crucial difference between their work and ours, besides our focus on sequence generation problems, is the availability of this kind of output distribution provided by Alice. Although it is common to provide the whole distribution of output probabilities in classification problems, this is not possible in sequence generation problems because the output space of sequences is exponential in the output length. At most, sequence models can provide a score for the output prediction ê, for example from a beam search procedure, but this is only a single, unnormalized number. We do experiment with having Bob exploit this score (Table 3), but it appears far inferior to the use of the whole distribution available in classification problems.
Subsequent work on membership inference has focused on different angles of the problem. Salem et al. (2018) investigated the effect of training the shadow model on datasets that match or do not match the distribution of Alice’s training data, and compared training a single shadow model as opposed to many. Truex et al. (2018) present a comprehensive evaluation of different model types, training data, and attack strategies. Borrowing ideas from adversarial learning and minimax games, Hayes et al. (2017) propose attack methods based on generative adversarial networks, while Nasr et al. (2018) provide adversarial regularization techniques for the defender. Nasr et al. (2019) extend the analysis to white-box attacks and a federated learning setting. Pyrgelis et al. (2018) provide an empirical study on location data. Veale et al. (2018) discuss membership inference and the related model inversion problem, in the context of data protection laws like GDPR.
Shokri et al. (2017) note a synergistic connection between the goals of learning and the goals of privacy in the case of membership inference: The goal of learning is to generalize to data outside the training set (e.g., so that out-probes are translated well), while the goal of privacy is to prevent leaking information about data in the training set. The common enemy of both goals is overfitting. Yeom et al. (2017) analyze how overfitting by Alice increases the risk of privacy leakage; Long et al. (2018) showed that even a well-generalized model carries such risks in classification problems, implying that overfitting by Alice is a sufficient but not a necessary condition for privacy leakage.
A large body of work exists in differential privacy (Dwork, 2008; Machanavajjhala et al., 2017). Differential privacy provides guarantees that a model trained on some dataset will produce statistically similar predictions as a model trained on another dataset that differs in exactly one sample. This is one way in which Alice can defend her model (Rahman et al., 2018), but note that differential privacy is a stronger notion and often involves a cost in Alice’s model accuracy. Membership inference assumes that the content of the data is known to Bob and is only concerned with whether it was used. Differential privacy also protects the content of the data (i.e., the actual words in (f, e) should not be inferable).
Song and Shmatikov (2019) explored the membership inference problem of natural language text, including word prediction and dialog generation. They assume that the attacker has access to a probability distribution or a sequence of distributions over the vocabulary for the generated word or sequence. This is different from our work where the attacker gets only the output sequence, which we believe is a more realistic setting.
4 Data and Evaluation Protocol
4.1 Data: Subcorpora and Splits
Based on the problem definition in Section 2, we construct a dataset to investigate the possibility of the membership inference attack on MT models. We make this dataset available to the public to encourage further research.4
There are various considerations to ensure the benchmark is fair for both Alice and Bob: We need a dataset that is large and diverse, so that Alice can train state-of-the-art MT models and Bob can test on probes from different domains. We used corpora from the Conference on Machine Translation (WMT18) (Bojar et al., 2018). We chose the German–English language pair because it has a reasonably large amount of training data, and previous work demonstrates high BLEU scores.
We now describe how Carol prepares the data for Alice and Bob. First, Carol selects four subcorpora for the training data of Alice, namely, CommonCrawl, Europarl v7, News Commentary v13, and Rapid 2016. A subset of these four subcorpora is also available to Bob (ℓ1 in § 2.1). In addition, Carol gives ParaCrawl to Alice but not Bob (ℓ2 in §2.1). We can think of it as in-house data that the service provider holds. For all these subcorpora, Carol first performs basic preprocessing: (a) tokenization of both the German and English sides using the Moses tokenizer, (b) de-duplication of sentence pairs so that only unique pairs are present, and (c) random shuffling of all sentences prior to splitting into probes and MT training data.5
Figure 3 illustrates how Carol splits the subcorpora for Alice and Bob. For each subcorpus, Carol creates an in-probe set and an out-probe set, along with Alice’s training data and Bob’s dataset ℬall. Carol sets k = 5,000, meaning each probe set per subcorpus has 5,000 samples. For each subcorpus, Carol selects 5,000 samples to create the out-probe set. She then uses the rest as Alice’s training data and selects 5,000 samples from it as the in-probe set. She excludes the in-probes and ParaCrawl from Alice’s training data to create a dataset for Bob, ℬall.6 In addition, Carol uses four other domains to create the out-of-domain probe sets, namely, EMEA and Subtitles 18 (Tiedemann, 2012), Koran (Tanzil), and TED (Duh, 2018). These subcorpora are equivalent to ℓ3 in § 2.1. Each out-of-domain probe set also has 5,000 samples per subcorpus, the same as the in-probe and out-probe sets. The number of samples for each set is summarized in Table 1.
Table 1: Number of samples in each set.

| Subcorpus | In-probe | Out-probe | Alice training | ℬall | OOD probe |
|---|---|---|---|---|---|
| ParaCrawl | 5,000 | 5,000 | 4,518,029 | 0 | N/A |
| CommonCrawl | 5,000 | 5,000 | 2,389,123 | 2,379,123 | N/A |
| Europarl | 5,000 | 5,000 | 1,865,271 | 1,855,271 | N/A |
| News | 5,000 | 5,000 | 273,702 | 263,702 | N/A |
| Rapid | 5,000 | 5,000 | 1,062,214 | 1,052,214 | N/A |
| EMEA | N/A | N/A | N/A | N/A | 5,000 |
| Koran | N/A | N/A | N/A | N/A | 5,000 |
| Subtitles | N/A | N/A | N/A | N/A | 5,000 |
| TED | N/A | N/A | N/A | N/A | 5,000 |
| TOTAL | 25,000 | 25,000 | 10,108,339 | 5,550,310 | 20,000 |
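The per-subcorpus split described above can be summarized with the following sketch (a simplification that carves out a single probe pair, whereas Carol actually prepares two pairs, which is why ℬall is 10k smaller per subcorpus; function and variable names are illustrative, not from the released dataset):

```python
import random

def split_subcorpus(pairs, k=5000, seed=0):
    """Carve one shuffled, de-duplicated subcorpus into an out-probe set,
    Alice's training portion, an in-probe set (a subset of Alice's portion),
    and Bob's share, which excludes the in-probe."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    out_probe = pairs[:k]            # never given to Alice
    alice_train = pairs[k:]          # everything else trains Alice's model
    in_probe = alice_train[:k]       # drawn from inside Alice's training data
    bob_data = alice_train[k:]       # Alice's portion minus the in-probe
    return in_probe, out_probe, alice_train, bob_data
```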
4.2 Alice MT Architecture
Alice uses her dataset (consisting of four subcorpora and ParaCrawl) to train her own MT model. Because ParaCrawl is noisy, Alice first applies dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018), retaining the top 4.5 million lines. Alice then trains a joint BPE subword model (Sennrich et al., 2016) using 32,000 merge operations. No recasing is applied.
4.3 Evaluation Protocol
If the accuracy is 50%, then the binary classification is the same as random guessing, and Alice is safe. An accuracy slightly above 50% can be considered a potential breach of privacy.
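Concretely, if Carol presents N probes with gold labels y_j ∈ {in, out}, Bob’s attack accuracy is simply the fraction of probes he labels correctly:

```latex
\mathrm{Acc}(g) \;=\; \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\!\left[\, g(f_j, e_j, \hat{e}_j) = y_j \,\right], \qquad y_j \in \{\text{in}, \text{out}\}
```

Since in-probes and out-probes are balanced (5,000 each per subcorpus), 50% corresponds to random guessing.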
5 Membership Inference Attacks
5.1 Shadow Model Framework
Bob’s initial attack approach is to use “shadow models”, similar to Shokri et al. (2017). The idea is that Bob creates MT models with his own data to mimic (shadow) the behavior of Alice’s MT model, then trains a membership inference classifier on the outputs of these shadow models. To do so, Bob splits his data ℬall into his own versions of in-probe, out-probe, and training sets in multiple ways and trains MT models on them. He then translates the probe sentences with his own shadow MT models, and uses the resulting translations ê, together with their in or out labels, to train a binary classifier g(⋅). If Bob’s shadow models are sufficiently similar to Alice’s in behavior, this attack can work.
Bob first selects 10 sets of 5,000 sentences per subcorpus in ℬall. He then chooses two sets, uses one as the in-probe and the other as the out-probe, and combines the in-probe with the rest (ℬall minus the 10 sets) to form a training set. Bob then swaps the in-probe and out-probe to create another group; we refer to these as groups 1 and 2. With 10 sets of 5,000 sentences, Bob can create 10 different groups of in-probe, out-probe, and training sets. Figure 4 illustrates the data splits.
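A sketch of this splitting scheme (ignoring the per-subcorpus stratification for brevity; names are illustrative): pairing the ten reserved sets and swapping the in/out roles within each pair yields the ten groups.

```python
import random

def make_shadow_groups(bob_data, k=5000, n_sets=10, seed=0):
    """Reserve n_sets disjoint sets of k sentence pairs, then pair them up so
    that each pair yields two groups with in-probe and out-probe roles swapped."""
    data = list(bob_data)
    random.Random(seed).shuffle(data)
    reserved = [data[i * k:(i + 1) * k] for i in range(n_sets)]
    remainder = data[n_sets * k:]            # shared by every shadow model

    groups = []
    for a, b in zip(reserved[0::2], reserved[1::2]):
        groups.append({"in": a, "out": b, "train": a + remainder})
        groups.append({"in": b, "out": a, "train": b + remainder})  # roles swapped
    return groups                            # 10 groups from 10 reserved sets
```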
For each group of data, Bob first trains a shadow MT model using the training set. He then uses this model to translate the sentences in the in-probe and out-probe sets. Bob now has a list of translations ê from different shadow models, and he knows for each sample whether it was in or out of the training data of the MT model used to translate it.
5.2 Bob MT Architecture
Bob’s model is a 4-layer Transformer with untied embeddings, model/embedding size 512, 8 attention heads, 1,024 hidden states in the feed-forward layers, and a word-based batch size of 4,096. The model is optimized with Adam (Kingma and Ba, 2015), regularized with label smoothing (0.1), and trained until perplexity on newstest2016 (Bojar et al., 2016) had not improved for 16 consecutive checkpoints, computed every 4,000 batches. Bob uses BPE subword models with a vocabulary size of 30k for each language. The mean BLEU score of the ten shadow models on newstest2018 is 38.6±0.2 (compared with 42.6 for Alice).
5.3 Membership Inference Classifier
Bob extracts features from the translation ê and the reference e for a binary classifier. He uses modified 1- to 4-gram precisions and a smoothed sentence-level BLEU score (Lin and Och, 2004) as features. Bob’s intuition is that if an unusually large number of n-grams in ê match e, it could be a sign that this sentence pair was in the training data and Alice’s model memorized it. Bob calculates n-gram precision by counting the number of n-grams in the translation that appear in the reference sentence. In a later investigation, Bob also considers the MT model score as an extra feature.
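A sketch of this feature extraction, using NLTK’s smoothed sentence BLEU as a stand-in for the Lin and Och (2004) smoothing cited above (the tokenization and smoothing variant are assumptions):

```python
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def ngram_precision(hyp, ref, n):
    """Modified n-gram precision: clipped matches over n-grams in the hypothesis."""
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return matches / max(sum(hyp_ngrams.values()), 1)

def probe_features(e_hat, e):
    """1- to 4-gram precisions plus a smoothed sentence-level BLEU score."""
    hyp, ref = e_hat.split(), e.split()
    feats = [ngram_precision(hyp, ref, n) for n in range(1, 5)]
    feats.append(sentence_bleu([ref], hyp,
                               smoothing_function=SmoothingFunction().method3))
    return feats
```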
Bob tries different types of classifiers, namely, Perceptron (P), Decision Tree (DT), Naïve Bayes (NB), Nearest Neighbors (NN), and Multi-layer Perceptron (MLP). DT uses Gini impurity as the splitting criterion with a maximum depth of 5. NB assumes Gaussian distributions. For NN we set the number of neighbors to 5 and use the Minkowski distance. For MLP, we set the size of the hidden layer to 100, the activation function to ReLU, and the L2 regularization term α to 0.0001.
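The paper does not name a toolkit; scikit-learn is one natural way to instantiate these five classifiers with the stated settings (several of which are already the library defaults):

```python
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "P": Perceptron(),
    "DT": DecisionTreeClassifier(criterion="gini", max_depth=5),
    "NB": GaussianNB(),
    "NN": KNeighborsClassifier(n_neighbors=5, metric="minkowski"),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), activation="relu", alpha=0.0001),
}

# Each classifier is then fit on features extracted from shadow-model translations, e.g.:
# classifiers["DT"].fit(X_train, y_train); classifiers["DT"].score(X_valid, y_valid)
```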
Algorithm 1 summarizes the procedure for constructing a membership inference classifier g(⋅) using Bob’s dataset ℬall. For training the binary classifiers, Bob uses models from data splits 1 to 3 for training, 4 for validation, and 5 for his own internal testing. Note that the final evaluation of the attack is done by Carol, using the translations of the in-probes and out-probes produced by Alice’s MT model.
6 Attack Results
We now present a series of results based on the shadow model attack method described in Section 5. In Section 6.1 we will observe that Bob has difficulty attacking Alice under our definition of membership inference. In Sections 6.2 and 6.3 we will see that Alice nevertheless does leak some private information under more nuanced conditions. Section 6.4 describes the possibility of attacks beyond sentence-level membership. Section 6.5 explores the attacks using external resources.
6.1 Main Result
Table 2 shows the accuracy of the membership inference classifiers. There are 5 different types of classifiers, as described in Section 5.3. The numbers in the Alice column show the attack accuracy on the Alice probes; these are the main results. The numbers in the Bob columns show the results on the Bob classifiers’ train, validation, and test sets, as described in Section 5.3.
Table 2: Accuracy of the membership inference classifiers.

| Classifier | Alice | Bob:train | Bob:valid | Bob:test |
|---|---|---|---|---|
| P | 50.0 | 50.0 | 50.0 | 50.0 |
| DT | 50.4 | 51.4 | 51.2 | 51.1 |
| NB | 50.4 | 51.2 | 51.1 | 51.0 |
| NN | 49.9 | 61.6 | 50.5 | 50.0 |
| MLP | 50.2 | 50.8 | 50.8 | 50.8 |
The results of the attacks on the Alice model show that accuracy is around 50%, meaning that the attack is not successful and the binary classification is almost the same as a random choice.9 The accuracy is also around 50% for Bob:valid, meaning that Bob has difficulty attacking even his own simulated probes; therefore the poor performance on the Alice probes is not due to mismatches between Alice’s model and Bob’s shadow models.
The accuracy is around 50% for Bob:train as well, revealing that the classifier g(⋅) is underfitting.10 This suggests that the current features do not provide enough information to distinguish in-probe and out-probe sentences. Figure 5 shows the confusion matrices of the classifier output on Alice probes. We see that for all classifiers, whatever prediction they make is incorrect half of the time.
Table 3 shows the result when MT model score is added as an extra feature for classification. The result indicates that this extra information does not improve the attack accuracy. In summary, these results suggest that Bob is not able to reveal membership information at the sentence/sample level. This result is in contrast to previous work on membership inference in “classification” problems, which demonstrated high accuracy with Bob’s shadow model attack.
Table 3: Accuracy when the MT model score is added as an extra feature.

| Classifier | Alice | Bob:train | Bob:valid | Bob:test |
|---|---|---|---|---|
| P | 49.7 | 49.2 | 49.3 | 49.4 |
| DT | 50.4 | 51.5 | 51.1 | 51.2 |
| NB | 50.1 | 50.2 | 50.1 | 50.2 |
| NN | 50.2 | 67.1 | 50.2 | 50.0 |
| MLP | 50.4 | 51.2 | 51.2 | 51.1 |
Additionally, note that although accuracies are close to 50%, the Bob:test numbers tend to be slightly higher than Alice’s for some classifiers. This may reflect the fact that Bob:test is a matched condition using the same shadow MT architecture, while the Alice probes come from a mismatched condition using an unknown MT architecture. It is important to compare both numbers in the experiments: accuracy on the Alice probes is the real evaluation, and accuracy on Bob:test is a diagnostic.
6.2 Out-of-Domain Subcorpora
Carol prepared OOD subcorpora that are separate from Alice’s training data and ℬall. The membership inference accuracy for each subcorpus is shown in Table 4. The accuracy for the OOD subcorpora is much higher than that for the original in-domain subcorpora. For example, the accuracy with the Decision Tree was 50.3% and 51.1% for ParaCrawl and CommonCrawl (in-domain), whereas it was 67.2% and 94.1% for EMEA and Koran (out-of-domain). This suggests that for OOD data Bob has a better chance of inferring membership.
Table 4: Membership inference accuracy per subcorpus.

| Classifier | ParaCrawl | CommonCrawl | Europarl | News | Rapid | EMEA | Koran | Subtitles | TED |
|---|---|---|---|---|---|---|---|---|---|
| P | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| DT | 50.3 | 51.1 | 49.7 | 50.7 | 50.0 | 67.2 | 94.1 | 80.2 | 67.1 |
| NB | 50.1 | 51.2 | 49.9 | 50.6 | 50.2 | 69.5 | 96.1 | 81.7 | 70.5 |
| NN | 49.4 | 50.7 | 50.3 | 49.7 | 49.2 | 43.3 | 52.6 | 48.7 | 49.9 |
| MLP | 49.6 | 50.8 | 49.9 | 50.3 | 50.7 | 73.6 | 97.9 | 84.8 | 85.0 |
In Table 4 we can see that the Perceptron has accuracy 50% for all in-domain subcorpora and 100% for all OOD subcorpora. Note that the OOD subcorpora only have out-probes: by definition, none of the samples from OOD subcorpora are in the training data. We get this accuracy because our Perceptron always predicts out, as we can see in Figure 5. We believe this behavior is caused by applying the Perceptron to inseparable data, and this particular model happened to be trained to act this way. To confirm this, we trained variations of the Perceptron by shuffling the training data, and observed that the resulting models had different output ratios of in and out, in some cases always predicting in for both in-domain and OOD subcorpora.
Figure 6 shows the distribution of sentence-level BLEU scores per subcorpus. The BLEU scores tend to be lower for OOD subcorpora, and the classifier may exploit this information to better distinguish membership. Note, however, that EMEA (out-of-domain) and CommonCrawl (in-domain) have similar BLEU scores yet vastly different membership accuracies, so the classifier may also be exploiting n-gram match distributions.
Overall, these results suggest that Bob’s accuracy depends on the specific type of probe being tested. If there is a wide distribution of domains, there is a higher chance that Bob may be able to reveal membership information. Note that in the actual scenario Bob has no way of knowing what is OOD for Alice, so this is not a signal Bob can directly exploit. This section is meant as an error analysis describing how membership inference classifiers behave differently when the probe is OOD.
6.3 Out-of-Vocabulary Words
We also focused on samples that contain words that never appear in the training data of the MT model used for translation, that is, out-of-vocabulary (OOV) words. For this analysis, we consider only vocabulary that does not exist in the training data of Bob’s shadow MT models, rather than Alice’s, since Bob does not have access to her vocabulary. By definition, there are only out-probes in the OOV subsets.
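A simple whitespace-token sketch of how such OOV probes can be flagged (the actual analysis may use a different tokenization; names are illustrative):

```python
def oov_flags(f, e, shadow_src_vocab, shadow_tgt_vocab):
    """Flag whether the source, the reference, or both contain a word that
    never appears in the shadow model's training data."""
    src_oov = any(w not in shadow_src_vocab for w in f.split())
    tgt_oov = any(w not in shadow_tgt_vocab for w in e.split())
    return src_oov, tgt_oov

# Probes where src_oov and/or tgt_oov is True form the OOV subsets analyzed here.
```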
For Bob’s shadow models, 7.4%, 3.2%, and 1.9% of samples in the probe sets had one or more OOV words in source, reference, or both sentences, respectively. Table 5 shows the membership inference accuracy of the OOV subsets from the Bob test set, which is generally very high (>70%). This implies that sentences with OOV words are translated idiosyncratically compared with the ones without OOV words, and the classifier can exploit this.
6.4 Alternative Evaluation: Grouping Probes
Section 6.1 showed that it is generally difficult for Bob to determine membership for the strict definition of one sentence per probe. What if we loosen the problem, letting the probe be a group of sentences?
We create probes of 500 sentences each to investigate this hypothesis. Bob randomly samples 500 sentences with the same label from his training set to form a probe group. To create sufficient training data for his classifier, Bob repeats this sampling to create 6,000 groups. Bob uses the percentage of sentences in each sentence-BLEU bin and the corpus BLEU as features for classification. For each group, Bob counts the number of sentences whose sentence BLEU falls into each bin; the bin size is set to 0.01. Bob also uses all 500 translations together to calculate the group’s corpus BLEU score. Bob trains the classifiers using these features, and applies them to Bob’s validation and test sets and the Alice sets. These sets are evenly split into groups of 500, not sampled as done in training.
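A sketch of the group-level features, assuming sacreBLEU for corpus BLEU and NLTK’s smoothed sentence BLEU (both implementation choices are assumptions; the paper does not specify them):

```python
import sacrebleu
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def group_features(hyps, refs, bin_size=0.01):
    """Features for one probe group (e.g., 500 sentences): the fraction of
    sentences falling into each sentence-BLEU bin, plus the group's corpus BLEU."""
    n_bins = int(1.0 / bin_size)
    hist = [0.0] * n_bins
    for h, r in zip(hyps, refs):
        b = sentence_bleu([r.split()], h.split(),
                          smoothing_function=SmoothingFunction().method3)
        hist[min(int(b / bin_size), n_bins - 1)] += 1.0 / len(hyps)
    corpus = sacrebleu.corpus_bleu(hyps, [refs]).score
    return hist + [corpus]
```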
Table 6 shows the accuracy on probe groups. We can see that the accuracy is much higher than 50%, not only for Bob’s training set but also for his validation and test sets. However, for Alice, we found that the classifiers were almost always predicting in, resulting in accuracy around 50%. This is because the classifiers were trained on shadow models that have lower BLEU scores than Alice’s model. This suggests that we need to incorporate information about the Alice/Bob MT performance difference.
Table 6: Accuracy on probe groups of 500 sentences.

| Classifier | Bob:train | Bob:valid | Bob:test | Alice:original | Alice:adjusted |
|---|---|---|---|---|---|
| P | 71.6 | 69.4 | 68.1 | 50.0 | 59.0 |
| DT | 70.4 | 65.6 | 64.4 | 52.0 | 61.0 |
| NB | 72.9 | 67.5 | 70.0 | 50.0 | 50.0 |
| NN | 77.4 | 66.9 | 62.5 | 51.0 | 50.0 |
| MLP | 73.0 | 68.8 | 70.0 | 50.0 | 52.0 |
One way to adjust for the difference is to directly manipulate the input feature values. We adjusted the feature values, compensating for the difference in mean BLEU scores, and accuracy on the Alice probes increased to around 60% for P and DT, as shown in the “adjusted” column of Table 6. If a classifier takes advantage of the absolute feature values in its decision, the adjustment may provide improvements; if not, improvements are less likely. Before the adjustment, all classifiers were predicting everything to be in for the Alice probes. Classifiers like NB and MLP apparently did not change how often they predict in even after the adjustment, whereas classifiers like P and DT did. In a real scenario this BLEU difference can be reasonably estimated by Bob, since he can use Alice’s translation API to calculate the BLEU score on a held-out set and compare it with his shadow models.
Another possible approach to handle the problem of classifiers always predicting in is to consider the relative ranking of the classifier output scores. We can rank the samples by the classifier output scores and label the top N% as in and the rest as out. Figure 7 shows how the accuracy changes when varying the in percentage. We can see that the accuracy can be much higher than the original result, especially if Bob can adjust the threshold based on his knowledge of the in percentage in the probe.
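A sketch of this thresholding, for classifiers that expose a probability or confidence score (the column index for the in class is an assumption and should be read off the classifier’s class ordering in practice):

```python
import numpy as np

def rank_threshold_predict(classifier, features, in_fraction):
    """Rank probes by the classifier's score for the 'in' class and label the
    top in_fraction of them as in, the rest as out."""
    scores = classifier.predict_proba(features)[:, 1]   # assumes column 1 = "in"
    cutoff = np.quantile(scores, 1.0 - in_fraction)
    return np.where(scores >= cutoff, "in", "out")
```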
This is the first strong general result for Bob, suggesting that membership inference attacks are possible if probes are defined as groups of sentences.11 Importantly, note that the classifier threshold adjustment is performed only for the classifiers in this section, and is not relevant for the classifiers in Sections 6.1 to 6.3.
6.5 Attacks Using External Resources
Our results in Section 6.1 demonstrate the difficulty of general membership inference attacks. One natural question is whether attacks can be improved with even stronger features or classifiers, in particular by exploiting external resources beyond the dataset Carol provided to Bob. We tried two different approaches: one using a Quality Estimation model trained on additional data, and another using a neural sequence model with a pre-trained language model.
Quality Estimation (QE) is the task of predicting the quality of a translation at the sentence or word level. One may imagine that a QE model might produce useful features for teasing apart in and out, because in translations may show detectable improvements in quality. To train this model, we used the external dataset from the WMT shared task on QE (Specia et al., 2018). Note that for our language pair, German to English, the shared task only had a labeled dataset for an SMT system. Our models are NMT, so the estimation quality may not be optimally matched, but we believe this is the best data available at this time. We applied the Predictor-Estimator (Kim et al., 2017) implemented in the open-source QE framework OpenKiwi (Kepler et al., 2019). It consists of a predictor that predicts each token of the target sentence given the target context and the source, and an estimator that takes features produced by the predictor to estimate the labels; both are made of LSTMs. We used this model because it is one of the best models seen in the shared tasks, and it does not require alignment information. The model metrics on the WMT18 dev set, namely, Pearson’s correlation, Mean Absolute Error, and Root Mean Squared Error for sentence-level scores, are 0.6238, 0.1276, and 0.1745, respectively.
We used the sentence score estimated by the QE model as an extra feature for the classifiers described in Section 6.1. The results are shown in Table 7. We can see that this extra feature did not significantly influence the accuracy. In a more detailed analysis, we find the reason to be that our in and out probes both contain a range of translations from low to high quality, and our QE model may not be sufficiently fine-grained to tease apart any potential differences. In fact, this may be difficult even for a human estimator.
Table 7: Accuracy with the QE score as an extra feature, and with BERT as the classifier.

| Classifier | Alice | Bob:train | Bob:valid | Bob:test |
|---|---|---|---|---|
| P | 50.0 | 49.9 | 50.0 | 50.0 |
| DT | 50.3 | 51.4 | 51.1 | 51.1 |
| NB | 50.4 | 51.2 | 51.1 | 51.0 |
| NN | 49.8 | 66.1 | 50.0 | 50.1 |
| MLP | 50.4 | 51.0 | 51.0 | 50.8 |
| BERT | 50.0 | 50.0 | 50.0 | 50.0 |
Another approach to exploiting external resources is to use a language model pre-trained on a large amount of text. In particular, we used BERT (Devlin et al., 2019), which has shown competitive results on many NLP tasks. We used BERT directly as a classifier, following a fine-tuning setup similar to paraphrase detection: in our case the inputs are the English translation and reference sentences, and the output is the binary membership label. This setup is similar to the classifiers we described in Section 5.3, except that rather than training a Perceptron or Decision Tree on manually defined features, we directly apply a BERT-based sequence encoder to the raw sentences.
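A minimal sketch of this setup with the Hugging Face transformers library (an assumption; the original fine-tuning code and hyperparameters are not shown here), encoding the translation/reference pair exactly as in sentence-pair classification:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=2)  # in / out

def bert_membership_logits(e_hat, e):
    """Encode (translation, reference) as a sentence pair and return logits
    over the two membership labels; fine-tuning on Bob:train is omitted."""
    inputs = tokenizer(e_hat, e, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        return model(**inputs).logits
```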
We fine-tuned the BERT Base Cased English model on Bob:train. The results are shown in Table 7. Similar to the previous results, the accuracy is 50%, so the attack using BERT as a classifier was not successful. Detailed examination of the BERT classifier probabilities shows that they are scattered around 0.5 and are in general quite random for both the Bob and Alice probes. This result is similar to that of the simpler classifiers in Section 6.1.
In summary, from these results we can see that even with external resources and more complex classifiers, sentence-level attack is still very difficult for Bob. We believe this attests to the inherent difficulty of the sentence-level membership inference problem.
7 Discussions and Conclusions
We formalized the problem of membership inference attacks on sequence generation tasks, and used machine translation as an example to investigate the feasibility of a privacy attack.
Our results in Sections 6.1 and 6.5 show that Alice is generally safe and that it is difficult for Bob to infer sentence-level membership. In contrast to attacks on standard classification problems (Shokri et al., 2017), sequence generation problems may be harder to attack because the input and output spaces are far larger and more complex, making it difficult to determine the quality of the model output or how confident the model is. Also, the output distribution over class labels is an effective feature for the attacker in standard classification problems, but it is difficult to exploit in the sequence case.
However, this does not mean that Alice has no risk of leaking private information. Our analyses in Sections 6.2 and 6.3 show that Bob’s accuracy on out-of-domain and out-of-vocabulary data is above chance, suggesting that attacks may be feasible in conditions where unseen words and domains cause the model to behave differently. Further, Section 6.4 shows that for a looser definition of membership attack on groups of sentences, the attacker can win at a level above chance.
Our attack approach was a simple one, using shadow models to mimic the target model. Bob can attempt more complex strategies, for example, by using the translation API multiple times per sentence. Bob can manipulate a sentence, for example, by dropping or adding words, and observe how the translation changes. We may also use the metrics proposed by Carlini et al. (2018) as features for Bob; they show how recurrent models might unintentionally memorize rare sequences in the training data, and propose a method to detect it. Bob can also add “watermark sentences” that have some distinguishable characteristics to influence the Alice model, making attack easier. To guard against these attacks, Alice’s protection strategy may include random subsampling of training data or additional regularization terms.
Finally, we note some important caveats when interpreting our conclusions. The translation quality of the Alice and Bob MT models turned out to be similar in terms of BLEU. This situation favors Bob, but in practice Bob is not guaranteed to be able to create shadow models of the same standard, nor to verify how well they perform compared with the Alice model. We stress that when interpreting the results, one must evaluate on Bob’s test set and the Alice probes side by side, as in Tables 2, 3, and 7, to account for the fact that Bob’s attack on his own shadow model translations is likely an optimistic upper bound on the real attack accuracy against Alice’s model.
We believe our dataset and analysis is a good starting point for research in these privacy questions. Although we focused on MT, the formulation is applicable to other kinds of sequence generation models such as text summarization and video captioning; these will be interesting as future work.
Acknowledgments
The authors thank the anonymous reviewers and the action editor, Colin Cherry, for their comments.
Notes
Popularized by Ronald Reagan in the context of nuclear disarmament.
We release the data to encourage further research in this new problem: https://github.com/sorami/tacl-membership
In the experiments, we will also consider extending the information available to Bob. For example, if Alice additionally provides the translation probabilities ρ in the API, then Bob can exploit that in the classifier, e.g., as g(f, e, ê, ρ).
These are design decisions that balance between simple experimentation vs. realistic condition. Carol doing a common tokenization removes some of the MT-specific complexity for researchers who want to focus on the Alice or Bob models. However, in a real-world public API, Alice’s tokenization is likely to be unknown to Bob. We decided on a middle ground to have Carol perform a common tokenization, but Alice and Bob do their own subword segmentation.
We prepared two different pairs of in-probe and out-probe sets. Thus ℬall has 10k fewer samples than Alice’s training data per subcorpus, and not 5k fewer. For the experiment we used only one pair, and kept the other for future use.
Three-way tied embeddings, model and embedding size 512, eight attention heads, 2,048 hidden states in the feed forward layers, layer normalization applied before each self-attention layer, and dropout and residual connections applied afterward, word-based batch size of 4,096.
Version 1.2.12, case-sensitive, “13a” tokenization for comparability with WMT.
Some numbers are slightly over 50%, which may be interpreted as small leak of privacy. Although the desired accuracy levels depend on the application, for the MT scenarios described in Section 2.2 Bob would need much higher accuracies. For example, if Bob is a bakeoff organizer, he might want accuracy above 60% in order to determine whether to manually check the submission. However, if Bob is providing “MT as a service” with strong privacy guarantees, he may need to provide the client with accuracy higher than 90%.
The higher accuracy for k-NN is an exception, but is due to having the exact same datapoint in the model as the input, which always becomes the nearest neighbor. When the k value is increased, the accuracy on in-sample data decreased.
We can imagine an alternative definition of this group-level membership inference where Bob’s goal is to predict the percentage of overlap with respect to Alice’s training data. This assumes that model trainers make corpus-level decisions about what data to train on. Reformulation of a binary problem to a regression problem may be useful for some purposes.
References
Author notes
Work done while visiting Johns Hopkins University.