Interpreting the Robustness of Neural NLP Models to Textual Perturbations

Yunxiang Zhang¹, Liangming Pan², Samson Tan², Min-Yen Kan²
¹ Wangxuan Institute of Computer Technology, Peking University
² School of Computing, National University of Singapore
[email protected], [email protected], {samson.tmr,kanmy}@comp.nus.edu.sg

arXiv:2110.07159v2 [cs.CL] 18 Mar 2022

Abstract

Modern Natural Language Processing (NLP) models are known to be sensitive to input perturbations, and their performance can decrease when applied to real-world, noisy data. However, it is still unclear why models are less robust to some perturbations than others. In this work, we test the hypothesis that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence). We further give a causal justification for the learnability metric. We conduct extensive experiments with four prominent NLP models — TextRNN, BERT, RoBERTa and XLNet — over eight types of textual perturbations on three datasets. We show that a model which is better at identifying a perturbation (higher learnability) becomes worse at ignoring such a perturbation at test time (lower robustness), providing empirical support for our hypothesis.

1 Introduction

Despite the success of deep neural models on many Natural Language Processing (NLP) tasks (Liu et al., 2016; Devlin et al., 2019; Liu et al., 2019b), recent work has discovered that these models are not robust to noisy input from the real world and that their performance decreases accordingly (Prabhakaran et al., 2019; Niu et al., 2020; Ribeiro et al., 2020; Moradi and Samwald, 2021). A reliable NLP system should not be easily fooled by slight noise in the text. Although a wide range of evaluation approaches for robust NLP models have been proposed (Ribeiro et al., 2020; Morris et al., 2020; Goel et al., 2021; Wang et al., 2021), few attempts have been made to understand these benchmark results. Given the differences in robustness across models and perturbations, it is natural to ask why models are more sensitive to some perturbations than others. It is crucial to avoid over-sensitivity to input perturbations, and understanding why it happens is useful for revealing the weaknesses of current models and for designing more robust training methods. To the best of our knowledge, a quantitative measure to interpret the robustness of NLP models to textual perturbations has yet to be proposed.

To improve robustness under perturbation, it is common practice to leverage data augmentation (Li and Specia, 2019; Min et al., 2020; Tan and Joty, 2021). Similarly, how much data augmentation through a perturbation improves model robustness varies between models and perturbations. In this work, we aim to investigate two Research Questions (RQ):

• RQ1: Why are NLP models less robust to some perturbations than others?
• RQ2: Why does data augmentation work better at improving model robustness to some perturbations than others?

We test a hypothesis for RQ1 that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence).
We also validate another hypothesis for RQ2: that the learnability metric is predictive of the improvement in robust performance brought by data augmentation along a perturbation. Our proposed learnability is inspired by the concepts of the Randomized Controlled Trial (RCT) and the Average Treatment Effect (ATE) from causal inference (Rubin, 1974; Holland, 1986). Estimating the learnability of a perturbation for a model consists of three steps: ① randomly labelling a dataset, ② perturbing examples of a particular pseudo class with some probability, and ③ using the ATE to measure the ease with which the model learns the perturbation.

Exp No. | Measurement       | Label    | Perturbation | Training Examples                    | Test Examples
0       | Standard          | original | l ∈ ∅        | (xi, 0), (xj, 1)                     | (xi, 0), (xj, 1)
1       | Robustness        | original | l ∈ {0, 1}   | (xi, 0), (xj, 1)                     | (x*i, 0), (x*j, 1)
2       | Data Augmentation | original | l ∈ {0, 1}   | (xi, 0), (xj, 1), (x*i, 0), (x*j, 1) | (x*i, 0), (x*j, 1)
3       | Learnability      | random   | l′ ∈ {1′}    | (xj, 0′), (x*i, 1′)                  | (x*i, 1′)
4       | Learnability      | random   | l′ ∈ {1′}    | (xj, 0′), (x*i, 1′)                  | (xi, 1′)

Table 1: Example experiment settings for measuring learnability, robustness and improvement by data augmentation. We perturb an example if its label falls in the set of label(s) in the "Perturbation" column; ∅ means no perturbation at all. Training/test examples are the expected input data, assuming we have only one negative (xi, 0) and one positive (xj, 1) example in our original training/test set. l′ is a random label and x* is a perturbed example.

The core intuition for our method is to frame an RCT as a perturbation identification task and to formalize the notion of learnability as a causal estimand based on the ATE. We conduct extensive experiments on four neural NLP models with eight different perturbations across three datasets and find strong evidence for our two hypotheses. Combining these two findings, we further show that data augmentation is only more effective at improving robustness against perturbations that a model is more sensitive to, contributing to the interpretation of both robustness and data augmentation. Learnability also provides a clean setup for analyzing model behaviour under perturbation, which contributes to better model interpretation.

Contribution. This work provides an empirical explanation for why NLP models are less robust to some perturbations than others. The key to this question is perturbation learnability, which is grounded in the causality framework. We show a statistically significant inverse correlation between learnability and robustness.

2 Setup and Terminology

As a pilot study, we consider the task of binary text classification. The training set is denoted as D_train = {(x_1, l_1), ..., (x_n, l_n)}, where x_i is the i-th example and l_i ∈ {0, 1} is the corresponding label. We fit a model f : (x; θ) ↦ {0, 1} with parameters θ on the training data. A textual perturbation is a transformation g : (x; β) → x* that injects a specific type of noise into an example x with parameters β; the resulting perturbed example is x*. We design several experiment settings (Table 1) to answer our research questions. Experiment 0 in Table 1 is the standard learning setup, where we train and evaluate a model on the original dataset. Below we detail the other experiment settings.

2.1 Definitions

Robustness. We apply the perturbations to test examples and measure the robustness of a model to said perturbations as the decrease in accuracy.
In Table 1, Experiment 1 corresponds to the robustness measurement, where we train a model on the unperturbed dataset and test it on perturbed examples. We denote the test accuracy of a model f(⋅) on examples perturbed by g(⋅) in Experiment 1 as A_1(f, g, D*_test). Similarly, the test accuracy in Experiment 0 is A_0(f, D_test). Consequently, the robustness is calculated as the difference of test accuracies:

robustness(f, g, D) = A_1(f, g, D*_test) − A_0(f, D_test).    (1)

Models usually suffer a performance drop when encountering perturbations, so the robustness is usually negative, with lower values indicating decreased robustness.

Improvement by Data Augmentation (Post Augmentation ∆). To improve robust accuracy (Tu et al., 2020), i.e., accuracy on the perturbed test set, it is common practice to leverage data augmentation (Li and Specia, 2019; Min et al., 2020; Tan and Joty, 2021). We simulate the data augmentation process by appending perturbed data to the training set (Experiment 2 of Table 1). We calculate the improvement in performance after data augmentation as the difference of test accuracies:

∆_post_aug(f, g, D) = A_2(f, g, D*_test) − A_1(f, g, D*_test),    (2)

where A_2(f, g, D*_test) denotes the test accuracy of Experiment 2. Higher values of ∆_post_aug are better.

Learnability. We want to compare perturbations in terms of how well the model learns to identify them with a small amount of evidence. We cast learnability estimation as a perturbation classification task, where a model is trained to identify the perturbation in an example. Learnability estimation consists of three steps, namely ① assigning random labels, ② perturbing with probabilities, and ③ estimating model performance. Below we introduce the procedure and intuition for each step. This estimation framework is further grounded in concepts from the causality literature in Section 3, which justifies our motivations. We summarize our estimation approach formally in Algorithm 1 (Appendix A).

① Assigning Random Labels. We randomly assign a pseudo label to each training example regardless of its original label. Each data point has equal probability of being assigned the positive (l′ = 1) or negative (l′ = 0) pseudo label. This results in a randomly labeled dataset D′_train = {(x_1, l′_1), ..., (x_n, l′_n)}, where L′ ∼ Bernoulli(0.5). In this way, we ensure that there is no difference between the two pseudo groups, since the data are randomly split.

② Perturbing with Probabilities. We apply the perturbation g(⋅) to each training example in one of the pseudo groups (e.g., l′ = 1 in Algorithm 1).¹ In this way, we create a correlation between the existence of the perturbation and the label (i.e., the perturbation occurrence is predictive of the label). We control the perturbation probability p ∈ [0, 1], i.e., an example has probability p of being perturbed. This results in a perturbed training set D′*_train = {(x*_1, l′_1), ..., (x*_n, l′_n)}, where the perturbed example x*_i is:

z_i ∼ U(0, 1), ∀ i ∈ {1, 2, ..., n}
x*_i = g(x_i)   if l′_i = 1 and z_i < p,
x*_i = x_i      otherwise.    (3)

Here z_i is a random variable drawn from the uniform distribution U(0, 1). Due to the randomization in the former step, the only difference between the two pseudo groups is now the occurrence of the perturbation.

¹ Because the training data is randomly split into two pseudo groups, applying perturbations to either group should yield the same result. We assume that we always perturb the first group (l′ = 1) hereafter.
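To make steps ① and ② concrete, here is a minimal Python sketch (not the authors' released code) of the random labelling and probabilistic perturbation of Equation 3; duplicate_punctuations is a simplified, hypothetical stand-in for one perturbation function g(⋅).

```python
import random

def duplicate_punctuations(text):
    # Simplified stand-in for one perturbation g(.): double every punctuation mark.
    return "".join(ch * 2 if ch in ".,!?;:" else ch for ch in text)

def build_learnability_dataset(texts, perturb_fn, p, seed=0):
    """Step 1: assign random pseudo labels; Step 2: perturb the l'=1 group with probability p (Equation 3)."""
    rng = random.Random(seed)
    dataset = []
    for x in texts:
        pseudo_label = rng.randint(0, 1)            # L' ~ Bernoulli(0.5); the original label is ignored
        if pseudo_label == 1 and rng.random() < p:  # perturb only the l' = 1 group, with probability p
            x = perturb_fn(x)
        dataset.append((x, pseudo_label))
    return dataset

# Toy usage: four copies of the Figure 2 example sentence, perturbation probability 0.5.
toy = ["His quiet and straightforward demeanor was rare then, and would be today."] * 4
for text, label in build_learnability_dataset(toy, duplicate_punctuations, p=0.5):
    print(label, text)
```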
③ Estimating Model Performance. We train a model on the randomly labeled dataset with perturbed examples. Since the only difference between the two pseudo groups is the existence of the perturbation, the model is trained to identify the perturbation. The original test examples D_test are also assigned random labels and become D′_test. We perturb all of the test examples in one pseudo group (e.g., l′ = 1, as in step ②) to produce a perturbed test set D′*_test. Finally, the perturbation learnability is calculated as the difference of accuracies on D′*_test and D′_test, which indicates how much the model learns from the perturbation's co-occurrence with the pseudo label:

learnability(f, g, p, D) = A_3(f, g, p, D′*_test) − A_4(f, g, p, D′_test),    (4)

where A_3(f, g, p, D′*_test) and A_4(f, g, p, D′_test) are the accuracies measured in Experiments 3 and 4 of Table 1, respectively.

We observe that learnability depends on the perturbation probability p. For each model–perturbation pair, we obtain multiple learnability estimates by varying the perturbation probability (Figure 3). However, we expect that the learnability of the perturbation (as a concept) should be independent of the perturbation probability. To this end, we use the log AUC (area under the curve in log scale) of the p–learnability curve (Figure 3), termed "average learnability", which summarizes the overall learnability across different perturbation probabilities p_1, ..., p_t:

avg_learnability(f, g, D) := logAUC({(p_i, learnability(f, g, p_i, D)) | i ∈ {1, 2, ..., t}}).    (5)

We use log AUC rather than AUC because we empirically find that the learnability varies substantially between perturbations when p is small, and a log scale better captures this nuance. We also introduce learnability at a specific perturbation probability (Learnability @ p) as an alternate summary metric and provide a comparison of this metric against log AUC in Appendix D.
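As an illustration of Equation 5, the sketch below integrates a p–learnability curve against log10(p) with the trapezoidal rule. The paper does not pin down the base or normalization of its log AUC here, so those details are assumptions, and the learnability values used are hypothetical.

```python
import numpy as np

def average_learnability(probabilities, learnabilities):
    """Trapezoidal area under the p-learnability curve, with p on a log10 axis (assumed convention)."""
    order = np.argsort(probabilities)
    x = np.log10(np.asarray(probabilities, dtype=float)[order])
    y = np.asarray(learnabilities, dtype=float)[order]
    # Trapezoidal rule: sum of segment widths times average segment heights.
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Hypothetical learnability estimates at the perturbation probabilities used in Section 4.2.
ps = [0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.50, 1.00]
ls = [0.02, 0.10, 0.25, 0.45, 0.70, 0.85, 0.98, 1.00]
print(round(average_learnability(ps, ls), 3))
```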
2.2 Hypotheses

With the above-defined terminology, we propose hypotheses for RQ1 and RQ2 of Section 1, respectively.

Hypothesis 1 (H1): A model for which a perturbation is more learnable is less robust against the same perturbation at test time.

Hypothesis 2 (H2): A model for which a perturbation is more learnable experiences bigger robustness gains with data augmentation along such a perturbation.

We validate Hypotheses 1 and 2 with experiments on several perturbations and models described in Sections 4.1 and 4.2.

3 A Causal View on Perturbation Learnability

In Section 2.1, we introduce the term "learnability" in an intuitive way. Now we map it to a formal, quantitative measure in standard statistical frameworks. Learnability is actually motivated by concepts from the causality literature; we provide a brief introduction to basic concepts of causal inference in Appendix B. In fact, learnability is the causal effect of a perturbation on models, which is often difficult to measure due to confounding latent features. In the language of causality, this is "correlation is not causation". Causality provides insight on how to fully decouple the effect of the perturbation from other latent features. We introduce the causal motivations for steps ① and ③ of learnability estimation in Sections 3.1 and 3.2, respectively. This is not obvious, because the model encounters confounding latent features during training in learnability estimation, while it does not in robustness measurement.

Figure 1: Causal graph explanation for decoupling the perturbation and the latent feature with randomization. P is the perturbation and T is the latent feature. L is the original label and Y is the correctness of the predicted label. Panels: (a) Before randomization; (b) After randomization.
3.1 A Causal Explanation for Random Label Assignment

Natural noise (simulated by perturbations in this work) usually co-occurs with latent features in an example. If we did not assign random labels and simply perturbed one of the original groups, there would be confounding latent features that would prevent us from estimating the causal effect of the perturbation. Figure 1a illustrates this scenario. Both the perturbation P and a latent feature T may affect the outcome Y,² while the latent feature is predictive of the label L. Since we apply the perturbation P to examples with the same label, P is decided by L. It therefore follows that T is a confounder of the effect of P on Y, resulting in non-causal association flowing along the path P ← L ← T → Y. However, if we do randomize the labels, P no longer has any causal parents (i.e., incoming edges) (Figure 1b), because the perturbation is now purely random. Without the path represented by P ← L, all of the association that flows from P to Y is causal. As a result, we can directly calculate the causal effect from the observed outcomes (Section 3.2).

² Y is defined in Section 3.2.

3.2 Learnability is a Causal Estimand

We identify learnability as a causal estimand. In causality, the term "identification" refers to the process of moving from a causal estimand (here, the Average Treatment Effect, ATE) to an equivalent statistical estimand. We show that the difference of accuracies on D′*_test and D′_test is actually a causal estimand. We define the outcome Y of a test example x_i as the correctness of the predicted label:

Y_i(0) := 1{f(x_i) = l′_i},    (6)

where 1{⋅} is the indicator function. Similarly, the outcome Y of a perturbed test example x*_i is:

Y_i(1) := 1{f(x*_i) = l′_i}.    (7)

According to the definition of the Individual Treatment Effect (ITE, see Equation 9 of Appendix B), we have ITE_i = 1{f(x*_i) = l′_i} − 1{f(x_i) = l′_i}. We then take the average over all the perturbed test examples (half of the test set).³ This is our Average Treatment Effect (ATE):

ATE = E[Y(1)] − E[Y(0)]
    = E[1{f(x*) = l′}] − E[1{f(x) = l′}]
    = P(f(x*) = l′) − P(f(x) = l′)
    = A(f, g, p, D′*_test) − A(f, g, p, D′_test).    (8)

³ The other half of the test set (l′ = 0) is left unperturbed, following the same procedure as in Section 2.1. Model predictions do not change for unperturbed examples, resulting in ITEs with zero values; we therefore do not take them into account in the ATE calculation.

Perturbation | Example Sentence
None | His quiet and straightforward demeanor was rare then and would be today.
duplicate_punctuations | His quiet and straightforward demeanor was rare then and would be today..
butter_fingers_perturbation | His quiet and straightforward demeanor was rarw then and would be today.
shuffle_word | quiet would and was be and straightforward then demeanor His today. rare
random_upper_transformation | His quiEt and straightForwARd Demeanor was rare TheN and would be today.
insert_abbreviation | His quiet and straightforward demeanor wuz rare then and would b today.
whitespace_perturbation | His quiet and straightforward demean or wa s rare thenand would be today.
visual_attack_letters | Hiṩ qủiẽt ầռd strḁighṭḟorwẳrȡ dԑmeanoŕ wȃṣ rȧre tḫen and wouᶅd ϸә tອḏầȳ.
leet_letters | His qui3t and strai9htfor3ard d3m3an0r 3as rar3 t43n and 30uld 63 t0da4.

Figure 2: An example sentence with different types of perturbations.

Here, A(f, g, p, D) is the accuracy of model f(⋅) trained with perturbation g(⋅) at perturbation probability p on test set D. Therefore, we show that the ATE is exactly the difference in accuracy between the perturbed and unperturbed test sets with random labels, and this difference is the learnability according to Equation 4. We discuss another means of identification of the ATE, based on the prediction probability, in Appendix C, where we also compare the probability-based and accuracy-based metrics. We find that our accuracy-based metric yields better resolution, so we report it in the main text of this paper.

4 Experiments

4.1 Perturbation Methods

Criteria for Perturbations. We select various character-level and word-level perturbation methods from the existing literature that simulate different types of noise an NLP model may encounter in real-world situations. These perturbations are non-adversarial, label-consistent, and can be automatically generated at scale. We note that our perturbations do not require access to the model's internal structure. We also assume that the feature introduced by a perturbation does not exist in the original data. Not all perturbations in the existing literature are suitable for our task. For example, a perturbation that swaps gender words (i.e., female → male, male → female) is not suitable for our experiments, since we cannot distinguish the perturbed text from an unperturbed one. In other words, the perturbation function g(⋅) should be asymmetric, such that g(g(x)) ≠ x.

Figure 2 shows an example sentence with different perturbations. "duplicate_punctuations" doubles the punctuation by appending a duplicate after each punctuation mark, e.g., "," → ",,"; "butter_fingers_perturbation" misspells some words with noise arising from keyboard typos; "shuffle_word" randomly changes the order of words in the text (Moradi and Samwald, 2021); "random_upper_transformation" randomly upper-cases letters (Wei and Zou, 2019); "insert_abbreviation" implements a rule system that encodes word sequences associated with the replaced abbreviations; "whitespace_perturbation" randomly removes or adds whitespace in the text; "visual_attack_letters" replaces letters with visually similar, but different, letters (Eger et al., 2019); "leet_letters" replaces letters with leet, a common encoding used in gaming (Eger et al., 2019).

4.2 Experimental Settings

To test the learnability, robustness and improvement by data augmentation with different NLP models and perturbations, we experiment with four modern and representative neural NLP models: TextRNN (Liu et al., 2016), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b) and XLNet (Yang et al., 2019). For TextRNN, we use the implementation from the open-source text classification toolkit NeuralClassifier (Liu et al., 2019a). For the other three pretrained models, we use the bert-base-cased, roberta-base and xlnet-base-cased versions from Hugging Face (Wolf et al., 2020), respectively. These two platforms support most of the common NLP models, thus facilitating extension studies of more models in the future.
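For orientation, the three pretrained checkpoints named above can be loaded for binary classification through the standard Hugging Face interface. This is a setup sketch only; the actual training hyperparameters are not specified in this excerpt.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# TextRNN comes from NeuralClassifier; the three pretrained encoders come from Hugging Face.
checkpoints = ["bert-base-cased", "roberta-base", "xlnet-base-cased"]

models = {}
for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Sequence classification head with two labels, matching the binary setup of Section 2.
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    models[name] = (tokenizer, model)
```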
We use three common binary text classification datasets — IMDB movie reviews (IMDB) (Pang and Lee, 2005), Yelp polarity reviews (YELP) (Zhang et al., 2015), and Quora Question Pairs (QQP) (Iyer et al., 2017) — as our testbeds. The IMDB and YELP datasets present the task of sentiment analysis, where each sentence is labelled as having positive or negative sentiment. QQP is a paraphrase detection task, where each pair of sentences is marked as semantically equivalent or not. To control for the effects of dataset size and imbalanced classes, all datasets are randomly subsampled to the same size as IMDB (50k) with balanced classes. The training steps for all experiments are the same as well. We implement the perturbations g(⋅) with two self-designed ones and six selected from the NL-Augmenter library (Dhole et al., 2021). For perturbation probabilities, we choose 0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.50 and 1.00. We run all experiments across three random seeds and report the average results.

Figure 3: Learnability of eight perturbations for four NLP models on three datasets, as a function of perturbation probability.

Perturbation | XLNet | RoBERTa | BERT | TextRNN | Average over models
whitespace_perturbation | 1.638 | 1.436 | 1.492 | 0.878 | 1.361
shuffle_word | 1.740 | 1.597 | 1.766 | 0.594 | 1.424
duplicate_punctuations | 1.086 | 1.499 | 1.347 | 2.050 | 1.495
butter_fingers_perturbation | 1.590 | 1.369 | 1.788 | 1.563 | 1.578
random_upper_transformation | 1.583 | 1.520 | 1.721 | 2.039 | 1.716
insert_abbreviation | 1.783 | 1.585 | 1.564 | 2.219 | 1.788
visual_attack_letters | 1.824 | 1.921 | 1.898 | 2.094 | 1.934
leet_letters | 1.816 | 2.163 | 1.817 | 2.463 | 2.065

Table 2: Average learnability (log AUC of the corresponding curve in Figure 3) of each model–perturbation pair on the IMDB dataset. Rows are sorted by average values over all models. The perturbation for which a model is most learnable is highlighted in bold while the following one is underlined.

Figure 4: Linear regression plots of learnability vs. robustness vs. post data augmentation ∆ on the IMDB dataset. Each point in the plots represents a model–perturbation pair. ρ is Spearman correlation. ∗ indicates high significance (p-value < 0.001). Panels: (a) Learnability vs. Robustness (ρ = −0.643∗); (b) Learnability vs. Post Aug ∆ (ρ = 0.756∗); (c) Learn. vs. Robu. vs. Post Aug ∆.

4.3 Perturbation Learnability Analysis

Figure 3 shows learnability as a function of perturbation probability. Learnability @ p generally increases as we increase the perturbation probability, and when we perturb all the examples (i.e., p = 1.0), every model identifies the perturbation easily, resulting in the maximum learnability of 1.0. This shows that neural NLP models eventually master these perturbations. At lower perturbation probabilities, some models still learn that the perturbation alone predicts the label. In fact, the major difference between the p–learnability curves lies in the region of lower perturbation probabilities, which motivates using log AUC instead of AUC as the summary of learnability at different p (Section 2.1). Table 2 shows the average learnability over all perturbation probabilities of each model–perturbation pair on the IMDB dataset in Figure 3.⁴ It reveals the most learnable perturbation for each model.
For example, the learnability of "visual_attack_letters" and "leet_letters" is very high for all four models, likely due to their strong effects on the tokenization process (Salesky et al., 2021). Perturbations like "whitespace_perturbation" and "duplicate_punctuations" are less learnable for the pretrained models, probably because they have weaker effects on subword-level tokenization, or because the models may have encountered similar noise in the pretraining corpora. We observe that "duplicate_punctuations" already exists in the original text of the YELP dataset (e.g., "The burgers are awesome!!"), thus violating our assumptions about perturbations in Section 4.1. As a result, the curve for this perturbation substantially deviates from the others in Figure 3, and we do not count this perturbation on the YELP dataset in the following analysis. The perturbation learnability experiments provide a clean setup for NLP practitioners to analyze the effect of textual perturbations on models.

ρ | IMDB | YELP | QQP
Avg. learnability vs. robustness | -0.643* | -0.821* | -0.695*
Avg. learnability vs. post aug ∆ | 0.756* | 0.846* | 0.750*

Table 3: Correlations of average learnability vs. robustness vs. post data augmentation ∆. ρ is Spearman correlation. ∗ indicates high significance (p-value < 0.001).

⁴ Please refer to Appendix E for benchmark results on the YELP (Table 5) and QQP (Table 6) datasets.

4.4 Empirical Findings

We observe a negative correlation between learnability (Equation 4) and robustness (Equation 1) across all three datasets in Table 3, validating Hypothesis 1. Table 3 also quantifies the trend that data augmentation with a perturbation the model is less robust to yields a larger improvement in robustness (Hypothesis 2). We plot the correlations on the IMDB dataset in Figures 4a and 4b.⁵ Both the correlation between 1) learnability and robustness and that between 2) learnability and improvement by data augmentation are strong (Spearman |ρ| > 0.6) and highly significant (p-value < 0.001), which firmly supports our hypotheses. Our findings provide insight into when a model is less robust and when data augmentation works better for improving robustness. Figure 4c shows that the more learnable a perturbation is for a model, the greater the likelihood that its robustness can be improved through data augmentation along this perturbation. We argue that this is not simply because there is more room for improvement by data augmentation. From a causal perspective, learnability acts as a common cause (confounder) of both robustness and the improvement by data augmentation. This indicates a potential limitation of using data augmentation for improving robustness to perturbations (Jha et al., 2020): data augmentation is only more effective at improving robustness against perturbations that are more learnable for a model.

⁵ For visualizations of the correlations on the other two datasets, please refer to Figure 5 for YELP and Figure 6 for QQP in Appendix E.
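As a note on how such numbers are computed, the Spearman correlations reported in Table 3 and Figure 4 can be reproduced from per-(model, perturbation) summaries with SciPy. The sketch below uses made-up values purely to illustrate the call; the real inputs are the averaged learnability, robustness and post augmentation ∆ estimates described above.

```python
from scipy.stats import spearmanr

# Hypothetical per-(model, perturbation) summaries, one entry per pair.
avg_learnability = [1.64, 1.74, 1.09, 1.59, 0.88, 0.59, 2.05, 2.46]
robustness       = [-0.08, -0.12, -0.02, -0.10, -0.01, -0.03, -0.25, -0.30]

# Spearman rank correlation and its two-sided p-value.
rho, p_value = spearmanr(avg_learnability, robustness)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
```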
5 Discussion

Potential Impacts. Our findings seem intuitive but are non-trivial. The NLP models were not trained on perturbed examples when measuring robustness, yet they still display a strong correlation with perturbation learnability. Understanding these findings is important for a more principled evaluation of and control over NLP models (Lovering et al., 2020). Specifically, the learnability metric complements the evaluation of newly designed perturbations by revealing model weaknesses in a clean setup. Reducing perturbation learnability is promising for improving the robustness of models. Contrastive learning (Gao et al., 2021; Yan et al., 2021), which pulls the representations of the original and perturbed text together, makes it difficult for the model to identify the perturbation (reducing learnability) and thus may help improve robustness. A perturbation can also be viewed as injecting a spurious feature into the examples, so the learnability metric also helps to interpret robustness to spurious correlations (Sagawa et al., 2020). Moreover, learnability may facilitate the development of model architectures with explicit inductive biases (Warstadt and Bowman, 2020; Lovering et al., 2020) that avoid sensitivity to noisy perturbations. Grounding learnability within the causality framework may inspire future researchers to incorporate the causal perspective into model design (Zhang et al., 2020) and make models robust to different types of perturbations.

Limitations. In this work, we focus on robust accuracy (Section 2.1), which is accuracy on the perturbed test set. We do not assume that the test accuracy on the original test set, a.k.a. in-distribution accuracy, is invariant to whether we train with augmentation or not. It would be interesting to investigate the trade-off between robust accuracy and in-distribution accuracy in the future. We also note that this work has not established that the relationship between learnability and robustness is causal. This could be explored with other approaches to deconfounding in causal inference besides simulating a randomized controlled trial, such as working with real data but stratifying it (Frangakis and Rubin, 2002), to bring the learnability experiment closer to more naturalistic settings. Although we restrict ourselves to balanced, binary classification for simplicity in this pilot study, our framework can also be extended to imbalanced, multi-class classification. We are aware that computing average learnability is expensive for large models and datasets, which is further discussed in Section 8; we provide a greener solution in Appendix D. We could further verify our assumptions about the perturbations with a user study (Moradi and Samwald, 2021) that investigates how understandable the perturbed texts are to humans.

6 Related Work

Robustness of NLP Models to Perturbations. The performance of NLP models can decrease when they encounter noisy data in the real world. Recent works (Prabhakaran et al., 2019; Ribeiro et al., 2020; Niu et al., 2020; Moradi and Samwald, 2021) present comprehensive evaluations of the robustness of NLP models to different types of perturbations, including typos, changed entities, negation, etc. Their results reveal that NLP models can handle some specific types of perturbation more effectively than others. However, they do not provide a deeper analysis of the reasons behind the differences in robustness between models and perturbations.

Interpretation of Data Augmentation. Although data augmentation has been widely used in CV (Sato et al., 2015; DeVries and Taylor, 2017; Dwibedi et al., 2017) and NLP (Wang and Yang, 2015; Kobayashi, 2018; Wei and Zou, 2019), the underlying mechanism of its effectiveness remains under-researched. Recent studies aim to quantify intuitions about how data augmentation improves model generalization. Gontijo-Lopes et al. (2020) introduce affinity and diversity, and find a correlation between the two metrics and augmentation performance in image classification. In NLP, Kashefi and Hwa (2020) propose a KL-divergence-based metric to predict augmentation performance.
Our proposed learnability metric indicates when data augmentation works better and thus acts as a complement to this line of research.

7 Conclusion

This work targets an open question in NLP: why are models less robust to some textual perturbations than others? We find that learnability, which causally quantifies how well a model learns to identify a perturbation, is predictive of the model's robustness to that perturbation. In future work, we will investigate whether these findings generalize to other domains, including computer vision.

8 Ethics Statement

Computing average learnability requires training a model multiple times at different perturbation probabilities, which can be computationally intensive if the datasets and models are large. This can be a non-trivial problem for NLP practitioners with limited computational resources. We hope that our benchmark results of typical perturbations for NLP models serve as a reference for potential users. Collaboratively sharing the results of such metrics on popular models and perturbations in public fora can also help reduce duplicate investigation and coordinate efforts across teams. To alleviate the computational efficiency issue of average learnability estimation, using learnability at selected perturbation probabilities may help, at the cost of reduced precision (Appendix D). We are not alone in facing this issue: two similar metrics for interpreting model inductive bias, extractability and s-only error (Lovering et al., 2020), also require training the model repeatedly over the whole dataset. Therefore, finding an efficient proxy for average learnability is promising for more practical use of learnability in model interpretation.

Acknowledgements

This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore. We acknowledge the support of NVIDIA Corporation for their donation of the GeForce RTX 3090 GPU that facilitated this research.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Terrance DeVries and Graham W Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

Kaustubh D Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Srivastava, Samson Tan, et al. 2021. Nl-augmenter: A framework for task-sensitive natural language augmentation. arXiv preprint arXiv:2112.02721.

Debidatta Dwibedi, Ishan Misra, and Martial Hebert. 2017. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1310–1319. IEEE Computer Society.

Steffen Eger, Gözde Gül Şahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, and Iryna Gurevych. 2019. Text processing like humans do: Visually attacking and shielding NLP systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1634–1647, Minneapolis, Minnesota. Association for Computational Linguistics.
Constantine E Frangakis and Donald B Rubin. 2002. Principal stratification in causal inference. Biometrics, 58(1):21–29.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Karan Goel, Nazneen Fatema Rajani, Jesse Vig, Zachary Taschdjian, Mohit Bansal, and Christopher Ré. 2021. Robustness gym: Unifying the nlp evaluation landscape. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pages 42–55.

Raphael Gontijo-Lopes, Sylvia Smullin, Ekin Dogus Cubuk, and Ethan Dyer. 2020. Tradeoffs in data augmentation: An empirical study. In International Conference on Learning Representations.

Paul W Holland. 1986. Statistics and causal inference. Journal of the American statistical Association, 81(396):945–960.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First quora dataset release: Question pairs.

Rohan Jha, Charles Lovering, and Ellie Pavlick. 2020. Does data augmentation improve generalization in nlp? arXiv preprint arXiv:2004.15012.

Omid Kashefi and Rebecca Hwa. 2020. Quantifying the evaluation of heuristic methods for textual data augmentation. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 200–208.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 452–457.

Zhenhao Li and Lucia Specia. 2019. Improving neural machine translation robustness via data augmentation: Beyond back-translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (WNUT 2019), pages 328–336, Hong Kong, China. Association for Computational Linguistics.

Liqun Liu, Funan Mu, Pengyu Li, Xin Mu, Jing Tang, Xingsheng Ai, Ran Fu, Lifeng Wang, and Xing Zhou. 2019a. NeuralClassifier: An open-source neural hierarchical multi-label text classification toolkit. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–92, Florence, Italy. Association for Computational Linguistics.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879.

Xiao Liu, Da Yin, Yansong Feng, Yuting Wu, and Dongyan Zhao. 2021. Everything has a cause: Leveraging causal inference in legal text analysis. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1928–1941.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Charles Lovering, Rohan Jha, Tal Linzen, and Ellie Pavlick. 2020. Predicting inductive biases of pretrained models. In International Conference on Learning Representations.

Junghyun Min, R Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2339–2352.

Milad Moradi and Matthias Samwald. 2021. Evaluating the robustness of neural language models to input perturbations.

John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, Online. Association for Computational Linguistics.

Brady Neal. 2020. Introduction to causal inference from a machine learning perspective. Course Lecture Notes (draft).

Xing Niu, Prashant Mathur, Georgiana Dinu, and Yaser Al-Onaizan. 2020. Evaluating robustness to input perturbations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8538–8544.

Bo Pang and Lillian Lee. 2005. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115–124.

Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. 2019. Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5740–5745.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of nlp models with checklist. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912.

Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688.

Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. 2020. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning, pages 8346–8356. PMLR.

Elizabeth Salesky, David Etter, and Matt Post. 2021. Robust open-vocabulary translation from visual text representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7235–7252.

Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. 2015. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.

Samson Tan and Shafiq Joty. 2021. Code-mixing on sesame street: Dawn of the adversarial polyglots. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3596–3616, Online. Association for Computational Linguistics.

Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8:621–633.
William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2557–2563, Lisbon, Portugal. Association for Computational Linguistics.

Xiao Wang, Qin Liu, Tao Gui, Qi Zhang, et al. 2021. Textflint: Unified multilingual robustness evaluation toolkit for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 347–355, Online. Association for Computational Linguistics.

Alex Warstadt and Samuel R Bowman. 2020. Can neural networks acquire a structural bias from raw linguistic data? In Proceedings of the Annual Meeting of the Cognitive Science Society.

Jason Wei and Kai Zou. 2019. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5065–5075, Online. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 5753–5763.

Cheng Zhang, Kun Zhang, and Yingzhen Li. 2020. A causal view on robustness of neural networks. Advances in Neural Information Processing Systems, 33:289–301.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28:649–657.
A Algorithm for Perturbation Learnability Estimation

Algorithm 1: Learnability Estimation
Input: training set D_train = {(x_1, l_1), ..., (x_n, l_n)}, test set D_test = {(x_{n+1}, l_{n+1}), ..., (x_{n+m}, l_{n+m})}, D = D_train ∪ D_test, model f : (x; θ) ↦ {0, 1}, perturbation g : (x; β) → x*, perturbation probability p
Output: learnability(f, g, p, D)

// ① assigning random labels
Initialize an empty dataset D′
for i in {1, 2, ..., n + m} do
    l′_i ← randint[0, 1]
    D′ ← D′ ∪ {(x_i, l′_i)}
end for
// ② perturbing with probabilities
Initialize an empty dataset D′*
for i in {1, 2, ..., n + m} do
    z ← rand(0, 1)
    x*_i ← x_i
    if l′_i = 1 ∧ z < p then
        x*_i ← g(x_i)
    end if
    D′* ← D′* ∪ {(x*_i, l′_i)}
end for
// ③ estimating model performance
D′_train, D′_test ← D′[1:n], D′[n+1:n+m]
D′*_train, D′*_test ← D′*[1:n], D′*[n+1:n+m]
fit the model f(⋅) on D′*_train
A(f, g, p, D′*_test) ← accuracy of f(⋅) on D′*_test
A(f, g, p, D′_test) ← accuracy of f(⋅) on D′_test
return A(f, g, p, D′*_test) − A(f, g, p, D′_test)

B Background on Causal Inference

The aim of causal inference is to investigate how a treatment T affects the outcome Y. A confounder X refers to a variable that influences both the treatment T and the outcome Y. For example, sleeping with shoes on (T) is strongly associated with waking up with a headache (Y), but they both have a common cause: drinking the night before (X) (Neal, 2020). In our work, we aim to study how a perturbation (treatment) affects the model's prediction (outcome). However, latent features and other noise usually act as confounders. Causality offers solutions for two questions: 1) how to eliminate the spurious association and isolate the treatment's causal effect; and 2) how varying T affects Y, given that both variables are causally related (Liu et al., 2021). We leverage both of these properties in our proposed method. Let us now introduce the Randomized Controlled Trial and the Average Treatment Effect as the key concepts in answering the above two questions, respectively.

Randomized Controlled Trial (RCT). In an RCT, each participant is randomly assigned to either the treatment group or the non-treatment group. In this way, the only difference between the two groups is the treatment they receive. Randomized experiments ideally guarantee that there is no confounding factor, and thus any observed association is actually causal. We operationalize the RCT as a perturbation classification task in Section 3.1.

Average Treatment Effect (ATE). In Section 3.2, we apply the ATE (Holland, 1986) as a measure of learnability. The ATE is based on the Individual Treatment Effect (ITE, Equation 9), which is the difference of the outcome with and without treatment:

ITE_i = Y_i(1) − Y_i(0).    (9)

Here, Y_i(1) is the outcome Y of individual i who receives the treatment (T = 1), while Y_i(0) is the opposite. In the above example, waking up with a headache (Y = 1) with shoes on (T = 1) means Y_i(1) = 1. We calculate the Average Treatment Effect (ATE) by taking an average over ITEs:

ATE = E[Y(1)] − E[Y(0)].    (10)

The ATE quantifies how the outcome Y is expected to change if we modify the treatment T from 0 to 1. We provide specific definitions of ITE and ATE in Section 3.2.
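To make the ITE and ATE definitions concrete under the accuracy-based outcomes of Section 3.2, here is a tiny numeric sketch with made-up indicator outcomes for five perturbed test examples; it is an illustration only, not part of the original experimental pipeline.

```python
import numpy as np

# Hypothetical correctness indicators for five perturbed test examples:
# y1[i] = Y_i(1), correctness on the perturbed input; y0[i] = Y_i(0), on the unperturbed input.
y1 = np.array([1, 1, 1, 0, 1])
y0 = np.array([0, 1, 0, 0, 1])

ite = y1 - y0        # Individual Treatment Effects (Equation 9): [1, 0, 1, 0, 0]
ate = ite.mean()     # Average Treatment Effect (Equation 10): 0.4
# Equals the accuracy difference of Equation 8: mean(y1) - mean(y0) = 0.8 - 0.4.
print(ite.tolist(), float(ate), float(y1.mean() - y0.mean()))
```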
C Alternate Definition of Perturbation Learnability

In Section 3.2, we propose an accuracy-based identification of the ATE. Here we discuss another, probability-based identification and compare the two. We can also define the outcome Y of a test example x_i as the predicted probability of the (pseudo) true label given by the trained model f(⋅):

Y_i(0) := P_f(L′ = l′_i | X = x_i) ∈ (0, 1).    (11)

Similarly, the outcome Y of a perturbed test data point x*_i is:

Y_i(1) := P_f(L′ = l′_i | X = x*_i) ∈ (0, 1).    (12)

For example, consider a test example (x_i, l′_i) which receives the treatment (l′_i = 1). Suppose the trained model f(⋅) predicts its label as 1 with probability only 0.1 before treatment (i.e., before it has been perturbed) and 0.9 after treatment. The Individual Treatment Effect (ITE, see Equation 9) of this example is then ITE_i = Y_i(1) − Y_i(0) = 0.9 − 0.1 = 0.8. We then take an average over all the perturbed test examples (half of the test set) as the Average Treatment Effect (ATE, see Equation 10), which is exactly the learnability of a perturbation for a model. To clarify, the two operands in Equation 10 are defined as follows:

E[Y(1)] := P(f, g, p, D′*_test).    (13)

This is the average predicted probability of the (pseudo) true label given by the trained model f(⋅) on the perturbed test set D′*_test.

E[Y(0)] := P(f, g, p, D′_test).    (14)

Similarly, this is the average predicted probability on the randomly labeled test set D′_test.

Notice that the accuracy-based definition of the outcome Y (Equation 6) can also be written in a form similar to the probability-based one (Equation 11):

Y_i(0) := 1{f(x_i) = l′_i} = 1{P_f(L′ = l′_i | X = x_i) > 0.5} ∈ {0, 1},    (15)

because the correctness of the prediction is equal to whether the predicted probability of the (pseudo) true label exceeds a threshold (i.e., 0.5). The major difference is that the accuracy-based ITE is a discrete variable taking values in {−1, 0, 1}, while the probability-based ITE is a continuous one ranging from −1 to 1. For example, if a model learns to identify a perturbation and thus changes its prediction from wrong (before perturbation) to correct (after perturbation), the accuracy-based ITE will be 1 − 0 = 1, while the probability-based ITE will be less than 1. That is to say, the accuracy-based ATE tends to vary more drastically than the probability-based one when inconsistent predictions occur more often, and thus better captures the nuances of perturbation learnability. Empirically, we find that accuracy-based average learnability varies more widely (σ = 0.375, Table 4) and thus better distinguishes between different model–perturbation pairs than the probability-based one (σ = 0.288, Table 4). As a result, we choose the accuracy-based ATE as the primary measurement of learnability in this paper.

D Investigating Learnability at a Specific Perturbation Probability

Inspired by Precision @ K in Information Retrieval (IR), we propose a similar metric dubbed Learnability @ p, which is the learnability of a perturbation for a model at a specific perturbation probability p. We are primarily interested in whether a selected p can represent the learnability over different perturbation probabilities and correlates well with robustness and post data augmentation ∆. We calculate the standard deviation (σ) of Learnability @ p and of average learnability (log AUC) over all model–perturbation pairs to measure how well each can distinguish between different models and perturbations. Table 4 shows that average learnability is more diversified than Learnability @ p at any single p, and that diversity (σ) peaks at p = 0.01 for both the accuracy-based and probability-based measurements. Accuracy-based Learnability @ p is generally more diversified across models and perturbations than its probability-based counterpart.
To investigate the strength of the correlations, we also calculate the Spearman ρ between accuracy-based/probability-based Learnability @ p and average learnability, robustness and post data augmentation ∆ over all model–perturbation pairs. Table 4 shows that average learnability generally has stronger correlations than Learnability @ p. Correlations with both robustness and post data augmentation ∆ peak at p = 0.02 for both the accuracy-based and probability-based measurements, and the correlations with average learnability (0.816*/0.886*) are also strong at these perturbation probabilities. Overall, a Learnability @ p with higher standard deviation correlates better with average learnability, robustness and post data augmentation ∆. Our analysis shows that if p is carefully selected by σ, Learnability @ p is also a promising metric, though not as accurate as average learnability. One advantage of Learnability @ p over average learnability is that it costs less time, since learnability is measured at only a single perturbation probability.

p     | Accuracy-based: σ | Avg Learn. | Robu.   | Post Aug ∆ | Probability-based: σ | Avg Learn. | Robu.   | Post Aug ∆
Avg.  | 0.375 | 1.000*  | -0.643* | 0.756*  | 0.288 | 1.000*  | -0.652* | 0.727*
0.001 | 0.182 | 0.426*  | -0.265  | 0.259   | 0.114 | 0.367*  | -0.279  | 0.288
0.005 | 0.235 | 0.637*  | -0.383* | 0.522*  | 0.192 | 0.925*  | -0.620* | 0.702*
0.01  | 0.263 | 0.741*  | -0.530* | 0.635*  | 0.192 | 0.893*  | -0.567* | 0.586*
0.02  | 0.257 | 0.816*  | -0.636* | 0.743*  | 0.192 | 0.886*  | -0.686* | 0.690*
0.05  | 0.236 | 0.279   | -0.158  | 0.136   | 0.121 | 0.576*  | -0.371* | 0.350*
0.1   | 0.241 | 0.354*  | -0.162  | 0.192   | 0.115 | 0.543*  | -0.288  | 0.258
0.5   | 0.094 | 0.024   | 0.155   | -0.179  | 0.037 | -0.080  | 0.114   | -0.258
1.0   | 0.011 | -0.199  | 0.252   | -0.332  | 0.019 | -0.220  | 0.294   | -0.402*

Table 4: Standard deviations (σ) of Learnability @ p and Spearman correlations between accuracy-based/probability-based Learnability @ p vs. average learnability/robustness/post data augmentation ∆ over all model–perturbation pairs on the IMDB dataset. ∗ indicates significance (p-value < 0.05).

E Additional Experiment Results

Figure 5: Linear regression plots of learnability vs. robustness vs. post data augmentation ∆ on the YELP dataset. Each point in the plots represents a model–perturbation pair. ρ is Spearman correlation. ∗ indicates high significance (p-value < 0.001). Panels: (a) Learnability vs. Robustness (ρ = −0.821∗); (b) Learnability vs. Post Aug ∆ (ρ = 0.846∗); (c) Learn. vs. Robu. vs. Post Aug ∆.

Figure 6: Linear regression plots of learnability vs. robustness vs. post data augmentation ∆ on the QQP dataset. Each point in the plots represents a model–perturbation pair. ρ is Spearman correlation. ∗ indicates high significance (p-value < 0.001). Panels: (a) Learnability vs. Robustness (ρ = −0.695∗); (b) Learnability vs. Post Aug ∆ (ρ = 0.75∗); (c) Learn. vs. Robu. vs. Post Aug ∆.
Perturbation | RoBERTa | XLNet | TextRNN | BERT | Average over models
shuffle_word | 1.538 | 1.586 | 0.401 | 1.854 | 1.345
butter_fingers_perturbation | 1.301 | 1.433 | 1.425 | 1.758 | 1.479
whitespace_perturbation | 1.276 | 1.449 | 1.720 | 1.569 | 1.504
insert_abbreviation | 1.437 | 1.370 | 2.241 | 1.572 | 1.655
random_upper_transformation | 1.432 | 1.828 | 1.733 | 1.715 | 1.677
visual_attack_letters | 2.060 | 2.006 | 2.030 | 1.808 | 1.976
leet_letters | 2.083 | 1.947 | 2.359 | 1.824 | 2.053

Table 5: Average learnability (log AUC of the corresponding curve in Figure 3) of each model–perturbation pair on the YELP dataset. Rows are sorted by average values over all models. The perturbation for which a model is most learnable is highlighted in bold while the following one is underlined.

Perturbation | RoBERTa | TextRNN | XLNet | BERT | Average over models
whitespace_perturbation | 0.732 | 0.399 | 0.562 | 0.711 | 0.601
duplicate_punctuations | 0.722 | 0.823 | 0.640 | 0.872 | 0.764
butter_fingers_perturbation | 0.555 | 0.878 | 0.775 | 1.022 | 0.808
insert_abbreviation | 0.820 | 1.440 | 0.960 | 1.206 | 1.107
random_upper_transformation | 1.062 | 0.664 | 1.392 | 1.483 | 1.150
shuffle_word | 1.231 | 0.816 | 1.552 | 1.623 | 1.306
visual_attack_letters | 1.429 | 1.810 | 1.744 | 1.608 | 1.648
leet_letters | 1.720 | 1.676 | 1.840 | 1.718 | 1.738

Table 6: Average learnability (log AUC of the corresponding curve in Figure 3) of each model–perturbation pair on the QQP dataset. Rows are sorted by average values over all models. The perturbation for which a model is most learnable is highlighted in bold while the following one is underlined.