ACL 2020 highlights - Yacine

These are some of the papers I discovered at this year’s ACL conference. I focused on three main themes:

  • Model Analysis
  • (Conditional) Text Generation
  • Society & Ethics and NLP

I tried to provide a short summary for each of the papers outlining the methods and contributions. Please refer to the papers themselves for more details; they are all well worth the read!

I was particularly impressed by the depth of thinking in a lot of the papers accepted to the Ethics & NLP track, and would love to have further conversations about them here!

Link to the Google Docs version

Model Analysis

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?

This work proposes an experimental setup that asks humans to simulate model behaviour, in order to evaluate how much insight various visualization and explainability methods actually give users. In the first experiment, users are asked to predict model outputs, then shown explanations for these outputs provided by automated tools. They are then asked to predict outputs for a new set of examples, and the usefulness of the automatic explanation tools is measured by how much their accuracy improves in this second stage. Another experiment shows users model outputs and explanations, and asks them to predict the model behavior on counterfactual examples where the input is perturbed in a targeted fashion. The authors show that the measured accuracy improvements give more interpretable and reliable information about the quality of the explanation tool than subjective Likert-scale judgments. Replicating this study at a larger scale seems a promising way to evaluate explanation tools.
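The measurement in the first experiment can be sketched as a difference in user accuracy before and after seeing explanations. This is a minimal illustration, not the authors' code: the function names and the toy labels are made up, and the paper's actual protocol controls for more than this.

```python
def accuracy(predictions, model_outputs):
    """Fraction of user predictions that match the model's actual outputs."""
    if len(predictions) != len(model_outputs):
        raise ValueError("predictions and model_outputs must have the same length")
    matches = sum(p == m for p, m in zip(predictions, model_outputs))
    return matches / len(model_outputs)


def simulatability_gain(pre_preds, pre_outputs, post_preds, post_outputs):
    """Change in user accuracy at predicting the model after seeing explanations."""
    return accuracy(post_preds, post_outputs) - accuracy(pre_preds, pre_outputs)


# Toy data (made up): a user guesses a sentiment model's outputs on four
# examples before seeing explanations, then on four new examples afterwards.
gain = simulatability_gain(
    pre_preds=["pos", "neg", "pos", "pos"],
    pre_outputs=["pos", "pos", "neg", "pos"],
    post_preds=["neg", "neg", "pos", "pos"],
    post_outputs=["neg", "neg", "pos", "neg"],
)
print(f"Accuracy gain after explanations: {gain:+.2f}")
```

A positive gain suggests the explanations helped the user anticipate the model; a gain near zero suggests they did not.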

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

This paper proposes a framework to develop checklists: suites of tests that can be applied to various NLP models to check for “bugs”. One significant difference between the proposed checklist approach and the benchmarks that have been guiding the progress of the field is that the former is more targeted: instead of reporting the average performance of a model across a large test set created through crawling or crowd-sourcing, it proposes to come up with a set of simple unit tests corresponding to use cases we want to ensure our systems succeed at before they can be deployed and used. In order to make this process systematic and affordable, one important contribution of this work is a set of tools which allow practitioners to easily and efficiently design such testing suites by providing an intuitive UI and leveraging models to suggest likely test examples. Allowing people to easily develop, share and aggregate these test suites has the potential to significantly increase user trust in NLP models.
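To make the "unit tests for models" idea concrete, here is a small sketch in plain Python. It does not use the authors' tooling: `toy_sentiment` is a hypothetical stand-in for the model under test, and the two helpers are illustrative versions of a minimum functionality test and an invariance test.

```python
def toy_sentiment(text):
    """Hypothetical stand-in for the model under test: a naive keyword classifier."""
    lowered = text.lower()
    if "good" in lowered or "great" in lowered:
        return "positive"
    if "bad" in lowered or "terrible" in lowered:
        return "negative"
    return "neutral"


def minimum_functionality_test(model, cases):
    """Return the (text, expected, got) triples the model gets wrong."""
    failures = []
    for text, expected in cases:
        got = model(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures


def invariance_test(model, pairs):
    """Return the pairs whose prediction changes under a label-preserving edit."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


mft_cases = [
    ("The food was good.", "positive"),
    ("The food was not good.", "negative"),  # simple negation
]
inv_pairs = [
    # Swapping the city name should not change the sentiment.
    ("The flight to Paris was great.", "The flight to Rome was great."),
]

mft_failures = minimum_functionality_test(toy_sentiment, mft_cases)
inv_failures = invariance_test(toy_sentiment, inv_pairs)
print(f"MFT failures: {len(mft_failures)} of {len(mft_cases)}")
print(f"Invariance failures: {len(inv_failures)} of {len(inv_pairs)}")
```

The keyword model passes the invariance check but fails the negation case, which is the kind of targeted "bug" an aggregate test-set accuracy would hide.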

Conditional Generation

Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

These two concurrent papers take a similar approach to evaluating the factuality of generated abstractive summaries of news text, but are complementary in their implementation and analysis. The basic idea is that we can check whether a summary conveys information that is faithful to the source material by checking that a question answering system will give similar answers when using either as supporting document. The questions are generated through a two-step process: first, use a heuristic to identify spans in the summary we want to check for accuracy, then use an automatic question generation system to obtain questions whose answers should be those spans. If a machine reading comprehension system finds the same answer to the question when reading the article as when reading the summary, the information is probably correct. FEQA and QAGS (the metric proposed in the first paper) differ in how they filter the candidate spans and how they compare agreement, but both find that question-based metrics correlate better with human judgments of factuality than other metrics. One caveat, however, is that both methods work better on CNN/DM than on XSum, which is more abstractive in nature. Finally, the QAGS authors note that in addition to being used as an aggregated automatic metric, these methods can be useful for visualizing specific examples in human-in-the-loop settings.
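The agreement step can be sketched as follows. This is an illustration under assumptions, not either paper's implementation: question generation and question answering are replaced by a canned lookup table, `answer_fn` is a hypothetical interface, and token-overlap F1 is used here as one plausible way to compare answers.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-overlap F1 between two answer strings (simplified, whitespace tokens)."""
    tokens_a = answer_a.lower().split()
    tokens_b = answer_b.lower().split()
    if not tokens_a or not tokens_b:
        return float(tokens_a == tokens_b)
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(questions, answer_fn, source, summary):
    """Average answer agreement when the QA system reads the source vs. the summary."""
    scores = [
        token_f1(answer_fn(q, source), answer_fn(q, summary)) for q in questions
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Canned answers standing in for a real reading comprehension model (made up).
canned_answers = {
    ("Who won the match?", "article"): "the home team",
    ("Who won the match?", "summary"): "the home team",
    ("Where was it played?", "article"): "in Madrid",
    ("Where was it played?", "summary"): "in Lisbon",
}


def lookup_answer(question, document):
    return canned_answers[(question, document)]


questions = ["Who won the match?", "Where was it played?"]
score = qa_consistency(questions, lookup_answer, "article", "summary")
print(f"Consistency score: {score:.2f}")
```

The second question exposes the mismatch between article and summary, which lowers the score and also points to the specific span a human reviewer should look at.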

On Faithfulness and Factuality in Abstractive Summarization

This paper further investigates the state of the art for the factuality/faithfulness of abstractive summarization by providing a large-scale human evaluation of the hallucinations produced by recently published systems. This work classifies the hallucinations into an intrinsic (model misunderstands the input) and an extrinsic (model invents completely new facts) category. Note that in this setting, factual information is still considered to be a hallucination if it’s not in the input. The paper focuses on XSum (one-sentence summaries, abstractive in nature), and provides annotations for the output of models published up to 2019. As a result, large-scale pre-trained seq2seq models (T5, BART) are missing. The authors also note that an NLI model can be used for summary selection to improve faithfulness, at the cost of ROUGE. The annotations are available at:

Exploring Content Selection in Summarization of Novel Chapters

The authors take some steps towards training a book chapter summarization model: namely, they gather summaries of public domain book chapters from study guide websites, use these to align book sentences to summary sentences using IDF-weighted ROUGE (which seems to work better than plain ROUGE, METEOR, or BERT; it would be interesting to see BLEURT/BERTScore results), and train an RNN-based extractive summarization system using these noisy labels. The authors still have to release their pre-processed data and (hopefully) noisy labels, but this is a nice foray into long-input summarization outside of the news/scientific article domain.
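A rough sketch of the alignment idea: weight each token by its inverse document frequency so that rare words (names, distinctive nouns) count more than function words, then match each summary sentence to the chapter sentence with the highest weighted overlap. This is a simplified unigram-recall variant written for illustration; the exact ROUGE variant, IDF smoothing, and tokenization below are assumptions, not the paper's.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (tok.strip(".,;:!?\"'()").lower() for tok in sentence.split())
    return [tok for tok in stripped if tok]


def idf_table(sentences):
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(sentences)
    doc_freq = Counter(tok for s in sentences for tok in set(tokenize(s)))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def weighted_recall(summary_sentence, chapter_sentence, idf, default_idf):
    """IDF-weighted share of summary tokens also found in the chapter sentence."""
    summary_counts = Counter(tokenize(summary_sentence))
    chapter_counts = Counter(tokenize(chapter_sentence))
    total = sum(idf.get(t, default_idf) * c for t, c in summary_counts.items())
    if total == 0:
        return 0.0
    matched = sum(
        idf.get(t, default_idf) * min(c, chapter_counts[t])
        for t, c in summary_counts.items()
    )
    return matched / total


def align(summary_sentences, chapter_sentences):
    """For each summary sentence, return the index of the best chapter sentence."""
    idf = idf_table(chapter_sentences)
    default_idf = math.log(1 + len(chapter_sentences)) + 1.0  # token unseen in chapter
    alignment = []
    for summary_sentence in summary_sentences:
        scores = [
            weighted_recall(summary_sentence, chapter_sentence, idf, default_idf)
            for chapter_sentence in chapter_sentences
        ]
        alignment.append(scores.index(max(scores)))
    return alignment


# Toy chapter and summary (made up for illustration).
chapter = [
    "The captain walked along the deck in the rain.",
    "The whale rose beside the ship.",
    "The crew ate supper in the cabin.",
]
summary = ["The captain paced the deck.", "A whale appears near the ship."]
alignment = align(summary, chapter)
print(alignment)
```

The aligned chapter sentences would then serve as the noisy extractive labels for training.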

Dataset Information

About 8,000 chapters from Project Gutenberg books, with 2-5 summaries per chapter gathered from study guide websites (licensing!). Chapters are ~5,200 words, summaries are ~370 words.

Script to re-construct dataset at

Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

The paper explores how we can use pre-trained encoder-only and decoder-only models to warm-start an encoder-decoder model. While their method still lags behind full encoder-decoder pre-trained models (R1 on XSum for their method: 41.45 vs BART: 45.14), they show some improvements over the baselines by initializing encoder and decoder weights with RoBERTa checkpoints, and randomly initializing cross-attention. The model can even be made more memory-efficient by sharing encoder and decoder weights.

Improved Natural Language Generation via Loss Truncation

Not specific to conditional generation. The authors argue that log-likelihood as a loss is not robust to noise since it needs to have probability mass on every seen example (including outliers). Instead, our primary aim should be to ensure that generations from the model are indistinguishable from natural language data. The authors show that a truncated log-likelihood loss can serve as an upper bound for a measure of distinguishability. Generations from the full output distribution of a model trained with truncated loss are rated better than top-k or top-p sampling for a model trained with the full loss when evaluated with HUSE.
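The core of the truncation idea can be sketched in a few lines: compute per-example losses, drop the highest-loss fraction, and average the rest. This is a minimal sketch assuming a fixed `drop_fraction` applied per batch; the name is hypothetical and the paper's actual training procedure may estimate the cutoff differently.

```python
def truncated_mean_loss(example_losses, drop_fraction):
    """Average the per-example losses after dropping the highest-loss fraction."""
    if not example_losses:
        raise ValueError("example_losses must not be empty")
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_dropped = int(len(example_losses) * drop_fraction)
    kept = sorted(example_losses)[: len(example_losses) - n_dropped]
    return sum(kept) / len(kept)


# Toy per-example negative log-likelihoods (made up); the last one is an outlier,
# e.g. a noisy reference the model cannot reasonably be expected to produce.
batch_losses = [0.5, 1.0, 1.5, 9.0]
full_loss = truncated_mean_loss(batch_losses, drop_fraction=0.0)
truncated_loss = truncated_mean_loss(batch_losses, drop_fraction=0.25)
print(f"full: {full_loss:.2f}, truncated: {truncated_loss:.2f}")
```

With the outlier removed, the gradient is no longer dominated by the one example the model would otherwise be forced to assign probability mass to.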

Society & Ethics and NLP

Social Biases in NLP Models as Barriers for Persons with Disabilities

The authors consider the effect of the mention of disability on sentiment and toxicity classifiers, and the subsequent impact on the life and discourse of people with disabilities. They show that commonly used classifiers consistently assign higher toxicity scores and more negative sentiment scores to text that mentions disability, which would among other things expose people to a heavier burden of content moderation false positives when talking about their own disability. The authors trace these biases in part to BERT model behavior and to dynamics of the training data. The authors also discuss the necessity of involving the affected communities in work about ableism in ML and NLP, and describe which resources from advocacy groups they relied on for their experimental design.

Social Bias Frames: Reasoning about Social and Power Implications of Language

The authors propose a new annotation scheme for offensive language which goes beyond binary hate speech classification and focuses on the intent of the utterance: the annotators are asked to identify the target group, whether the utterance is an instance of in-group speech, and to explicitly write out the offensive implication. The authors created a fairly large dataset of 45k posts from a variety of sources using these guidelines and fine-tuned a GPT-2 model to predict the frames. The model has some initial success but still leaves room for improvement, especially in generating better explanations.

Dataset Information

The dataset consists of 45k utterances collected from Twitter, Reddit, as well as known hate sites. 42% are classified as offensive, only about 5% have the in-group annotation. The total data is made up of 150k training pairs since several posts target multiple groups.

The paper provides a section on ethical considerations of making and using the dataset and describes the demographic makeup of the annotators.

Language (Technology) is Power: A Critical Survey of “Bias” in NLP

The authors start by reviewing a large number of papers on bias in NLP systems, and find that there is a common lack of rigorous definition or motivation of the problem they aim to address, inconsistencies in the way bias is defined across the field, and a general lack of engagement with relevant sociolinguistic work. As a result, the authors propose a set of recommendations for future work which include: grounding work in the relevant literature outside of NLP that explores the relationships between language and social hierarchies; explicitly stating why the system behaviors described are harmful, in what ways, and to whom; and examining language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. To illustrate how these recommendations can be interpreted in practice, the authors present a case study of African American English. The whole paper is packed with citations to relevant recent work that make up a necessary reading list for NLP practitioners aiming to think more deeply about the societal impact of their work.

Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?

The authors analyse an EMNLP 2019 paper on automatic legal sentencing as a case study for learning how to work toward an ethical assessment of works in the field. Specifically, the work relies on previously published recommendations for data statements (Bender and Friedman, 2018) and dataset sheets (Gebru et al., 2018) to ask and answer a number of fundamental questions about the creation and use of the dataset. The paper then describes the concept of dual use, encouraging dataset and algorithm creators to consider whether alternative uses of their work may have nefarious effects. Overall, this paper can be a good introduction to the above cited works specifically and ethical considerations about work in NLP more broadly.