ACL 2020 highlights – Joe

I had a great time at ACL this week. There were many great papers and I’m still going through them. Here’s a summary of just a few that I wanted to highlight. I’d love to get thoughts and retorts from anyone reading!

“To Test Machine Comprehension, Start by Defining Comprehension”

by Jesse Dunietz, Gregory Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and David Ferrucci

Like most great ideas, the framework presented here is simple – seemingly obvious, even. The authors take a specific look at Machine Reading Comprehension (MRC) and argue that current evaluation metrics don’t inspire much confidence that a system comprehends the relevant information in a passage well enough to be trusted in any real-world setting. They argue that rather than making questions harder, we should explicitly define so-called “Templates of Understanding” (ToUs) to measure the different dimensions of comprehension within a particular context. For example, for stories they lay out a ToU covering several dimensions of understanding; the motivational dimension comes up at length in the discussion further down this thread.

The authors do a great job thinking with clarity and simplicity about how we should approach evaluating MRC systems.

“Intermediate-Task Transfer Learning with Pretrained Language Models”

by Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut,
Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann,
Samuel R. Bowman

Recently the pre-train/fine-tune paradigm has become ubiquitous. This paper explores whether we can take advantage of labeled data during an intermediate training step between pre-training and fine-tuning on the target task. The authors perform a really extensive analysis of which kinds of datasets are useful for intermediate training and which downstream tasks they affect positively (or negatively).

A really interesting insight for me is that commonsense tasks never seem to have a negative effect: they either help on the downstream task or don’t have much of an effect at all. I wonder if this is because we have labeled commonsense data to use, or if we could build some kind of unsupervised commonsense objective into the pre-training procedure that would work just as well.
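To make the three-stage procedure concrete, here’s a toy sketch (my own illustration, not the paper’s code) of the pipeline: start from a “pretrained” shared encoder, fine-tune it on a labeled intermediate task with one head, then discard that head and fine-tune on the target task with a fresh head. The tiny numpy model stands in for a real pretrained language model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(encoder_w, X, y, steps=500, lr=1.0):
    """Fit a fresh binary-classification head on top of the shared
    encoder, updating the encoder weights as well (full fine-tuning)."""
    head_w = np.zeros(encoder_w.shape[1])
    for _ in range(steps):
        h = np.tanh(X @ encoder_w)           # shared representation
        p = sigmoid(h @ head_w)              # task-specific head
        grad_out = (p - y) / len(y)          # logistic-loss gradient
        head_w -= lr * (h.T @ grad_out)
        grad_h = np.outer(grad_out, head_w) * (1 - h**2)
        encoder_w -= lr * (X.T @ grad_h)     # backprop into the encoder
    return encoder_w, head_w

# Toy data: the intermediate and target tasks share underlying structure,
# which is the regime where intermediate training is expected to help.
X = rng.normal(size=(200, 8))
y_intermediate = (X[:, 0] + X[:, 1] > 0).astype(float)
y_target = (X[:, 0] > 0).astype(float)

encoder = rng.normal(scale=0.1, size=(8, 4))   # stage 1: "pretrained" encoder

# Stage 2: intermediate-task training updates the shared encoder;
# the intermediate head is thrown away afterwards.
encoder, _ = train_head(encoder, X, y_intermediate)

# Stage 3: fine-tune on the target task with a fresh head.
encoder, target_head = train_head(encoder, X, y_target)

preds = sigmoid(np.tanh(X @ encoder) @ target_head) > 0.5
accuracy = (preds == y_target).mean()
print(f"target-task accuracy: {accuracy:.2f}")
```

The key design point the paper studies is exactly what happens between stages 2 and 3: whether the encoder weights left behind by the intermediate task help or hurt the target task.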

“Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”

by Emily M. Bender and Alexander Koller

This paper is not focused on any one method or technique, but rather makes a general and pretty bold argument: meaning cannot be learned from form. In other words, just giving a model access to a whole bunch of text will never be enough for it to learn anything meaningful about the real world.

Whether you buy their argument or not, I found it to be an intellectually stimulating presentation. I suspect the hyperintelligent octopus argument will be one that sticks around for a long time.

I also appreciated their word of caution about the way we use different words when communicating about a model’s capabilities. At the very end of the presentation, Alexander warned,

As a community, let’s remember that we’re scientists and not marketing people. Let’s be a little bit careful when we use terms like understanding, meaning, and comprehension.


Intermediate-Task Transfer is a very practical one. It exhaustively provides many results that can help engineers save a lot of time. :smiley:

Thanks so much Joe @joeddav , I was trying to catch up on all the tutorials and workshops within my limited time, and I almost missed these extremely interesting papers.

I just finished watching the first paper (To Test Machine Comprehension, Start by Defining Comprehension – in a sense it has the same spirit as the CheckList best paper), and found that their slides and Rocket.Chat discussions are very valuable! Sadly, these materials will be deleted soon, so I took quick screen captures of some slides and would like to post the supplementary materials mentioned in Rocket.Chat here. Hopefully they can be useful for other people.

Temporary access to the dataset used in the paper:

Related papers on developing reading comprehension questions, suggested by Sowmya Vajjala and commented on by Jesse (the author):

My initial reaction is that the progression of “types of comprehension” listed there lays out a massive challenge for scaffolding up MRC to richer abilities. I don’t think people have been explicit about generating questions according to these categories, but many of them do appear in MRC datasets. Mostly people seem to focus on literal comprehension, throwing in reorganization/inference when they want to make the test harder. Prediction is sometimes tested as part of commonsense reasoning (e.g., Story Cloze).

As for how these categories relate to ToUs, I think it would mostly be as forms of error analysis. You’d establish in advance that you want your system to figure out from this text that Maria died at age 55, and then when it succeeds/fails, you’d want to count that in the “reorganization” bucket. I’m not sure how important the categories would be for generating questions, though—our argument is that questions should be generated in accordance with what content downstream applications need, not what mode of reasoning would be needed to get there.

Reut Tsarfaty asked a great question on the ‘motivational’ perspective:

I am particularly interested in the “motivational”. It seems you conflate it with “what if”, but this is a very small fragment of motivation sources. Motivation can come from goals (“teleological”): “We are set to achieve our financial goals at Q2”; personal prefs (“buletic”): “I prefer to sit outside”; morals (“deontic”): “you should not drink and drive”; and more. Did you have thoughts on structuring this space of (sources of) motivations for the prescribed events?

And the author replied with some valuable thoughts:

  • Thanks, Reut, and great question! You’ve put your finger on a point our exposition glossed over. We do actually allow for all of the types of motivation you listed, though there are probably others we haven’t yet encountered and will have to figure out how to handle.

In our scheme, any given explanation, whether mechanistic or motivational, has three main structural elements:

  1. The “root cause.”
  2. A series of “causal links” connecting the root cause to the outcome (as shown in Fig. 2 of the paper).
  3. The recursive explanations for the root cause and for each causal link, each of which consists of a) a general causal rule (“most dogs prefer not getting rained on to getting rained on”) and b) supporting facts that establish the causal rule applies (“Rover is a dog”).
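The three structural elements above can be sketched as a small data model. This is purely my own illustration of the structure the author describes (the class names and the dog/rain instance are mine, not the paper’s annotation format, which is plain English):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CausalRule:
    """A general causal rule plus the facts establishing it applies."""
    statement: str                # e.g. a generalization about dogs and rain
    supporting_facts: List[str]   # e.g. the fact that Rover is a dog

@dataclass
class CausalLink:
    """One step in the chain from root cause to outcome."""
    cause: str
    effect: str
    rule: CausalRule              # the general rule licensing this link

@dataclass
class Explanation:
    """Root cause, a chain of causal links, and recursive sub-explanations."""
    root_cause: str
    links: List[CausalLink]
    sub_explanations: List["Explanation"] = field(default_factory=list)

# Toy instance built from the example rule in the reply above:
rule = CausalRule(
    statement="most dogs prefer not getting rained on to getting rained on",
    supporting_facts=["Rover is a dog"],
)
explanation = Explanation(
    root_cause="Rover prefers staying dry",
    links=[
        CausalLink(
            cause="Rover prefers staying dry",
            effect="Rover goes inside",
            rule=rule,
        )
    ],
)
print(explanation.root_cause)
```

The recursion lives in `sub_explanations`: the root cause and each causal link can themselves be explained by another `Explanation` of the same shape.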

In motivational explanations—i.e., explanations where an agent is portrayed as taking a deliberate action—the root cause is always some form of preference over states expected to follow or happen concurrently with the action. In that sense, it does indeed have to be some sort of “what if”—e.g., if Timmy doesn’t take this action, he won’t get to sit outside. But the preference can be any form of desirability/undesirability. Here’s how we might handle the cases you listed:

  • Joanna would prefer that the organization achieve its Q2 financial goals rather than fall short of them.
  • Timmy would prefer sitting outside rather than inside.
  • Alice driving drunk would violate her moral standards, whereas driving in a normal state of mind would not.
    …and each would be recursively explained in terms of some general rule about what makes people consider such things desirable/undesirable. In the final case, that would probably mean stating that people generally think driving drunk is immoral.

Now, theoretically each statement of preference should be connected to the corresponding action by a general rule—e.g.:

  • Joanna cancels the event, rather than leaving it scheduled, because:
  • Joanna would prefer that the organization achieve its Q2 financial goals rather than fall short of them.
  • Joanna expects that:
  • <imagined causal chain connecting canceling/not canceling to meeting/falling short of goals>
    • When an agent prefers outcome X to outcome X’, and they believe action A will lead to outcome X whereas action A’ will lead to outcome X’, they often take action A instead of action A’.

But it’s unwieldy to include such a foundational piece of agentive behavior in every motivational explanation, so we allow annotators to assume it. Currently we have a small list of such general rules that annotators can assume:
• Agents act to realize their preferences.
• Agents act to fulfill their obligations.
• Agents act to conform to their moral standards.
(These are shorthand versions of the more unwieldy contrastive rules.)

I believe it’s that list that you were correctly pointing out we need; is that right?

And more:

  • The possible-worlds notion is definitely underlying our whole approach to describing causality and motivation: we’re assuming a Lewis-like notion of a nearby possible world where everything is the same except for one little tweak. (Important differences include that we don’t care whether possible worlds are metaphysically “real” and that we sometimes consider multiple nearby worlds if there are multiple salient contrasts.)

  • So far we’ve been sticking with plain English as the annotation format, so that we can work out all the content and conceptual structures intuitively without first committing to a formalism. That makes explicit formal semantics hard to incorporate. But in other corners of Elemental Cognition—particularly the ones working on systems that can actually _produce_ answers like this—we are indeed doing some formal representation, and we’ve discussed the need to represent various kinds of irrealis contexts, including the alternative possible worlds evoked by causal chains.

Lastly, Emily Bender (the author of the last octopus-argument paper that @joeddav mentioned) also joined the discussions. But I am not sure I should post them here since they are extremely long (50+ replies).


Stunningly, regarding the octopus paper (Bender & Koller 2020), which contains a challenge about advice on bear chasing, Gwern has tested this example with GPT-3 and found that GPT-3 can make many valid suggestions for dealing with a bear.