EMNLP Picks from the Hugging Face Science Team

The Hugging Face :hugs: team had a great time attending EMNLP the other week. Virtual conferences are tricky, but I've personally come to enjoy some aspects of them, like the pre-recorded presentations and gather.town mingling. And not having to travel is a plus, too :earth_asia::seedling:

Last week a few of us on the science team each tried to select 4-5 presentations we'd recommend others on the team check out. I've compiled our suggestions and included them here for those of you who are interested in our picks & very brief comments. Included are suggestions from myself, @VictorSanh, @yjernite, and @canwenxu (including a couple of repeats).

There was an incredible amount of high-caliber work and we could only share a few picks that we thought our team might be interested in, so feel free to respond with any suggestions (or comments) of your own!

Victor’s picks (@VictorSanh)

BLEU might be Guilty but References are not Innocent

Paper: https://arxiv.org/abs/2004.06063
Presentation: https://slideslive.com/38938647

Discusses a new reference-generation method for calculating more reliable automatic scores (including BLEU) that correlate better with human judgment, plus a dataset of references (included in sacrebleu, I believe).

Learning from Task Descriptions

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.105.pdf
Presentation: https://slideslive.com/38939344

Introduces a new dataset for structured, task-oriented evaluation on unseen tasks (0-shot settings), conditioned on a natural language description of the task. (Nice discussion; less convinced by the dataset itself.)

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.16/
Presentation: https://slideslive.com/38939219

Models can learn to represent linguistic features with little pretraining data, but they require orders of magnitude more data to learn to prefer linguistic generalizations over surface ones (and even then it is slow…).

Reformulating Unsupervised Style Transfer as Paraphrase Generation

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.55/
Presentation: https://slideslive.com/38938942

Proposes a simple method based on fine-tuning pretrained language models on automatically generated paraphrase data, discusses weaknesses in automatic metrics of style transfer, and releases a 15M-example style transfer dataset.

And a fifth pick: I found Emmanuel Dupoux's talk at CoNLL very informative.

Yacine’s picks (@yjernite)

ETC: Encoding Long and Structured Inputs in Transformers

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.19
Presentation: https://slideslive.com/38938951/etc-encoding-long-and-structured-inputs-in-transformers

Uses local attention plus one global attention token per sentence, which is trained with a contrastive loss similar to ICT.

A* Beam Search

Presentation: https://slideslive.com/38939414/bestfirst-beam-search

The A* algorithm is not quite as easy to batch as regular beam search, but it leads to better and more diverse n-best lists.

F2-Softmax: Diversifying Neural Text Generation via Frequency Factorized Softmax

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.737/
Presentation: https://slideslive.com/38938686

Pretty simple idea: groups tokens into bins of equal probability mass for a hierarchical softmax so the model can focus on choosing between candidates with the same prior. Leads to a nice improvement on human evaluation and generation diversity metrics.
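The binning step can be sketched in a few lines. This is my own illustration of the "equal probability mass" grouping (the function name and greedy strategy are mine, not from the paper): sort tokens by frequency, then fill bins until each holds roughly the same share of the total unigram mass.

```python
from typing import Dict, List

def equal_mass_bins(token_counts: Dict[str, int], num_bins: int) -> List[List[str]]:
    """Greedily group tokens (sorted by descending frequency) into bins
    whose total unigram probability mass is roughly equal."""
    total = sum(token_counts.values())
    target = total / num_bins  # mass each bin should hold
    bins, current, mass = [], [], 0
    for tok, count in sorted(token_counts.items(), key=lambda kv: -kv[1]):
        current.append(tok)
        mass += count
        if mass >= target and len(bins) < num_bins - 1:
            bins.append(current)
            current, mass = [], 0
    if current:
        bins.append(current)
    return bins
```

A hierarchical softmax over these bins first picks a bin, then a token within it, so frequent and rare tokens no longer compete directly in one flat distribution.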

Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.203
Presentation: https://slideslive.com/38938871

Pre-trains on BPE and fine-tunes on full character decomposition to get the model to train faster.

Towards Debiasing NLU Models from Unknown Biases

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.613
Presentation: https://slideslive.com/38938901

Related to @VictorSanh's recent paper: the "biases" tend to show up in easy-to-learn examples, so the method down-weights examples that are classified correctly early in training.
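To make the intuition concrete, here is a minimal sketch of loss reweighting based on early-training confidence. This is my own simplification, not the paper's exact formulation: I assume we already have, for each example, the probability a weak/early model assigned to the correct label, and simply scale the loss by one minus that confidence.

```python
import math

def debias_weight(early_prob_correct: float) -> float:
    """Down-weight examples that an early/weak model already gets right
    with high confidence (a hedged sketch of the reweighting intuition)."""
    return 1.0 - early_prob_correct

def weighted_nll(prob_correct: float, early_prob_correct: float) -> float:
    """Negative log-likelihood for one example, scaled by its debias weight."""
    return debias_weight(early_prob_correct) * -math.log(prob_correct)
```

Examples that are easy early on (high `early_prob_correct`) contribute little to the loss, pushing the model toward the harder, presumably less biased examples.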

Canwen’s picks (@canwenxu)

Experience Grounds Language

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.703.pdf
Presentation: https://slideslive.com/38938907

This may be the paper that defines the future direction of NLP. What should a model learn, and what abilities should a model have? This paper offers a good guess.

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.634.pdf
Presentation: https://slideslive.com/38938976

Yes, we know that fine-tuning a pretrained language model can cause forgetting. Mixout is one valid solution, but this EMNLP paper proposes an easy-to-use optimizer to address the problem.

Do sequence-to-sequence VAEs learn global features of sentences?

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.350.pdf
Presentation: https://slideslive.com/38939119

It's a little surprising to see this title because we all thought that, of course, VAEs do. However, through well-designed experiments, the authors reveal the other side of this claim.

Pre-Training Transformers as Energy-Based Cloze Models

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf
Presentation: https://slideslive.com/38939095

It's a really cool idea and it makes sense mathematically. Though the results are modest, there's definitely more to explore.

BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.633.pdf
Presentation: https://slideslive.com/38938938

Self-promotion here. It's a really neat idea: you can compress a model by simply replacing its components. No additional loss function needed.
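The replacement trick is easy to sketch. This is a simplified one-to-one illustration of the idea (in the actual paper, a successor module typically stands in for a group of predecessor layers): during training, each predecessor block is swapped for its smaller successor with some probability, and at the end only the successors are kept.

```python
import random
from typing import Callable, List

def theseus_forward(x, predecessors: List[Callable], successors: List[Callable],
                    replace_prob: float):
    """Forward pass where each predecessor block is independently swapped
    for its (smaller) successor with probability replace_prob.
    replace_prob=1.0 corresponds to keeping only the compressed model."""
    for pred, succ in zip(predecessors, successors):
        x = succ(x) if random.random() < replace_prob else pred(x)
    return x
```

Because successors are trained in place of predecessors on the same task loss, no distillation loss or extra objective is needed, which matches the "no additional loss function" point above.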

My picks

Learning from Task Descriptions

Paper : https://www.aclweb.org/anthology/2020.emnlp-main.105.pdf
Presentation : https://slideslive.com/38939344

@VictorSanh mentioned this one, but I want to include it as well. They create a new dataset for generalizing from one set of tasks to another using only task descriptions, w/o training data. It's an ambitious idea to try to formalize and evaluate, but I appreciated the work. I'm actually taking a break from adding their dataset "zest" to :hugs:Datasets to compile this post, so it should be up very soon.

Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.660
Presentation: https://slideslive.com/38939094

Another approach to "universal" NLP with cross-task generalization. The idea here is to pose various tasks as a single task (natural language inference), enabling transfer between tasks. Incidentally, the first author is the one who introduced the NLI-based zero-shot classification approach, which is roughly the same as the one we now use in our zero-shot pipeline & API.
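For those unfamiliar with the NLI trick, here is a minimal, library-free sketch of how classification gets posed as entailment. The `entailment_score` function is a stand-in for a real NLI model (in practice you'd use an NLI-finetuned transformer); the hypothesis template and the softmax over labels mirror the general approach.

```python
import math
from typing import Callable, Dict, List

def zero_shot_classify(
    text: str,
    labels: List[str],
    entailment_score: Callable[[str, str], float],
    template: str = "This example is about {}.",
) -> Dict[str, float]:
    """Pose classification as NLI: the text is the premise, each candidate
    label is turned into a hypothesis via the template, and the entailment
    scores are softmaxed over the labels."""
    scores = [entailment_score(text, template.format(label)) for label in labels]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return {label: e / z for label, e in zip(labels, exps)}
```

Because the label set is only seen at inference time, the same NLI model can classify into arbitrary, previously unseen categories.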

Text Classification Using Label Names Only: A Language Model Self-Training Approach

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.724
Presentation: https://slideslive.com/38938946

Similar to the "zero-shot" setup of Schick et al.'s PET and Yin et al.'s entailment-based approach (though they refer to it as "weak supervision" here). A nice difference from previous work is that they create groups of synonyms for each class label, which can be used as the class representation instead of the class name alone. Another demonstration that self-training with only unlabeled data works well for classification.

Experience Grounds Language

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.703.pdf
Presentation: https://slideslive.com/38938907

Really nice, kinda philosophical paper about computational understanding of language. They lay out different "world scopes" to help think about different levels of understanding/experience. Reminiscent in some ways of Bender & Koller's ACL paper this year, "Climbing towards NLU", and their superintelligent octopus.


I especially like the linguistic shout-outs in there, like Warstadt et al. It's always nice to see authors go back to what (generativist) linguistic theory has been saying for perhaps over sixty years and find ways to link that with how LMs "learn" grammar. I'll be having some time off soon and can't wait to catch up with all these latest developments! Thanks for the distillation (pardon the pun)!