The Hugging Face team had a great time attending EMNLP the other week. Virtual conferences are tricky, but I personally have come to enjoy some aspects of them, like the pre-recorded presentations and gather.town mingling. And not having to travel is a plus, too.
Last week a few of us on the science team each tried to select 4-5 presentations we’d recommend others on the team check out. I’ve compiled our suggestions and included them here for those of you who are interested in our picks & very brief comments. Included are suggestions from myself, @VictorSanh, @yjernite, and @canwenxu (including a couple of repeats).
There was an incredible amount of high-caliber work and we could only share a few papers we thought our team might be interested in, so feel free to respond with any suggestions (or comments) of your own!
Victor’s picks (@VictorSanh)
BLEU might be Guilty but References are not Innocent
Discusses a new reference-generation method for calculating more reliable automatic scores (including BLEU) that correlate better with human judgment, plus a dataset of references (included in sacrebleu, I believe).
Learning from Task Descriptions
Introduces a new dataset for structured, task-oriented evaluation on unseen tasks (zero-shot setting), conditioned on a natural language description of the task. (Nice discussion; less convinced by the dataset itself.)
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
Models can learn to represent linguistic features with little pretraining data, but they require orders of magnitude more data to learn to prefer linguistic generalizations over surface ones (i.e., it is slow…).
Reformulating Unsupervised Style Transfer as Paraphrase Generation
Proposes a simple method based on fine-tuning pretrained language models on automatically generated paraphrase data, discusses weaknesses in automatic metrics for style transfer, and releases a 15M-example style transfer dataset.
A 5th pick: I found Emmanuel Dupoux’s talk at CoNLL very informative.
Yacine’s picks (@yjernite)
ETC: Encoding Long and Structured Inputs in Transformers
Uses local attention plus one global attention token per sentence, trained with a contrastive loss similar to ICT.
A* Beam Search
The A* algorithm is not quite as easy to batch as regular beam search, but it leads to better and more diverse n-best lists.
F2-Softmax: Diversifying Neural Text Generation via Frequency Factorized Softmax
Pretty simple idea: groups tokens into bins of equal probability mass for a hierarchical softmax so the model can focus on choosing between candidates with the same prior. Leads to a nice improvement on human evaluation and generation diversity metrics.
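The binning step can be sketched in a few lines: sort the vocabulary by frequency and cut it into groups of roughly equal total mass, so each bin carries about the same prior probability. This is a toy sketch of that idea under my own assumptions, not the authors’ code:

```python
from collections import Counter

def equal_mass_bins(token_counts, num_bins):
    """Partition tokens into bins of roughly equal total frequency mass,
    in the spirit of F2-Softmax (illustrative sketch only)."""
    total = sum(token_counts.values())
    target = total / num_bins
    # Sort tokens from most to least frequent
    tokens = sorted(token_counts, key=token_counts.get, reverse=True)
    bins, current, mass = [], [], 0
    for tok in tokens:
        current.append(tok)
        mass += token_counts[tok]
        # Close the bin once it has accumulated its share of the mass
        if mass >= target and len(bins) < num_bins - 1:
            bins.append(current)
            current, mass = [], 0
    if current:
        bins.append(current)
    return bins

counts = Counter({"the": 50, "a": 30, "cat": 10, "dog": 6, "zebra": 4})
bins = equal_mass_bins(counts, 2)  # [["the"], ["a", "cat", "dog", "zebra"]]
```

A hierarchical softmax over these bins first picks a bin, then picks a token within it, so the model chooses between candidates with similar priors.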
Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems
Pre-trains on BPE and fine-tunes on full character decomposition to get the model to train faster.
Towards Debiasing NLU Models from Unknown Biases
Related to @VictorSanh’s recent paper: the “biases” tend to show up in easy-to-learn examples, so the method down-weights examples that are classified correctly early in training.
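To make the reweighting intuition concrete: examples the model already handles confidently early in training contribute less to the loss later on. This is my own simplification, not the paper’s exact confidence-regularization scheme:

```python
def example_weights(early_probs_correct):
    """Down-weight training examples the model already gets right early in
    training (a rough sketch; the paper's method is more involved).

    early_probs_correct: probability assigned to the gold label by an
    early-training checkpoint, one value per example."""
    return [1.0 - p for p in early_probs_correct]

# A confidently-correct example (0.95) gets a small weight; a hard one (0.30)
# keeps most of its weight.
weights = example_weights([0.95, 0.30, 0.60])
```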
Canwen’s picks (@canwenxu)
Experience Grounds Language
This may be the paper that defines the future direction of NLP. What should a model learn, and what abilities should a model have? You can find a good guess in this paper.
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
Yes, we know that fine-tuning a pretrained language model can bring the problem of forgetting. Mixout is a valid solution, but this EMNLP paper proposes an easy-to-use optimizer to resolve the problem.
Do sequence-to-sequence VAEs learn global features of sentences?
It’s a little surprising to see this title because we all thought that of course VAEs do. However, through well-designed experiments, the authors reveal the other side of this claim.
Pre-Training Transformers as Energy-Based Cloze Models
It’s a really cool idea, and it makes sense mathematically. Though the results are modest, there’s definitely more to explore.
BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
Self-promotion here. It’s a really neat idea that you can compress a model by simply replacing its components, with no additional loss function needed.
My picks
Learning from Task Descriptions
Paper : https://www.aclweb.org/anthology/2020.emnlp-main.105.pdf
Presentation : https://slideslive.com/38939344
@VictorSanh mentioned this one but I want to include it as well. They create a new dataset trying to generalize from one set of tasks to another using only task descriptions w/o training data. It’s an ambitious idea to try to formalize and evaluate but I appreciated the work. I’m actually taking a break from adding their dataset “zest” to Datasets to compile this post, so it should be up very soon.
Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start
Another approach to “universal” NLP with cross-task generalization. The idea here is to pose various tasks as a single task (natural language inference), enabling transfer between tasks. Incidentally, the first author is the same one who introduced the NLI-based zero-shot classification approach, which is roughly the same as the one we now use in our zero-shot pipeline & API.
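For anyone curious how the NLI reformulation works mechanically, here’s a minimal sketch: each candidate label is slotted into a hypothesis template and paired with the input as the premise, and an entailment model then scores each pair. The template string mirrors the default used in our zero-shot pipeline; the function name is my own:

```python
def to_nli_pairs(premise, candidate_labels,
                 hypothesis_template="This example is {}."):
    """Reformulate classification as NLI: pair the input (premise) with one
    hypothesis per candidate label. An entailment model scores each pair,
    and the label whose hypothesis is most entailed wins."""
    return [(premise, hypothesis_template.format(label))
            for label in candidate_labels]

pairs = to_nli_pairs("Who are you voting for in 2020?",
                     ["politics", "sports"])
```

With transformers installed, `pipeline("zero-shot-classification")` wraps this same reformulation together with the entailment scoring.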
Text Classification Using Label Names Only: A Language Model Self-Training Approach
Similar to the “zero-shot” setup of Schick et al.’s PET and Yin et al.’s entailment-based approach (though they refer to it as “weak supervision” here). A nice difference from previous work is that they create groups of synonyms for a class label, which can be used as a class representation instead of the class name alone. Another demonstration that self-training with only unlabeled data works well for classification.
Experience Grounds Language
A really nice, kinda philosophical paper about computational understanding of language. They lay out different “world scopes” to help think about different levels of understanding/experience. Reminiscent in some ways of Bender & Koller’s ACL paper this year, “Climbing towards NLU”, and their superintelligent octopus.