For this science Tuesday, I read MARGE, and wrote up a brief summary, as well as some interesting questions to discuss @joeddav @srush @VictorSanh @thomwolf @clem @julien-c @teven @patrickvonplaten @yjernite (only allowed 10 tags)
Pre-training via Paraphrasing (MARGE)
Paper: published June 26, 2020
Authors are from Facebook AI Research:
Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer.
Huge models trained with a masked-LM pretraining objective (or similar) memorize lots of facts in their parameters and don’t use any external storage to look up facts they are missing. Human brains seem to have separate systems for memorizing facts and generating language, and we often google things. In that spirit, the goal of many transformer+retriever models is to decouple memorization of facts from language understanding. MARGE stands for a Multi-lingual Autoencoder that Retrieves and GEnerates.
The pretraining setup:
Reconstruct an original document by retrieving related documents (from Wikipedia) and trying to regenerate the original: maximize the likelihood of the original doc conditioned on the retrieved docs and their relevance scores. Because the relevance scores feed into reconstruction, this implicitly forces the retriever to learn to produce good relevance scores.
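To make “conditioned on the retrieved docs and their relevance scores” concrete: as I understand it, each retrieved doc gets a cosine-similarity relevance score, and that score is added to the decoder’s cross-attention logits over that doc’s tokens, so gradients from the reconstruction loss reach the retriever. Below is a toy sketch; the names, shapes, and scaling are my own simplifications, not the paper’s implementation.

```python
# Toy sketch (my own simplification, not the authors' code) of how relevance
# scores can bias the decoder's cross-attention so that the reconstruction
# loss also trains the retriever.
import torch
import torch.nn.functional as F

def relevance(target_emb, evidence_embs):
    # f(x, z_j): cosine similarity between document embeddings
    # (assumption: one vector per doc, e.g. the encoder's first-token state).
    return F.cosine_similarity(target_emb.unsqueeze(0), evidence_embs, dim=-1)

def relevance_biased_cross_attention(queries, evidence_tokens, scores):
    # queries:         (tgt_len, d)    decoder states reconstructing the target
    # evidence_tokens: (M, src_len, d) encoder outputs for M retrieved docs
    # scores:          (M,)            relevance f(x, z_j) per evidence doc
    M, src_len, d = evidence_tokens.shape
    keys = evidence_tokens.reshape(M * src_len, d)
    logits = queries @ keys.T / d ** 0.5          # (tgt_len, M * src_len)
    bias = scores.repeat_interleave(src_len)      # one bias per evidence token
    attn = torch.softmax(logits + bias, dim=-1)   # relevance-biased attention
    return attn @ keys                            # (tgt_len, d)

# Tiny usage example with random tensors.
d, M, src_len, tgt_len = 16, 3, 5, 4
scores = relevance(torch.randn(d), torch.randn(M, d))
out = relevance_biased_cross_attention(torch.randn(tgt_len, d),
                                       torch.randn(M, src_len, d), scores)
print(out.shape)  # torch.Size([4, 16])
```

The intuition behind the bias term: making a doc more relevant makes it easier to copy from, which is exactly the gradient signal the retriever needs.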
There are some tricks to avoid scoring all of Wikipedia for every example while still keeping relevant articles in each batch.
Every 10k training steps, they rebuild their batches by computing the cosine similarity of every pair of docs, then greedily adding source and target docs to batches so that the sum of pairwise cosine similarities within each batch increases the most (sketched below). This obviously seems hacky, but it lets them avoid approximate nearest-neighbor search or some other expensive way to find related docs. This, plus the fact that a randomly initialized encoder already gives docs with lexical overlap higher-than-random cosine similarity, lets the model train from random initialization.
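For intuition, here is a rough sketch of what that greedy batching could look like. This is my own simplified version (the paper also shards by source/language and uses its own exact greedy criterion), so treat the function name and details as assumptions.

```python
# Very rough sketch of the batch-construction heuristic as I understand it;
# not the paper's actual procedure.
import numpy as np

def greedy_batches(doc_embs, batch_size):
    """Group documents so that each batch holds mutually similar docs."""
    embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = embs @ embs.T                      # pairwise cosine similarities
    unassigned = set(range(len(embs)))
    batches = []
    while unassigned:
        batch = [unassigned.pop()]           # seed a batch with any leftover doc
        while len(batch) < batch_size and unassigned:
            # Greedily add the doc whose summed similarity to the current
            # batch members is largest.
            best = max(unassigned, key=lambda j: sim[batch, j].sum())
            unassigned.remove(best)
            batch.append(best)
        batches.append(batch)
    return batches

# Usage: split 10 random "documents" into batches of 4.
print(greedy_batches(np.random.randn(10, 8), batch_size=4))
```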
The retrieval model, ideally, can focus on getting the transformer all the facts that it needs while the transformer learns to paraphrase, which requires generating fluent language.
For finetuning/inference, you don’t need to use the retrieval part.
Results:
- comparable to XLM-RoBERTa, with 20% of the pretraining compute
- comparable to mBART on de-en and en-zh translation
- SOTA on MLSum, a cross-lingual summarization task
How it differs from related work:
(1) Most of the related work is not multilingual.
(2) Most of the related work does not zero-shot well?
(3) This pretraining objective unifies learning to retrieve and learning to generate; previous work requires two pretraining stages.
REALM: “At a high level, the method goes like this: find the most similar text passages in BERT space, add those passages to the input as additional context, and then make a prediction.” -Joe a few weeks ago
- Different because the retriever has to be pretrained separately. REALM also seems to evaluate mostly on open-domain QA benchmarks.
RAG (Retrieval-Augmented Generation)
- Different because it is mostly focused on knowledge-intensive benchmarks; MARGE can also do well on translation.
- Starts with BART-large + DPR, whereas MARGE pretrains end-to-end.
Questions somebody could answer:
- Does MARGE outperform BART on English-only benchmarks like GLUE or XSum summarization? Why did they only show multilingual benchmarks?
- When will there be code?
- How long does a forward pass take?
- What are the consequences of not using retrieval during inference? Does the model not “know” anything?
- Is Translation “knowledge intensive”?
- How could we measure hallucinations?
- The authors suggest that we should use a pretraining objective that is as close as possible to the downstream task. The PEGASUS paper also suggests this. Where else could this idea be applied?