Tiny mBART doc/info

I could not find any documentation/info for the sshleifer/tiny-mbart model. How big is it? How was it trained? What is the performance, etc.? Did I miss something?

AFAIK, this model was created for testing purposes. Pinging @sshleifer for confirmation.

…any chance of a production-ready smaller mBART? Something like mbart-base?

Not that I know of.

How small do you want, @martin-avrios?
For English-Romanian or cc25?

I will distill anything en-ro if you can figure out TPU 🙂

tiny-mBART is just for testing purposes, it’s randomly initialized.
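
For reference, it loads like any other seq2seq checkpoint; the output is just meaningless because the weights are random (a quick sketch, useful only for smoke-testing a pipeline):

```python
# Loads like any other seq2seq checkpoint, but with tiny random weights,
# so it's only useful for exercising code paths, not for real output.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-mbart")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/tiny-mbart")
print(f"{model.num_parameters():,} parameters")  # tiny compared to mbart-large-cc25

batch = tok(["Ein kurzer Testsatz."], return_tensors="pt")
generated = model.generate(**batch, max_new_tokens=8)
print(tok.batch_decode(generated, skip_special_tokens=True))  # gibberish, by design
```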

For cc25. I am working with mostly German and some French and Italian docs, and if I don't freeze embeds then mbart-large-cc25 does not fit into the 16GB V100 (the largest single GPU Google Cloud offers). Not even with batch size 1. So I thought I'd try a smaller mBART. But yes, TPU would be even more awesome because then I could experiment with a lot more.

Yeah I have been running everything with --freeze_embeds.
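
For anyone wondering, the flag roughly amounts to something like this (a sketch, not the exact script code): the shared embedding table is the bulk of the parameters, so skipping its gradients and optimizer state saves a lot of memory.

```python
# Roughly what --freeze_embeds does: no gradients (and no Adam state)
# for the large shared token embeddings and the positional embeddings.
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

modules_to_freeze = [
    model.model.shared,                   # token embeddings shared by encoder and decoder
    model.model.encoder.embed_positions,  # learned positional embeddings
    model.model.decoder.embed_positions,
]
for module in modules_to_freeze:
    for param in module.parameters():
        param.requires_grad = False
```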

I’m also trying to figure out how to trim the embeddings, as most of them aren’t used, but blocked on
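
To illustrate the general idea of trimming (just a sketch, assuming a recent transformers version, not a tested recipe): keep only the rows of the shared embedding matrix for token ids the corpus actually uses. The tokenizer and any language-code special tokens would still have to be remapped to the new ids separately.

```python
# Rough sketch: shrink mBART's shared embedding matrix to the token ids
# that actually occur in the training corpus.
import torch
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Hypothetical: sorted ids of tokens seen in the corpus, plus special tokens.
kept_ids = torch.tensor([0, 1, 2, 3, 250004])  # placeholder list

with torch.no_grad():
    emb = model.get_input_embeddings()
    # Move the kept rows to the front, then truncate the vocabulary.
    emb.weight[: len(kept_ids)] = emb.weight[kept_ids]

# resize_token_embeddings also takes care of the tied lm_head.
model.resize_token_embeddings(len(kept_ids))
```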

I’m also happy to finetune mbart-cc25 on a public dataset for you, if that would help.
Also, have you tried using Marian?

Have not tried Marian yet, but it seems interesting. It's for translation, right? Since I framed my problem as a translation task, it could definitely work.

Also, I found excellent pre-trained models on TF Hub, but they are not fine-tunable (according to the page): Transformer-XL pre-trained on Wiki40B (a new dataset in 40 languages), with a separate model for each language. At least for me this would be the ultimate model: seq2seq, unlimited sequence length, and 41 languages. See https://tfhub.dev/google/collections/wiki40b-lm/1


Yes, it's for translation.
Try it out! It's much smaller/faster than BART, and we have 1100 language pairs:
https://huggingface.co/models?search=Helsinki-NLP

I’ve never actually finetuned it so let me know if there are any bugs!
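
For a quick start, basic usage is something like this (en-ro as an example pair; any of the Helsinki-NLP checkpoints should work the same way):

```python
# Minimal Marian translation sketch with one of the Helsinki-NLP checkpoints.
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-ro"  # example pair; swap in the pair you need
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tok(["The meeting was postponed until next week."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tok.batch_decode(generated, skip_special_tokens=True))
```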

I noticed there is a de-de pair, which is exactly what I need, but I wonder who else needs this and what for. Looks like somebody already tried summarization in German?

link to model?

https://huggingface.co/Helsinki-NLP/opus-mt-de-de

I believe that is an accident, not a summarization model.

According to Jörg Tiedemann, the author, it's a paraphraser rather than a summarizer. He writes:

There are texts with alternative translations into the same language, which I used for training intralingual models like this one. They are maybe not very useful at this moment as they probably just copy the input text. That this only obtains 40 BLEU is not that strange as this is tested with paraphrased sentences. Note also that this model is really rather a paraphrase model than a summarisation model as they seem to use it in the discussion