Marian: Language Discovery questions

Some useful resources:

Search the ISO 639 code tables to check that a multilingual model covers what you think it does:
https://iso639-3.sil.org/code_tables/639/data?title=aav&field_iso639_cd_st_mmbrshp_639_1_tid=All&name_3=&field_iso639_element_scope_tid=All&field_iso639_language_type_tid=All&items_per_page=200

Otherwise feel free to ask something like “Do you have English to German?” Yes, en-de!

This gist lists all newly added three-letter codes and their constituent languages (often other three-letter codes): https://gist.github.com/sshleifer/e79fbbabe0fab3da519fd39edffee4d2

If you know your language’s ISO 639-3 code, you can cmd-f for it in that file to see which models support it.

Backtranslation Snippet


import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
mname_fwd = 'Helsinki-NLP/opus-mt-en-ceb'  #ceb=cebuano https://en.wikipedia.org/wiki/Cebuano_language
mname_bwd = 'Helsinki-NLP/opus-mt-ceb-en'
src_text = ['I am a small frog with tiny legs.']
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
fwd = AutoModelForSeq2SeqLM.from_pretrained(mname_fwd).to(torch_device)
fwd_tok = AutoTokenizer.from_pretrained(mname_fwd)
bwd_tok = AutoTokenizer.from_pretrained(mname_bwd)
bwd = AutoModelForSeq2SeqLM.from_pretrained(mname_bwd).to(torch_device)
if torch_device == 'cuda':
    fwd = fwd.half()
    bwd = bwd.half()

fwd_batch = fwd_tok(src_text, return_tensors='pt').to(torch_device)
translated = fwd.generate(**fwd_batch)
translated_txt = fwd_tok.batch_decode(translated, skip_special_tokens=True)
bwd_batch = bwd_tok(translated_txt, return_tensors='pt').to(torch_device)
backtranslated = bwd.generate(**bwd_batch)
result = bwd_tok.batch_decode(backtranslated, skip_special_tokens=True)
# ['I am a small toad with small feet.']

Wow! This looks great! I have one doubt: in my case translation plus backtranslation takes about 1 second. I have tried several models and different example translation scripts and it is still slow. I have an RTX 2080 Ti. Am I missing something?

(1) Try bigger batches.
(2) Feel free to send your code, and don’t count any of the lines before fwd_batch in your timing (model and tokenizer loading dominate).
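As an illustration of point (1), here is a minimal batching helper; the chunk function and the batch size are illustrative, not part of transformers or the thread:

```python
def chunk(texts, batch_size):
    """Yield successive slices of texts, each at most batch_size items long."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Each slice can then be tokenized and translated in a single generate() call, e.g.:
#   for batch in chunk(src_text, 32):
#       fwd_batch = fwd_tok(batch, return_tensors='pt', padding=True).to(torch_device)
#       translated = fwd.generate(**fwd_batch)
batches = list(chunk(['a', 'b', 'c', 'd', 'e'], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Larger slices amortize the fixed per-call overhead of generate() across more samples, which is where the speedup comes from.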

Hi, thanks for the quick answer. Here is the code:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

from timeit import default_timer as timer

src_text = ['Senior Level Software Engineer', 'Instrumental Software Technologies', 'Saratoga Springs, NY']

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

print(torch_device)

mname_fwd = 'Helsinki-NLP/opus-mt-en-es'

fwd = AutoModelForSeq2SeqLM.from_pretrained(mname_fwd).to(torch_device)
fwd_tok = AutoTokenizer.from_pretrained(mname_fwd)

if torch_device == 'cuda':
    fwd = fwd.half()

start = timer()

fwd_batch = fwd_tok(src_text, return_tensors='pt', padding=True, truncation=True).to(torch_device)
translated = fwd.generate(**fwd_batch)
translated_txt = fwd_tok.batch_decode(translated, skip_special_tokens=True)

end = timer()

print(translated_txt)

print(end - start)

cuda
['Ingeniero de software de nivel superior', 'Tecnologías de software instrumentales', 'Saratoga Springs, NY']
0.5867295530042611

587 ms / 3 samples ≈ 196 ms/sample.
If you pass in a src_text of 128 entries, you should get a similar total runtime and therefore a much lower ms/sample.
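To make the arithmetic explicit, a throwaway helper (the function name and the 128-item figure are just an illustration of the claim above, assuming total runtime stays roughly constant as the batch grows):

```python
def ms_per_sample(total_seconds, n_samples):
    """Average per-sample latency in milliseconds."""
    return total_seconds * 1000.0 / n_samples

print(round(ms_per_sample(0.587, 3)))    # ~196 ms/sample, matching the measurement above
print(round(ms_per_sample(0.587, 128)))  # ~5 ms/sample if a 128-item batch takes a similar total time
```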