Number of words


I am using the ‘Helsinki-NLP/opus-mt-en-sla’ model, but I cannot find a pattern in how many words it can translate. Can I see this setting somewhere, and how can I change the number of words?

Thank you

Hi Katarina,

This page might help: config.json · Helsinki-NLP/opus-mt-en-sla at main

I think this says that the maximum number of tokens, max_length, for this model is 512.
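If it is easier to check this locally than in the browser, here is a minimal sketch that reads the length-related entries out of a `config.json` file. The helper names `read_length_limits` and `download_config` are made up for illustration, and `max_position_embeddings` is only my assumption about another key worth looking at; any key missing from the file comes back as `None`.

```python
import json


def read_length_limits(config_path):
    """Return the length-related entries of a model's config.json.

    Keys that are absent from the file are reported as None.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    keys = ("max_length", "max_position_embeddings")
    return {key: config.get(key) for key in keys}


def download_config(repo_id="Helsinki-NLP/opus-mt-en-sla"):
    """Fetch config.json from the Hub and return the local file path.

    Imported lazily: needs `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename="config.json")
```

With network access, `read_length_limits(download_config())` should show whatever limits the model's config file declares.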

512 tokens might correspond to about 2500 characters (~letters), which might correspond to about 400 words. This is a very rough approximation, and different texts will have different conversion values.

If you want to know more about tokens, there’s a nice introduction to BERT tokens by Chris McCormick: BERT Word Embeddings Tutorial · Chris McCormick (I imagine that the Marian model uses something similar).


These models seem to use SentencePiece (different from BERT, which uses WordPiece) and are therefore less tied to actual word boundaries. (For SentencePiece, a space is just like any other character, not a word boundary.) There is no way to know in advance how many characters or words 512 subword tokens will cover. As you say, the approximation is incredibly rough.
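Since estimating is so unreliable, it may be simpler to measure the token count for a specific text. Below is a small sketch of that idea. The helper names (`count_tokens`, `fits_in_model`, `load_marian_tokenizer`) are hypothetical, and the default of 512 is just the value discussed above, not something I have verified for every model.

```python
def count_tokens(text, tokenizer):
    """Number of subword tokens the tokenizer produces for `text`.

    Whatever special tokens the tokenizer appends are included in the count.
    """
    return len(tokenizer(text)["input_ids"])


def fits_in_model(text, tokenizer, max_tokens=512):
    """True if `text` tokenizes to at most `max_tokens` tokens."""
    return count_tokens(text, tokenizer) <= max_tokens


def load_marian_tokenizer(model_name="Helsinki-NLP/opus-mt-en-sla"):
    """Load the model's tokenizer from the Hub.

    Imported lazily: needs `pip install transformers sentencepiece`
    and network access the first time it is called.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)
```

For example, `count_tokens(my_text, load_marian_tokenizer())` gives the real count for `my_text`, which can then be compared with character and word counts for your own documents.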


OK, thanks Bram, I haven’t looked at sentencepiece tokenizing.
My wordpiece estimates were based on samples from my specific set of texts.

Katarina: this page of the Hugging Face documentation discusses the different kinds of tokenizers: Summary of the tokenizers — transformers 4.3.0 documentation. This blog post also looks like it might be useful: Tokenizers: How machines read


Thank you Rachael and Bram, I will check the proposed links. As far as I have noticed, this model translates around 800 characters (letters plus spaces), which is around 100 words, though it can always be a little more or less. So you think it is not possible to set a higher limit? I need this model to translate some documentation, and it is not so useful if I have to cut the text into small pieces.

What you typically do is automatically split the text into sentences and then translate the sentences one by one (or as many as the model allows) in batches. You can use Stanza's multilingual models to do this. Here’s an example: Tokenization & Sentence Segmentation - Stanza
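A rough sketch of that workflow, under several assumptions: the function names are made up for illustration, the regex splitter is only a crude stand-in for Stanza, the batch size of 8 is arbitrary, and the `>>rus<<` prefix assumes the model expects a target-language token at the start of each sentence (please check the model card for the exact codes).

```python
import re


def naive_split_sentences(text):
    """Crude stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def stanza_split_sentences(text, lang="en"):
    """Sentence segmentation with Stanza (needs `pip install stanza` and
    `stanza.download(lang)` once). Builds a new pipeline on every call,
    so reuse the pipeline yourself if you process many documents."""
    import stanza

    nlp = stanza.Pipeline(lang=lang, processors="tokenize")
    return [sentence.text for sentence in nlp(text).sentences]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_document(text, translate_batch, split=naive_split_sentences, batch_size=8):
    """Split `text` into sentences, translate them batch by batch, rejoin with spaces."""
    translated = []
    for batch in chunked(split(text), batch_size):
        translated.extend(translate_batch(batch))
    return " ".join(translated)


def make_marian_translator(model_name="Helsinki-NLP/opus-mt-en-sla", target_prefix=">>rus<< "):
    """Return a translate_batch function backed by a Marian model.

    Imported lazily: needs transformers, sentencepiece, torch and network access.
    """
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(sentences):
        inputs = tokenizer(
            [target_prefix + sentence for sentence in sentences],
            return_tensors="pt", padding=True, truncation=True,
        )
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

Something like `translate_document(text, make_marian_translator(), split=stanza_split_sentences)` would then handle a document of any length, because each individual sentence stays well below the token limit.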


Thank you Bram!