[MarianTokenizer] Clarify the use of the vocab parameter

Context: I am training the MarianTokenizer for a new set of languages. The documentation describes all the required parameters very clearly, except for “vocab”. Looking at tokenization_marian.py hasn’t helped.

a. What is this vocab parameter? [I can already use sentencepiece’s native library to generate the required serialized models for both languages and pass them to source_spm and target_spm respectively.]

I cannot understand how this vocab is supposed to be generated. [The vocab files created by the sentencepiece module aren’t JSON, so I am not sure whether this has to be built using some other interface/script.]

b. Why is there only a single parameter for the vocab?
Looking at the vocab initialized in the source code, this file is similar to the vocab + merges generated while training with any one of the tokenizer trainers. Shouldn’t there be a provision to specify two vocab files, one for each of the spiece models supplied to the tokenizer?

Reproducibility Info:
Transformers - 4.12.2
Tokenizers - 0.10.3

Thanks a lot for any help!

Hi, have you solved this problem?

Hey, I was able to successfully train a model instantiated with a vocab generated by the following procedure (I arrived at this by trial and error, but I assume it is more or less correct since the resulting model reaches the expected performance/metrics):

  1. Create the spm models that will be used for tokenization (using native sentencepiece).
  2. Run the extractor script (sentencepiece_extractor.py in the huggingface/tokenizers GitHub repository, main branch) on each serialized spm model. This creates the vocab files required by the Hugging Face interface, separately for each language.
  3. Generate a concatenated vocab file that will be used to instantiate the tokenizer. Before concatenating, omit the merges from both files and remove the header that the JSON will have with the meta info about the normalizer, pre-normalizer, decoder, etc. Make sure to edit the start_idx to match the len of your source_vocab.

Pass this vocab to the tokenizer when instantiating it.
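In case it helps, step 3 can be sketched roughly as below. This is only a minimal sketch, assuming each extracted vocab has already been reduced to a plain token -> id dict (header and merges removed). The helper names merge_vocabs and save_vocab are hypothetical, not part of transformers or tokenizers. Keeping a single id for tokens that occur in both vocabs, and appending missing special tokens at the end, are my own assumptions; the procedure above does not cover either case.

```python
import json

# Default unk/eos/pad token strings of MarianTokenizer. Appending any that
# are missing is an assumption of this sketch, not part of the procedure.
SPECIAL_TOKENS = ("<unk>", "</s>", "<pad>")


def merge_vocabs(source_vocab, target_vocab, special_tokens=SPECIAL_TOKENS):
    """Merge two token -> id dicts into one joint token -> id dict.

    Hypothetical helper: source tokens keep their relative order, and
    target-only tokens get ids that continue after the source vocab.
    """
    merged = {}
    # Source tokens first, in ascending id order, renumbered from 0.
    for token in sorted(source_vocab, key=source_vocab.get):
        merged[token] = len(merged)
    # Target tokens next; a token already seen keeps its source id
    # (assumption), so no id is assigned twice.
    for token in sorted(target_vocab, key=target_vocab.get):
        if token not in merged:
            merged[token] = len(merged)
    # Make sure the special tokens exist somewhere in the joint vocab.
    for token in special_tokens:
        if token not in merged:
            merged[token] = len(merged)
    return merged


def save_vocab(vocab, path):
    """Write the joint vocab as a JSON file (token -> id)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
```

Usage would be something like save_vocab(merge_vocabs(source_vocab, target_vocab), "vocab.json"), with the resulting file passed as vocab next to source_spm and target_spm. If the two vocabs share no tokens, this reduces to plain concatenation with the target ids starting at len(source_vocab).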
P.S.: This whole procedure raises some definite design-choice questions (point b of my original question, for example). I am still trying to figure out the answers to those.