Hello,
Context: I am training the MarianTokenizer for a new set of languages. The documentation describes all the required parameters very clearly, except one: "vocab". Looking at tokenization_marian.py hasn't helped either.
Question
a. What is this vocab parameter? [While I can use sentencepiece's native library to generate the required serialized models for both languages and pass them to source_spm and target_spm respectively,] I cannot understand how this vocab is to be generated. [The vocab files created by the sentencepiece module are not JSON, so I am not sure whether this file has to be produced with some other interface/script.] (See the sketch below for how I expect to call the tokenizer.)
b. Why is there only a single parameter for this vocab?
Looking at the vocab initialized in the source code, this file is similar to the vocab + merges generated when training with any one of the tokenizer trainers. Shouldn't there be a provision to specify two vocab files, one for each of the SentencePiece models supplied to the tokenizer?
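For reference, this is roughly how I expect to instantiate it once I understand what the vocab file should contain (a minimal sketch based on my reading of tokenization_marian.py; the paths and language codes are just placeholders):

```python
from transformers import MarianTokenizer

# Placeholder paths; "vocab.json" is the file this question is about.
tokenizer = MarianTokenizer(
    vocab="vocab.json",       # shared token-to-id mapping (the mysterious "vocab")
    source_spm="source.spm",  # serialized SentencePiece model for the source language
    target_spm="target.spm",  # serialized SentencePiece model for the target language
    source_lang="src",
    target_lang="tgt",
)
print(tokenizer("A test sentence.").input_ids)
```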
Reproducibility Info:
Transformers - 4.12.2
Tokenizers - 0.10.3
Thanks a lot for any help!
Hi, have you solved this problem?
Hey, I was able to successfully train a model instantiated with a vocab generated by the following procedure (I arrived at this by trial and error, but I assume it is more or less correct since I am getting a model with the expected performance/metrics):
- Create the spm models that will be used for tokenization (using native sentencepiece)
- Run the sentencepiece_extractor.py script (found at tokenizers/sentencepiece_extractor.py on the main branch of huggingface/tokenizers on GitHub). This creates, from the serialized spm models, the vocab files for both languages separately, in the form required by the Hugging Face interface.
- Generate a concatenated vocab file that will be used to instantiate the tokenizer (sketched below). Make sure to edit the start_idx to match the length of your source vocab, and also remove the header that the JSON will have with the meta info about the normalizer, pre-tokenizer, decoder, etc. Omit the merges from both files before doing so.
- Pass this vocab to the tokenizer instantiation.
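In code, the concatenation step looked roughly like this (a rough sketch, assuming the two extracted vocabs have already been stripped down to plain token-to-id JSON with the merges and the metadata header removed; how to treat tokens that appear in both vocabs was my own choice, and the file names are placeholders):

```python
import json

# Per-language vocabs extracted from the spm models, already reduced to {token: id}.
with open("source_vocab.json", encoding="utf-8") as f:
    source_vocab = json.load(f)
with open("target_vocab.json", encoding="utf-8") as f:
    target_vocab = json.load(f)

merged = dict(source_vocab)    # source tokens keep their original IDs
start_idx = len(source_vocab)  # target IDs continue right after the source vocab

# Append the target tokens in their original order; pieces that already exist
# in the source vocab simply keep the source ID.
for token in sorted(target_vocab, key=target_vocab.get):
    if token not in merged:
        merged[token] = start_idx
        start_idx += 1

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)
```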
P.S.: There are some definite design-choice questions that arise from this entire procedure (point b of my original question, for example); I am still trying to figure out an answer to those.
Thank you for these most useful clarifications.
I still cannot understand how you generate the concatenated vocab file: MarianMT models have a single vocab.json file for both the source and target languages (this was one of the original issues you pointed to), yet some tokens may be identical in the source and target languages (this almost always happens if the languages share an alphabet or writing system).
So I have a couple questions:
- How do you manage to have a vocab.json smaller than the sum of the source and target file entries? I assume you specify the vocab_size when initializing MarianTokenizer by setting it to the number of entries in the JSON file.
- IDs start from zero for both the source and target languages, yet the vocab.json of course has a single set of IDs starting from zero. This implies that at least the target language ends up with IDs different from those assigned by sentencepiece (and shared entries use the source ID for both). Is that an issue, or do the IDs not need to match?
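To make the second point concrete, this is the kind of mismatch I mean (a hypothetical check; the file paths and the piece are placeholders):

```python
import sentencepiece as spm
from transformers import MarianTokenizer

# ID of a target-language piece inside its own SentencePiece model...
sp_target = spm.SentencePieceProcessor(model_file="target.spm")
piece = "▁hello"  # any piece known to the target model
print(sp_target.piece_to_id(piece))

# ...versus the ID the same piece gets from the merged vocab.json.
tokenizer = MarianTokenizer(vocab="vocab.json", source_spm="source.spm", target_spm="target.spm")
print(tokenizer.convert_tokens_to_ids(piece))
```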
Thank you again,
Giuliano