I load the NLLB tokenizer with a new vocabulary after adding some tokens to the SentencePiece model.
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M', vocab_file=NEW_SPM_NAME)
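(For context, "adding some tokens" means I extended the original SentencePiece model before loading it. The sketch below shows roughly how; the file names and the added_tokens list are placeholders rather than my exact script, and it needs the protobuf package installed.)

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the original NLLB SentencePiece model (placeholder path).
m = sp_pb2.ModelProto()
with open('original_sentencepiece.bpe.model', 'rb') as f:
    m.ParseFromString(f.read())

# Append each new token as an extra piece with a neutral score.
for tok in added_tokens:  # added_tokens: placeholder list of new tokens
    new_piece = sp_pb2.ModelProto.SentencePiece()
    new_piece.piece = tok
    new_piece.score = 0.0
    m.pieces.append(new_piece)

# Write out the extended model that is passed as vocab_file above.
with open(NEW_SPM_NAME, 'wb') as f:
    f.write(m.SerializeToString())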
However, I now have a problem: several of the language-code tokens are duplicated.
print(tokenizer.convert_ids_to_tokens([256001]))
['ace_Arab']
print(tokenizer.convert_ids_to_tokens([270130]))
['ace_Arab']
In addition, len(tokenizer) and tokenizer.vocab_size are different:
print(len(tokenizer))
print(tokenizer.vocab_size)
270130
270333
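I am not sure exactly how len(tokenizer) and vocab_size are computed here, but the sizes of the underlying pieces can be inspected like this (attribute names as on the slow NllbTokenizer in my transformers version; they may differ in others):

print(len(tokenizer.sp_model))              # pieces in the new SentencePiece model
print(len(tokenizer.lang_code_to_id))       # language codes appended after the spm vocab
print(len(tokenizer.added_tokens_encoder))  # entries in the added-tokens table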
It looks like tokenizer.added_tokens_encoder still holds the old mapping from before the new vocabulary was loaded:
print(tokenizer.added_tokens_encoder)
{'<s>': 0,
'<pad>': 1,
'</s>': 2,
'<unk>': 3,
'ace_Arab': 256001,
'ace_Latn': 256002,
'acm_Arab': 256003,
'acq_Arab': 256004,
'aeb_Arab': 256005,
'afr_Latn': 256006,
'ajp_Arab': 256007,
'aka_Latn': 256008,
'amh_Ethi': 256009,
'apc_Arab': 256010,
'arb_Arab': 256011,
'ars_Arab': 256012,
'ary_Arab': 256013,
'arz_Arab': 256014,
'asm_Beng': 256015,
....
print(tokenizer.fairseq_tokens_to_ids)
{'<s>': 0,
'<pad>': 1,
'</s>': 2,
'<unk>': 3,
'<mask>': 270332,
'ace_Arab': 270130,
'ace_Latn': 270131,
'acm_Arab': 270132,
'acq_Arab': 270133,
'aeb_Arab': 270134,
'afr_Latn': 270135,
'ajp_Arab': 270136,
'aka_Latn': 270137,
'amh_Ethi': 270138,
'apc_Arab': 270139,
'arb_Arab': 270140,
'ars_Arab': 270141,
....
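The same pattern seems to hold for the other language codes; this loop (just a diagnostic, using the two tables printed above) lists every token that ends up with two different ids:

for tok, old_id in tokenizer.added_tokens_encoder.items():
    new_id = tokenizer.fairseq_tokens_to_ids.get(tok)
    if new_id is not None and new_id != old_id:
        print(tok, old_id, new_id)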
How can I solve this?