Adding a special language token to MBART

Is there a clean way to add a new language to MBART’s tokenizer? The implementation (this is from `MBartTokenizer.__init__` in transformers) looks quite intricate, so adding a new language code does not seem trivial:

        self.sp_model_size = len(self.sp_model)
        self.lang_code_to_id = {
            code: self.sp_model_size + i + self.fairseq_offset for i, code in enumerate(FAIRSEQ_LANGUAGE_CODES)
        }
        self.id_to_lang_code = {v: k for k, v in self.lang_code_to_id.items()}
        self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) + len(self.lang_code_to_id) + self.fairseq_offset

        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}
        self._additional_special_tokens = list(self.lang_code_to_id.keys())

        if additional_special_tokens is not None:
            # Only add those special tokens if they are not already there.
            self._additional_special_tokens.extend(
                [t for t in additional_special_tokens if t not in self._additional_special_tokens]
            )

        self._src_lang = src_lang if src_lang is not None else "en_XX"
        self.cur_lang_code_id = self.lang_code_to_id[self._src_lang]
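To make sure I understand the snippet, here is the id layout restated with plain dicts and toy numbers (the real `fairseq_offset` is 1; the `sp_model_size` and the truncated code list here are made up for illustration):

```python
# Toy reconstruction of MBART's vocab layout: fairseq reserves the first
# ids (the offset), then come the SentencePiece pieces, then the language
# codes, then <mask> at the very end.
FAIRSEQ_LANGUAGE_CODES = ["ar_AR", "cs_CZ", "en_XX"]  # truncated list
fairseq_offset = 1    # value used in the HF implementation
sp_model_size = 100   # toy stand-in for len(self.sp_model)

lang_code_to_id = {
    code: sp_model_size + i + fairseq_offset
    for i, code in enumerate(FAIRSEQ_LANGUAGE_CODES)
}
id_to_lang_code = {v: k for k, v in lang_code_to_id.items()}
mask_id = sp_model_size + len(lang_code_to_id) + fairseq_offset
```

If I read this right, the language-code block sits *between* the SentencePiece pieces and `<mask>`, so naively appending a new code to `FAIRSEQ_LANGUAGE_CODES` would shift the id of `<mask>` and break alignment with a pretrained checkpoint's embedding table. That is part of what makes this feel non-straightforward.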

Even with subclassing, it is not immediately clear to me whether and how one could add a custom language code that is correctly recognized when tokenizing targets (e.g. `tokenizer(text_target=...)`) or in other target-language-related handling. Any tips?
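For concreteness, the workaround I am currently experimenting with looks roughly like this. The helper name `register_language_code` is mine, and `_StubTokenizer` is a minimal stand-in so the sketch runs without downloading a checkpoint; I have no idea yet whether patching these lookup tables is actually sufficient for a real `MBartTokenizer`:

```python
def register_language_code(tokenizer, code):
    """Append `code` as a new special token and wire it into MBART's
    language-code lookup tables. Assumes `tokenizer` exposes the same
    members as transformers' MBartTokenizer."""
    tokenizer.add_special_tokens({"additional_special_tokens": [code]})
    new_id = tokenizer.convert_tokens_to_ids(code)
    tokenizer.lang_code_to_id[code] = new_id
    tokenizer.id_to_lang_code[new_id] = code
    return new_id


class _StubTokenizer:
    """Minimal stand-in for MBartTokenizer (toy ids), only here so the
    sketch is self-contained."""
    def __init__(self):
        self._vocab = {"en_XX": 250001}
        self.lang_code_to_id = {"en_XX": 250001}
        self.id_to_lang_code = {250001: "en_XX"}
        self._next_id = 250027  # first unused id in this toy vocab

    def add_special_tokens(self, mapping):
        for token in mapping["additional_special_tokens"]:
            if token not in self._vocab:
                self._vocab[token] = self._next_id
                self._next_id += 1

    def convert_tokens_to_ids(self, token):
        return self._vocab[token]


tok = _StubTokenizer()
new_id = register_language_code(tok, "xx_XX")

# Intended real usage (untested against an actual checkpoint):
# tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
# register_language_code(tok, "xx_XX")
# tok.tgt_lang = "xx_XX"
```

Note that `add_special_tokens` appends the new token at the end of the vocabulary, so the model's embedding matrix would also need resizing (`model.resize_token_embeddings(len(tokenizer))`), and the new code's embedding starts out untrained.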