Default mBERT tokenizes a sentence as ['[CLS]', 'This', 'is', 'a', 'sample', 'sentence', '[SEP]']
. I want to change this behaviour and add a language specific token after the CLS
token like this: ['[CLS]', '__en__', 'This', 'is', 'a', 'sample', 'sentence', '[SEP]']
I know TemplateProcessing
can be used to achieve this if the language token doesn’t change
from tokenizers.processors import TemplateProcessing
tokenizer._tokenizer.post_processor = TemplateProcessing(
single=f"{_lang_token} $A [SEP]",
pair=f"{_lang_token} $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[SEP]", tokenizer.convert_tokens_to_ids("[SEP]")),
(_lang_token, tokenizer.convert_tokens_to_ids(_lang_token))],
)
But in my case, the language token changes with every batch. What is the best way to add these tokens? Creating TemplateProcessing
objects every time seems inefficient.