I have tokens encoded with one tokenizer that I want to feed to an LM that uses a different tokenizer. The problem seems natural enough, so maybe there already exists some convenient, unified solution? Here's what I use right now:
def translate_to_other_tokenizer(ids, tokenizer_from, tokenizer_to):
    # Decode the ids back to plain text with the source tokenizer...
    text = tokenizer_from.batch_decode(
        ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    # ...then re-encode that text with the target tokenizer.
    output = tokenizer_to(
        text,
        truncation=False,
        padding=True,
        return_attention_mask=True,
        return_special_tokens_mask=True,
        return_tensors='pt',
    )
    return text, output
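For what it's worth, the decode-then-re-encode round trip above is the standard general-purpose approach, since two tokenizers only share text as a common representation, not vocabularies. Here is a dependency-free toy sketch of the same idea, using two made-up tokenizer classes (`WordTokenizer` and `CharTokenizer` are hypothetical illustrations, not part of `transformers`) to show that the text survives the round trip while the ids and sequence lengths do not line up:

```python
class WordTokenizer:
    """Hypothetical word-level tokenizer with a fixed vocab (token -> id)."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: t for t, i in vocab.items()}

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


class CharTokenizer:
    """Hypothetical character-level tokenizer: id = Unicode code point."""
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def translate(ids, tok_from, tok_to):
    # Same shape as the function above: decode to text, re-encode.
    text = tok_from.decode(ids)
    return text, tok_to.encode(text)


word_tok = WordTokenizer({"hello": 0, "world": 1})
char_tok = CharTokenizer()

text, new_ids = translate([0, 1], word_tok, char_tok)
print(text)          # "hello world" — text is recovered exactly
print(len(new_ids))  # 11 — but the token count changed from 2 to 11
```

The main caveats are the same ones your real function has: special tokens and any normalization (`clean_up_tokenization_spaces`, lowercasing, etc.) can make the round trip lossy, so the mapping between old and new token positions is generally not recoverable.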