Preprocessing required for fine-tuning RoBERTa

I am fine-tuning a QA model in Hindi using a trained RoBERTa LM, and I need to preprocess the dataset for RoBERTa. What steps do I need to take before I feed the input to the model? One preprocessing script for English is given here, but I am not sure whether other languages behave the same way. For example, the linked notebook adds an extra " " character before the start token. Is this necessary for RoBERTa? What other nuances have to be taken care of?
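To make the whitespace question concrete, this is the behaviour I am asking about (using "roberta-base" only as a stand-in for my Hindi checkpoint):

```python
from transformers import AutoTokenizer

# "roberta-base" is only a stand-in here; I would load my Hindi checkpoint instead.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# RoBERTa's byte-level BPE is whitespace-sensitive: the same word can produce
# different tokens depending on whether it is preceded by a space.
print(tokenizer.tokenize("answer"))   # word at the very start of a string
print(tokenizer.tokenize(" answer"))  # the same word as it appears mid-sentence
```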
I am only concerned about spans at the token level, not the character level, as explained in the link mentioned above.
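For reference, this is roughly how I am mapping the character-level answer offsets to token-level labels with a fast tokenizer (the checkpoint name and example text are just placeholders, my real data is Hindi QA):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and example text.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

question = "Who wrote the book?"
context = "The book was written by Premchand in 1936."
answer_text = "Premchand"
answer_start = context.index(answer_text)       # character-level start
answer_end = answer_start + len(answer_text)    # character-level end (exclusive)

# Encode the (question, context) pair; offset_mapping gives, for every token,
# its character span inside the original string it came from.
enc = tokenizer(question, context, return_offsets_mapping=True)

start_token, end_token = None, None
for idx, (offsets, seq_id) in enumerate(zip(enc["offset_mapping"], enc.sequence_ids())):
    if seq_id != 1:          # skip special tokens and the question sequence
        continue
    start_char, end_char = offsets
    if start_char <= answer_start < end_char:
        start_token = idx
    if start_char < answer_end <= end_char:
        end_token = idx

print(start_token, end_token)   # token-level labels for the QA head
```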
Thank you.

Hi @kushalj001,
Maybe you can consider XLM-RoBERTa (XLM-R) instead of RoBERTa?

XLM-R supports 100 languages out of the box, and its tokenizer can tokenize most languages in the world.

This Kaggle notebook shows how to fine-tune on 3 languages simultaneously with XLM-R using TF2 + TPU, which is extremely efficient (about 10x faster than a P100):
https://www.kaggle.com/riblidezso/finetune-xlm-roberta-on-jigsaw-test-data-with-mlm
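As a quick sanity check (a minimal PyTorch sketch, not taken from the notebook above, which uses TF2), the standard Hub checkpoint already accepts Hindi text directly:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Standard Hub identifier; the QA head on top is randomly initialized until fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

# A Hindi (question, context) pair handled by the same multilingual tokenizer.
inputs = tokenizer("किताब किसने लिखी?", "यह किताब प्रेमचंद ने लिखी थी।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.start_logits.shape, outputs.end_logits.shape)
```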
