Best solution for training a tokenizer and MLM from scratch

Stefano · December 6, 2021, 4:13pm

Hi guys,

I searched through several discussions but could not find a clear answer.
I’d like to train an MLM model (RoBERTa) from scratch. In the original paper, the authors concatenate full sentences on the same line, up to 512 tokens. This means they split the text of the documents into sentences, but I am not sure which tool is best for this.
Moreover, there are many different approaches on the web to training an MLM from scratch.
Some people put a whole document on a single line; others split a document into sentences and put one sentence on each line.
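
To make the "full sentences up to 512 tokens" idea concrete, here is a minimal sketch of how I understand the packing step: consecutive sentences are joined greedily until the next one would exceed the token budget. The names `pack_sentences` and `whitespace_count` are hypothetical helpers of mine, and the whitespace counter is only a stand-in for a real tokenizer, so this is an illustration under those assumptions and not the paper's actual preprocessing code.

```python
from typing import Callable, Iterable, List


def pack_sentences(
    sentences: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily join consecutive sentences into lines of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept on its own line and
    would still need truncation later (assumption: handled downstream).
    """
    lines: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new line when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            lines.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        lines.append(" ".join(current))
    return lines


def whitespace_count(text: str) -> int:
    # Stand-in token counter; a trained subword tokenizer would go here.
    return len(text.split())


doc = ["The cat sat.", "It was warm.", "Then it slept for hours."]
packed = pack_sentences(doc, whitespace_count, max_tokens=6)
for line in packed:
    print(line)
```

With a small budget of 6 "tokens", the first two sentences share a line and the third starts a new one. With the real setup, `count_tokens` would wrap the trained tokenizer and `max_tokens` would be 512.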

So my questions are:

1. What is the best way to train the tokenizer and the model from scratch: split a document into lines, or keep the whole document on one line?
2. In your opinion, what is the best way to split a document into sentences (for different languages)?

Thanks
