Best solution for training a tokenizer and MLM from scratch

Hi guys,

I searched through different discussions but wasn't able to find a clear answer.
I'd like to train an MLM model (RoBERTa) from scratch. In the original paper, the authors concatenate full sentences on the same line, up to 512 tokens. This means they split the documents' text into sentences, but I'm not sure what the best tool is to do this.
Moreover, on the web there are thousands of different approaches to training an MLM from scratch.
Some people put an entire document on a single line; others split a document into sentences and put one sentence per line.
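For context, the packing scheme described in the RoBERTa paper (the FULL-SENTENCES setting) can be sketched as a greedy loop over sentences. This is just an illustration, assuming a hypothetical `count_tokens` helper; in practice you would count tokens with your trained tokenizer instead of whitespace splitting:

```python
def count_tokens(sentence: str) -> int:
    # Placeholder: stands in for len(tokenizer.encode(sentence))
    return len(sentence.split())

def pack_sentences(sentences, max_tokens=512):
    """Greedily concatenate consecutive sentences into chunks of at most
    max_tokens tokens, one chunk per training line (FULL-SENTENCES style)."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

(The real setting also allows chunks to cross document boundaries with a separator token, which this sketch omits.)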

So my questions are:

  • What is the best way to prepare data for training the tokenizer and the model from scratch: split each document into one sentence per line, or keep the whole document on one line?
  • What is, in your opinion, the best tool to split a document into sentences (for different languages)?
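On the second point, dedicated multilingual tools (e.g. spaCy's sentencizer, NLTK's punkt models, or stanza) handle abbreviations and language-specific punctuation properly. Purely as a baseline to compare against, here is a naive regex splitter; the boundary pattern is my own simplistic assumption and will mishandle abbreviations like "Dr.":

```python
import re

# Naive boundary: ., !, or ? followed by whitespace and an uppercase
# (possibly accented Latin) letter. Illustrative only.
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÀ-Þ])")

def split_sentences(text: str):
    """Split text into sentences on the naive boundary pattern above."""
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]
```

A trained, language-aware model will beat this on anything but clean prose, which is why the choice of splitter matters for multilingual corpora.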

Thanks