Pre-Training From Scratch


I am currently pre-training a RoBERTa model from scratch, following the tutorial here: Google Colab, and I have some questions. Some of them might not make much sense, since I am relatively new to the area and don’t yet have a strong theoretical or practical background in transformers. Apologies in advance if there are too many.

  1. I see that the data used in this tutorial is separated line by line, and that the train and test data are passed to LineByLineTextDataset before training. What is LineByLineTextDataset for? I couldn’t find an explanation of it in the Hugging Face documentation. I have guesses, but I would like to understand it better. The data in the tutorial is already split line by line, so what would be lost if it were not passed through LineByLineTextDataset before training?

  2. The data I will use for pre-training is not split line by line; each example has multiple sentences. So I assume I shouldn’t use LineByLineTextDataset. Is there something else I can use instead? Or would it be wrong to join the whole dataset into one very long text, split it into sentences, and then use LineByLineTextDataset?

  3. This one is a bit more technical. The RoBERTa paper says that 160 GB of data was used for pre-training. If I pre-train the model on 10-20 GB of data, would the performance of the pre-trained model drop drastically, or is it fine to use the RoBERTa architecture as long as the training arguments are set appropriately for this dataset?

  4. I see a block_size parameter in LineByLineTextDataset. What is the difference between block_size and max_len?

  5. Is there a parameter we need to set to get a cased or uncased model during pre-training? Or is that determined automatically by the format of the input text? (For example, if we don’t lowercase the text, the model will be cased, and if we do, it will be uncased. Is that how it works?)

  6. I see that ByteLevelBPETokenizer is used in the tutorial I linked at the beginning. Should we add a post-processor (like the one below) to this tokenizer so that the model can later be fine-tuned on a task that requires pairs of text (e.g. question answering)?

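from tokenizers.processors import TemplateProcessing

# `tokenizer` is the ByteLevelBPETokenizer trained in the tutorial
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A $B:1 </s>:1",
    special_tokens=[
        ("<s>", tokenizer.token_to_id("<s>")),
        ("</s>", tokenizer.token_to_id("</s>")),
    ],
)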
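
To make questions 1 and 4 more concrete, here is a minimal sketch of what a line-by-line dataset with a block_size limit appears to do: each non-empty line becomes one training example, and any example longer than block_size tokens is cut off. This is an illustration under assumptions, not the library's code: toy_tokenize and line_by_line_examples are hypothetical names, whitespace splitting stands in for the real subword tokenizer, and special tokens are ignored.

```python
from typing import List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: splits on whitespace."""
    return text.split()


def line_by_line_examples(raw_text: str, block_size: int) -> List[List[str]]:
    """Turn each non-empty line into one example, truncated to block_size tokens."""
    examples = []
    for line in raw_text.splitlines():
        if not line.strip():  # skip blank or whitespace-only lines
            continue
        tokens = toy_tokenize(line)
        examples.append(tokens[:block_size])  # tokens past block_size are dropped
    return examples


corpus = "a short line\n\nthis line has more than four tokens in it\n"
examples = line_by_line_examples(corpus, block_size=4)
print(examples)
```

If this picture is right, a line that holds several sentences would simply lose everything past block_size tokens, which is the concern behind question 2.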