Hey there.
I am pre-training RoBERTa for masked language modelling, and I trained a tokenizer with ByteLevelBPETokenizer. That produced merges.txt and vocab.json, whose path I have passed on to RobertaForMaskedLM, using a line-by-line dataset.
My question is: what is the default train/test/validation split ratio when pre-training with Trainer, and is there any way to change it? I am getting the loss, epochs, and learning rate in the output, but I never provided any particular split ratio explicitly.
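To make the question concrete, here is a minimal sketch of the kind of explicit split I had in mind (the 90/10 ratio, the seed, and the stand-in corpus lines are placeholders of mine, not anything Trainer does by default):

```python
import random

# Stand-in for the lines of a line-by-line pre-training corpus
lines = [f"sentence {i}" for i in range(100)]

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(lines)   # shuffle before splitting

split = int(0.9 * len(lines))  # hypothetical 90/10 train/valid ratio
train_lines = lines[:split]
valid_lines = lines[split:]

print(len(train_lines), len(valid_lines))  # 90 10
```

The idea would be that `train_lines` and `valid_lines` end up in two separate line-by-line datasets, passed to Trainer as `train_dataset` and `eval_dataset`.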
Thanks