In the Colab notebook "01_how-to-train.ipynb", the training dataset is defined with a single file_path, as shown below:

```python
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
```
How can I give multiple files to this dataset? Thank you!
The tutorial "How to train a new language model from scratch using Transformers and Tokenizers" defines a custom dataset class and uses the following line to handle multiple files:

```python
src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")
```

However, that tutorial does not use the new Trainer approach from the Colab notebook.
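One possible workaround (a sketch, not something from the notebook itself): since LineByLineTextDataset accepts only a single file_path, you could concatenate the individual files into one file and point the dataset at that. The merge_text_files helper and the file patterns below are hypothetical names used for illustration:

```python
from pathlib import Path

def merge_text_files(data_dir: str, pattern: str, out_path: str) -> str:
    """Concatenate all files matching `pattern` in `data_dir` into one file.

    LineByLineTextDataset treats each non-empty line as one training
    example, so joining the files line by line preserves that semantics.
    (This helper is a hypothetical sketch, not part of the notebook.)
    """
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as sink:
        # sorted() makes the merge order deterministic across runs
        for src in sorted(Path(data_dir).glob(pattern)):
            with src.open("r", encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line:  # skip blank lines
                        sink.write(line + "\n")
    return str(out)

# The merged file can then be passed as the single file_path, e.g.:
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path=merge_text_files("./data/", "*-train.txt", "./train_merged.txt"),
#     block_size=128,
# )
```

Alternatively, you could build one LineByLineTextDataset per file and combine them with torch.utils.data.ConcatDataset, which the Trainer can consume like any other PyTorch dataset.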