In the Colab notebook "01_how-to-train.ipynb", the training dataset is defined with a single file_path, as shown below:

```python
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
```
How can I give multiple files to this dataset? Thank you!
The tutorial "How to train a new language model from scratch using Transformers and Tokenizers" defines a custom dataset class and uses the following line to handle multiple files:

```python
src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")
```

However, that tutorial does not use the new Trainer approach from the Colab notebook.
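One possible workaround (a sketch, not something from the notebook itself): since LineByLineTextDataset accepts only a single file_path, you could concatenate the individual files into one file and point the dataset at that. The merge_text_files helper and the file patterns below are hypothetical names used for illustration:

```python
from pathlib import Path

def merge_text_files(data_dir: str, pattern: str, out_path: str) -> str:
    """Concatenate all files matching `pattern` in `data_dir` into one file.

    LineByLineTextDataset treats each non-empty line as one training
    example, so joining the files line by line preserves that semantics.
    (This helper is a hypothetical sketch, not part of the notebook.)
    """
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as sink:
        # sorted() makes the merge order deterministic across runs
        for src in sorted(Path(data_dir).glob(pattern)):
            with src.open("r", encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line:  # skip blank lines
                        sink.write(line + "\n")
    return str(out)

# The merged file can then be passed as the single file_path, e.g.:
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path=merge_text_files("./data/", "*-train.txt", "./train_merged.txt"),
#     block_size=128,
# )
```

Alternatively, you could build one LineByLineTextDataset per file and combine them with torch.utils.data.ConcatDataset, which the Trainer can consume like any other PyTorch dataset.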