In the Colab notebook “01_how-to-train.ipynb”, the training dataset is defined with only a single file_path, as shown below:
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
How can I pass multiple files to this dataset? Thank you!
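One simple workaround (not from the notebook; the file names and the helper below are hypothetical) is to concatenate all the training files into a single file first and hand that one path to LineByLineTextDataset:

```python
from pathlib import Path

def merge_text_files(pattern: str, src_dir: str, out_path: str) -> str:
    """Concatenate every file matching `pattern` under `src_dir`
    into one file at `out_path` and return that path.
    Hypothetical helper, not part of transformers."""
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as merged:
        for src in sorted(Path(src_dir).glob(pattern)):
            text = src.read_text(encoding="utf-8")
            merged.write(text)
            if not text.endswith("\n"):
                merged.write("\n")  # keep one example per line across file boundaries
    return str(out)

# merged_path = merge_text_files("*-train.txt", "./data", "./all-train.txt")
# dataset = LineByLineTextDataset(
#     tokenizer=tokenizer,
#     file_path=merged_path,
#     block_size=128,
# )
```

This keeps the notebook’s Trainer setup unchanged, since LineByLineTextDataset still sees exactly one file_path.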
In the tutorial “How to train a new language model from scratch using Transformers and Tokenizers”, a custom dataset class is defined, and the line

src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")

is used to load multiple files. However, that tutorial does not use the new Trainer approach shown in the Colab notebook.
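One way to keep that glob-based multi-file idea while still using the Trainer is to collect the lines from every matching file into one dataset. The class below is only a standard-library sketch of the shape (the class name and layout are my own, not the tutorial’s); in a real script you would subclass torch.utils.data.Dataset and tokenize the lines:

```python
from pathlib import Path

class MultiFileLineDataset:
    """Collect the non-empty lines of every file matching `pattern`
    under `root`. Hypothetical sketch mirroring the tutorial's
    glob-based loading; a real version would subclass
    torch.utils.data.Dataset and return tokenized examples."""

    def __init__(self, root: str, pattern: str):
        self.lines = []
        for src in sorted(Path(root).glob(pattern)):
            self.lines.extend(
                line
                for line in src.read_text(encoding="utf-8").splitlines()
                if line.strip()  # skip blank lines, as LineByLineTextDataset does
            )

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, i):
        return self.lines[i]
```

Alternatively, since Trainer accepts any torch dataset, it should also work to build one LineByLineTextDataset per file and combine them with torch.utils.data.ConcatDataset, e.g. ConcatDataset([LineByLineTextDataset(tokenizer=tokenizer, file_path=str(p), block_size=128) for p in Path("./data/").glob("*-train.txt")]).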