How to set multiple files in `LineByLineTextDataset`?

In the Colab notebook "01_how-to-train.ipynb", the training dataset is defined with a single file_path, as shown below:

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

How can I give multiple files to this dataset? Thank you!

In the tutorial "How to train a new language model from scratch using Transformers and Tokenizers", a custom dataset class is defined, and it uses the following line

src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")

to achieve this. But that tutorial does not use the new Trainer API the way the Colab notebook does.
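One option that keeps the Trainer workflow (a sketch, not from either tutorial) is torch.utils.data.ConcatDataset: build one dataset per file and concatenate them, since Trainer only needs an object with __len__ and __getitem__. The toy lists below stand in for per-file datasets; in practice each element would be a LineByLineTextDataset built from one file_path.

```python
from torch.utils.data import ConcatDataset

# Toy stand-ins for per-file datasets; in practice each would be a
# LineByLineTextDataset(tokenizer=tokenizer, file_path=p, block_size=128)
# built from one file path p.
file_a = ["example 1", "example 2"]
file_b = ["example 3"]

# ConcatDataset presents the per-file datasets as one indexable dataset.
dataset = ConcatDataset([file_a, file_b])
print(len(dataset))  # 3
print(dataset[2])    # example 3
```

The resulting ConcatDataset can then be passed to Trainer as train_dataset just like a single LineByLineTextDataset.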


You can merge the files first:

cat ./data/*.txt > merged.txt

(write merged.txt outside the globbed directory, otherwise the shell creates it before cat runs and it can be swallowed into its own input), and then pass the merged file as file_path when building the dataset.
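If you would rather stay in Python (e.g. on Windows, where cat is unavailable), the same merge can be done with pathlib. This is a minimal sketch; merge_text_files is a hypothetical helper name, and the glob pattern follows the "*-train.txt" convention from the tutorial.

```python
from pathlib import Path

def merge_text_files(src_dir, pattern, out_path):
    """Concatenate every file in src_dir matching pattern into out_path.

    A newline is inserted between files when needed so that the last line
    of one file and the first line of the next are not fused together,
    which matters because LineByLineTextDataset treats each line as one
    training example.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as out:
        for src in sorted(Path(src_dir).glob(pattern)):
            text = src.read_text(encoding="utf-8")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path
```

The returned path can then be used directly as file_path when constructing LineByLineTextDataset.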