How to set multiple files in `LineByLineTextDataset`?

In the Colab notebook "01_how-to-train.ipynb", the training dataset is defined with a single file_path, as shown below:

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

How can I give multiple files to this dataset? Thank you!

In the tutorial "How to train a new language model from scratch using Transformers and Tokenizers", a custom dataset class is defined, and it uses the following line

src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")

to achieve this. But that tutorial does not use the new Trainer API the way the Colab notebook does.
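One option that keeps the Trainer workflow (a sketch, not from either tutorial) is torch.utils.data.ConcatDataset: build one dataset per file and concatenate them, since Trainer only needs an object with __len__ and __getitem__. The toy lists below stand in for per-file datasets; in practice each element would be a LineByLineTextDataset built from one file_path.

```python
from torch.utils.data import ConcatDataset

# Toy stand-ins for per-file datasets; in practice each would be a
# LineByLineTextDataset(tokenizer=tokenizer, file_path=p, block_size=128)
# built from one file path p.
file_a = ["example 1", "example 2"]
file_b = ["example 3"]

# ConcatDataset presents the per-file datasets as one indexable dataset.
dataset = ConcatDataset([file_a, file_b])
print(len(dataset))  # 3
print(dataset[2])    # example 3
```

The resulting ConcatDataset can then be passed to Trainer as train_dataset just like a single LineByLineTextDataset.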


You can merge the files first:

cat ./data/*.txt > merged.txt

(write merged.txt outside the globbed directory, otherwise the shell creates it before cat runs and it can be swallowed into its own input), and then pass the merged file as file_path when building the dataset.
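If you would rather stay in Python (e.g. on Windows, where cat is unavailable), the same merge can be done with pathlib. This is a minimal sketch; merge_text_files is a hypothetical helper name, and the glob pattern follows the "*-train.txt" convention from the tutorial.

```python
from pathlib import Path

def merge_text_files(src_dir, pattern, out_path):
    """Concatenate every file in src_dir matching pattern into out_path.

    A newline is inserted between files when needed so that the last line
    of one file and the first line of the next are not fused together,
    which matters because LineByLineTextDataset treats each line as one
    training example.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as out:
        for src in sorted(Path(src_dir).glob(pattern)):
            text = src.read_text(encoding="utf-8")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path
```

The returned path can then be used directly as file_path when constructing LineByLineTextDataset.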