Train through multiple datasets

Hi, I’m a beginner with Hugging Face. I currently want to handle a very big dataset. I split it into smaller ones so that I can process them by using less RAM. However, I’ve seen that Hugging Face does not allow iterating through multiple datasets. Is there a way to iterate through several datasets, or to load them without training from the beginning?

Hi!

I split it into smaller ones so that I can process them by using less RAM.

You can skip this step: our datasets library memory-maps datasets when loading them, so it can load and preprocess datasets that are bigger than RAM.

However, I’ve seen that Hugging Face does not allow iterating through multiple datasets. Is there a way to iterate through several datasets, or to load them without training from the beginning?

How would you like to iterate over these datasets? If you want to process them one after another as a single dataset, you can use concatenate_datasets; if you want their rows to alternate, use interleave_datasets.
