Multiple Custom PyTorch Datasets

Hello,

I want to create a Huggingface Dataset to host on the Hub, but I have a somewhat complex scenario and am looking for advice on what’s the best approach to do this.

I have two different auto-regressive training tasks, let’s call them “A” and “B”. For each of these tasks I have 3 different datasets, and each of those datasets has 3 splits (train, val, test). Each dataset is described by two files: a FASTA file (for which I have a custom reading function) and a splits.json that assigns rows of the FASTA file to a split. Visually, all the data looks like this:

A
├── dataset 1
│   ├── data.fasta
│   └── splits.json
└── dataset 2
    ├── data.fasta
    └── splits.json

B
├── dataset 1
│   ├── data.fasta
│   └── splits.json
└── dataset 2
    ├── data.fasta
    └── splits.json

Currently, I’m reading the data with a simple PyTorch Dataset class. Is the best approach to 1) upload this data structure to the Hub, 2) create a dataset loading script whose _generate_examples uses the PyTorch Dataset class, and 3) leverage the loading script’s configurations and splits? I could set the configuration to A or B and the splits to train, val, test, but I’m not sure how to select the dataset (i.e., dataset 1 or dataset 2). A sketch of what I’m hoping the end result looks like is below.
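Roughly, I’d like users to be able to do something like this (the repo and config names here are hypothetical, just to illustrate the idea):

```python
from datasets import load_dataset

# Hypothetical repo and config names: the config would select the task
# and dataset ("A-dataset1", "B-dataset2", ...), and the split argument
# would select the train/val/test partition defined by splits.json.
train = load_dataset("my-username/my-fasta-datasets", name="A-dataset1", split="train")
val = load_dataset("my-username/my-fasta-datasets", name="A-dataset1", split="validation")
```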

Any advice would be greatly appreciated.

Thanks!!


The backend of the HF library is usually torch, so a loading script that wraps a torch Dataset in _generate_examples should work fine, and if you’re already familiar with that, it’s a reasonable way to go. That said, the current recommendation is to build the dataset with the DatasetBuilder classes in the Hugging Face datasets library where possible; those also give you machinery for defining splits (see the sketch below).
I couldn’t find a good English how-to page, so I’ll point to a Japanese page instead, but I think you can follow the flow even from a translation, or roughly just from the code and class names. I don’t know whether it will fit your dataset exactly.
If it doesn’t fit the existing framework well, it may be faster and cause fewer problems to upload it as-is in two or three parts rather than trying to force it to fit.
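Here is a minimal sketch of what the builder approach could look like for your layout, with one config per (task, dataset) pair so that “dataset 1 vs. dataset 2” becomes part of the config name. read_fasta is a placeholder for your custom FASTA reader, and I’m assuming splits.json maps record IDs to split names:

```python
import json

import datasets

# Placeholder for the custom FASTA reading function mentioned above;
# assumed to yield (record_id, sequence) pairs.
def read_fasta(path):
    ...

class FastaConfig(datasets.BuilderConfig):
    def __init__(self, task=None, subset=None, **kwargs):
        super().__init__(**kwargs)
        self.task = task      # "A" or "B"
        self.subset = subset  # "dataset1" or "dataset2"

class FastaDataset(datasets.GeneratorBasedBuilder):
    # One config per (task, dataset) pair, e.g. "A-dataset1".
    BUILDER_CONFIGS = [
        FastaConfig(name=f"{task}-{subset}", task=task, subset=subset)
        for task in ("A", "B")
        for subset in ("dataset1", "dataset2")
    ]

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"id": datasets.Value("string"), "sequence": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Paths are relative to the dataset repository on the Hub.
        base = f"{self.config.task}/{self.config.subset}"
        fasta = dl_manager.download(f"{base}/data.fasta")
        splits = dl_manager.download(f"{base}/splits.json")
        return [
            datasets.SplitGenerator(
                name=split_enum,
                gen_kwargs={"fasta": fasta, "splits": splits, "split": split_name},
            )
            for split_enum, split_name in [
                (datasets.Split.TRAIN, "train"),
                (datasets.Split.VALIDATION, "val"),
                (datasets.Split.TEST, "test"),
            ]
        ]

    def _generate_examples(self, fasta, splits, split):
        with open(splits) as f:
            assignment = json.load(f)  # assumed format: {record_id: split_name}
        for idx, (rec_id, seq) in enumerate(read_fasta(fasta)):
            if assignment.get(rec_id) == split:
                yield idx, {"id": rec_id, "sequence": seq}
```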


Thank you! This is very helpful. Using a generator seems easiest, but from what I can see, the dataset builder offers great flexibility. For instance, it would make it easy to add a max_text_length argument, or any other argument that changes how the dataset is created, on demand (see the sketch below). With a generator or predefined methods, the datasets would be static.
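Something like this is what I have in mind; max_text_length is a hypothetical argument on a custom config class, and the repo name is again made up:

```python
import datasets

class FastaConfig(datasets.BuilderConfig):
    def __init__(self, max_text_length=None, **kwargs):
        super().__init__(**kwargs)
        # Hypothetical on-demand argument; _generate_examples could then
        # truncate with seq[: self.config.max_text_length].
        self.max_text_length = max_text_length

# Extra keyword arguments to load_dataset are forwarded to the
# BuilderConfig, so callers can set the value at load time:
ds = datasets.load_dataset(
    "my-username/my-fasta-datasets",
    name="A-dataset1",
    split="train",
    max_text_length=512,
)
```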

I’m going with the builder class!

