How to create a custom-format dataset with a custom batch load function?


I need to create a Hugging Face dataset backed by a custom underlying file format. I saw this issue [I need to read the custom dataset in conll format · Issue #5014 · huggingface/datasets · GitHub], where the from_generator function is suggested.
But in my case, loading in batches is preferable to loading the samples one by one, so I don't think from_generator is a good fit.

How should I implement this? Should I subclass the Dataset class and override the __getitem__ method?

For example, when a user calls dataset[list(range(1, 100, 2))], my custom batch_load function should be called once, instead of samples being loaded one by one 50 times.


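I think you should generate the dataset with from_generator and then use set_transform to define a transform that runs whenever the dataset is indexed. The transform receives the requested rows as a batch (a dict of lists), so dataset[list(range(1, 100, 2))] triggers a single call to the transform rather than 50.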