Hi,
I need to create a Hugging Face dataset backed by a custom underlying file format. I saw this issue [I need to read the custom dataset in conll format · Issue #5014 · huggingface/datasets · GitHub], where the from_generator function is suggested.
But in my case, loading samples in batches is preferable to loading them one by one, so I don't think from_generator is a good fit.
How should I implement this? Should I subclass the Dataset class and override its __getitem__ method?
For example, when a user calls dataset[list(range(1, 100, 2))], my custom batch_load function should be called once for all 50 indices, instead of the samples being loaded one by one 50 times.
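To make the behavior I'm after concrete, here is a minimal sketch in plain Python. It does not subclass datasets.Dataset or use any of its internals. The BatchLoadedDataset class, the batch_load_calls counter, and the body of batch_load are hypothetical placeholders for my real reader.

```python
from typing import Dict, List, Union


class BatchLoadedDataset:
    """Hypothetical wrapper (not the datasets.Dataset API) showing the desired behavior."""

    def __init__(self, num_rows: int):
        self.num_rows = num_rows
        self.batch_load_calls = 0  # counts calls, for illustration only

    def batch_load(self, indices: List[int]) -> Dict[str, List]:
        # Placeholder: the real version would read all requested samples
        # from the custom file format in a single pass.
        self.batch_load_calls += 1
        return {"id": list(indices), "text": [f"sample {i}" for i in indices]}

    def __len__(self) -> int:
        return self.num_rows

    def __getitem__(self, key: Union[int, slice, List[int]]):
        if isinstance(key, int):
            # Single index: load a batch of one and unwrap it into a row.
            batch = self.batch_load([key])
            return {name: values[0] for name, values in batch.items()}
        if isinstance(key, slice):
            # Turn the slice into explicit indices first.
            key = list(range(*key.indices(self.num_rows)))
        # List of indices: one batch_load call for the whole request.
        return self.batch_load(list(key))


dataset = BatchLoadedDataset(num_rows=100)
batch = dataset[list(range(1, 100, 2))]
print(len(batch["id"]), dataset.batch_load_calls)  # prints "50 1"
```

What I'd like is this same single-call behavior, but on an object that is still a proper Hugging Face dataset.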
Thanks!