Create new Dataset for very large files that need to be sampled

Hi,

I haven’t found this discussed in this forum or in the datasets documentation; sorry if I missed it.

I have thousands of very large values_files that contain signal data. (Note: the .cache/huggingface directory will probably have to be placed on the same filesystem as the large files, because that’s the only place with enough space.)

My training/test/validation examples would come from sampling those signal files to extract numerical values, as well as sampling a reference_file that would produce text strings that then need to get tokenised.

The general problem for the neural network will be to map text strings to signals (a regression problem). For each example, we sample the same reference_file and a different values_file:

reference_file → produces the string that needs to be tokenised
values_file → produces values for regression, as well as some metadata

The sampling would follow rules implemented in a sampling_policy() function. The same file can be sampled randomly multiple times at multiple positions. I don’t need to keep the examples themselves, but I’d need to keep a log of which files have been sampled and at which positions, and be able to recreate the examples used in training for debugging purposes.
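For the log, I’m imagining something like this (just a sketch; log_sample() and the record fields are placeholders for whatever my script actually needs):

import json

def log_sample(log_path, values_file, position, seed):
    # append one JSON record per sample so the example can be recreated later
    with open(log_path, "a") as f:
        f.write(json.dumps({"values_file": values_file, "position": position, "seed": seed}) + "\n")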

The sampling process may happen several times, to produce multiple datasets of examples. I’m not sure whether these would count as different Configurations, because they’d all have the same text / regression-values format; it would just be repeating the sampling process with different seeds and/or sampling_policy() functions.

What would be the best way to integrate this with the Datasets library?

  1. Write a Python script that has nothing to do with Datasets: it samples the files and writes a CSV/JSON/Parquet examples file, and then I simply call load_dataset() on that file. This seems the simplest way, but it requires creating very large example files, whereas a much smaller file recording which file was sampled and where would be enough (see the sketch after this list).

  2. Write a new dataset loading script from the template, in a way that the data can be streamed. The filenames of the large files would be features, and _generate_examples() would yield (key, example) tuples, where example could have features like these:

    example['text_to_tokenise']
    example['tokenised_text']
    example['regression_values']
    example['property_1']
    example['property_2']
    example['property_3']
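
For option 1, something roughly like this is what I have in mind (only a sketch; sample_example(), the sampling_policy() signature and the column names are placeholders for my own code):

import random
import pandas as pd

def sample_example(reference_file, values_file, position):
    # placeholder: read the two files at `position` and return
    # (text_to_tokenise, regression_values, metadata_dict)
    raise NotImplementedError

def build_examples_file(values_files, reference_file, sampling_policy, seed, out_path):
    rng = random.Random(seed)
    rows = []
    for values_file in values_files:
        # the policy decides which positions of each values_file to sample
        for position in sampling_policy(values_file, rng):
            text, values, metadata = sample_example(reference_file, values_file, position)
            rows.append({"values_file": values_file, "position": position,
                         "text_to_tokenise": text, "regression_values": values, **metadata})
    pd.DataFrame(rows).to_parquet(out_path)

# afterwards:
# from datasets import load_dataset
# ds = load_dataset("parquet", data_files=out_path)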
    

Ideas and suggestions are very welcome, thanks!

Hi! Both options are fine IMO 🙂

If you think your work is a one-shot experiment, you can go with the first option, since it’s probably simpler.

The second option requires a bit more work, but it would save people some disk space and also let other people choose the sampling_policy:

# sample the dataset once and save to disk
ds = load_dataset("rcasero/my_dataset")
# sample the dataset on-the-fly while streaming
ds = load_dataset("rcasero/my_dataset", streaming=True)

# customize the sampling
ds = load_dataset("rcasero/my_dataset", sampling_policy="split_per_line")

# use your own policy
def sampling_policy(*args, **kwargs):  # arguments are up to your loading script
    pass

ds = load_dataset("rcasero/my_dataset", sampling_policy=sampling_policy)