Hi,
I haven’t found this discussed in this forum or in the Datasets documentation; sorry if I missed it.
I have thousands of very large `values_file`s that contain signal data. (Note: the `.cache/huggingface` directory will probably have to be placed on the same filesystem as the large files, because it’s the only place where there’s enough space.)
My training/test/validation examples would come from sampling those signal files to extract numerical values, as well as sampling a `reference_file` that would produce text strings that then need to get tokenised. The general problem for the neural network will be to map text strings to signals (a regression problem). For each example, we sample the same `reference_file` and a different `values_file`:

- `reference_file` → produces the string that needs to be tokenised
- `values_file` → produces values for regression, as well as some metadata
The sampling would follow some rules implemented in a `sampling_policy()` function. The same file can be sampled randomly multiple times in multiple positions. I don’t need to keep the examples themselves, but I’d need to keep a log of which files have been sampled and in which positions, and be able to recreate examples used in training for debug purposes.
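To make the reproducibility requirement concrete, here is a minimal sketch of the kind of log I have in mind. The `sampling_policy()` body, the field names and the JSON-lines format are just placeholders for illustration:

```python
import json
import random

def sampling_policy(values_files, rng):
    # Placeholder policy: pick one file and one position at random.
    values_file = rng.choice(values_files)
    offset = rng.randrange(0, 1_000_000)  # made-up position within the file
    length = 256                          # made-up number of values per example
    return values_file, offset, length

def write_sampling_log(values_files, n_examples, seed, log_path):
    # Record which file was sampled and where, so examples can be recreated later.
    rng = random.Random(seed)
    with open(log_path, "w") as f:
        for example_id in range(n_examples):
            values_file, offset, length = sampling_policy(values_files, rng)
            f.write(json.dumps({
                "example_id": example_id,
                "seed": seed,
                "values_file": values_file,
                "offset": offset,
                "length": length,
            }) + "\n")
```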
The sampling process may happen several times, to produce multiple datasets of examples. I’m not sure whether these would be considered different Configurations, because they’d have the same text / regression-values format; it would just be like repeating the sampling process with different seeds and/or `sampling_policy()` functions.
What would be the best way to integrate this with the Datasets library?
- Write a Python script that has nothing to do with Datasets, which samples the files and creates a CSV/JSON/Parquet examples file, and then simply `load_dataset()` the examples file. This seems the simplest way, but it requires creating very large example files, whereas a much smaller file saying which file was sampled and where could be enough. (A rough sketch is in the first snippet after this list.)
- Write a new dataset loading script from the template, in a way that the data could be streamed. The filenames of the large files are features, and `_generate_examples()` would return a generator of examples that produces `(key, example)` tuples, where `example` could have features like `example['text_to_tokenise']`, `example['tokenised_text']`, `example['regression_values']`, `example['property_1']`, `example['property_2']`, `example['property_3']`? (A rough sketch is in the second snippet after this list.)
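For option 1, this is roughly what I mean. The `extract_example()` helper, the file names and the field values are hypothetical; the point is just that the sampled examples get materialised into a Parquet file and then loaded with the generic loader:

```python
import pandas as pd
from datasets import load_dataset

def extract_example(values_file, offset, length):
    # Hypothetical stand-in for the real extraction: read `length` values from
    # `values_file` starting at `offset`, plus the matching reference text.
    return {
        "text_to_tokenise": f"text sampled for {values_file}@{offset}",
        "regression_values": [0.0] * length,
        "property_1": offset,
    }

# `sampled` would come from sampling_policy() / the sampling log.
sampled = [("signals_000.dat", 0, 256), ("signals_001.dat", 512, 256)]

rows = [extract_example(f, off, n) for f, off, n in sampled]
pd.DataFrame(rows).to_parquet("examples.parquet")

# Load the materialised examples file with the built-in Parquet loader.
dataset = load_dataset("parquet", data_files="examples.parquet")
```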
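And for option 2, a very rough sketch of the loading script I have in mind. The feature names/types, the sampling-log file and the `read_reference()` / `read_signal()` readers are all placeholders for the real thing:

```python
import json
import datasets

def read_reference(reference_file, offset):
    # Hypothetical reader that turns a position in the reference_file into a string.
    return f"text at {offset} in {reference_file}"

def read_signal(values_file, offset, length):
    # Hypothetical reader that extracts `length` values from a values_file.
    return [0.0] * length

class SignalTextDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "text_to_tokenise": datasets.Value("string"),
                "regression_values": datasets.Sequence(datasets.Value("float32")),
                "values_file": datasets.Value("string"),
                "offset": datasets.Value("int64"),
            }),
        )

    def _split_generators(self, dl_manager):
        # One split per sampling log; the log decides which files/positions are used.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"log_path": "train_sampling_log.jsonl",
                            "reference_file": "reference.dat"},
            ),
        ]

    def _generate_examples(self, log_path, reference_file):
        with open(log_path) as f:
            for key, line in enumerate(f):
                entry = json.loads(line)
                yield key, {
                    "text_to_tokenise": read_reference(reference_file, entry["offset"]),
                    "regression_values": read_signal(
                        entry["values_file"], entry["offset"], entry["length"]),
                    "values_file": entry["values_file"],
                    "offset": entry["offset"],
                }
```

With something like this, the large files would never be copied into an intermediate examples file, and the sampling log alone should be enough to recreate any example.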
Ideas and suggestions are very welcome, thanks!