I want to convert my custom dataset into a Hugging Face dataset by creating it with `from_generator` and then calling `save_to_disk`, so that I can `load_from_disk` later.
The generator function yields samples like the following:

```python
import numpy as np

# Each sample can have a different size for dimension 0,
# as mentioned in https://huggingface.co/docs/datasets/v3.0.2/en/about_dataset_features#dataset-features
# where it is suggested to use None, e.g.:
# features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
def generatorFn():
    ...
    sample = {
        "example": np.ones([4, 5]),   # 2D numpy array
        "label": np.ones([10, 5]),    # 2D numpy array; dim 0 may differ per sample
        "numbers": np.ones([7]),      # 1D numpy array
        "has_noise": True,            # bool
    }
    yield sample
```
What would the recommended types in the features be? `Sequence`, or `Array2D` + `Array1D`? And how do I make sure I get back the correct shape and dtype when loading the dataset?
Additional info: I have two large files, one which stores examples as rows and another which stores labels along with some metadata as rows. In the example file, each row consists of 2D coordinates of dtype float32, saved in the format `12.23,2332.343:54.45,767.767:...`, where a comma separates the values within a 2D coordinate and a colon separates consecutive coordinates.
I have some processor functions to convert each row into a 2D numpy array, and some other processor functions to read rows of the label file, which contains 2D coordinates along with some metadata.
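To illustrate, one of my processor functions for the example file looks roughly like this (the function name is just for illustration):

```python
import numpy as np

def parse_example_row(row: str) -> np.ndarray:
    """Parse e.g. '12.23,2332.343:54.45,767.767' into a float32 array of shape (N, 2)."""
    pairs = [coord.split(",") for coord in row.strip().split(":")]
    return np.asarray(pairs, dtype=np.float32)
```

So each example ends up as a `(N, 2)` float32 array, where `N` differs from row to row.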
I want to use Hugging Face Datasets because it has very good compression and an easy-to-use interface.