I'm currently trying to load a proprietary dataset from disk. When the script finishes loading the train split I receive the error: TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray).
The files themselves are a large set of videos. I load each video with pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info, e.g. class labels and filenames.
The script works fine when I extract a single frame from each video, but raises the TypeError with any more. I can only presume it is something to do with memory size; 10 frames per video uses ~16 GB of my 32 GB of memory.
Is there some way to do intermediate Arrow-file writes to lower memory overhead?
Alternatively, is there some way to pass user-defined variables to a load_dataset call, so that I can load the dataset multiple times extracting different frames and then concatenate the resulting datasets?
Alternatively, any ideas on how to fix the error? I'm using datasets 1.17.0.
Thanks in advance
Hi! Could you please copy and paste the entire stack trace?
Is there some way to do intermediate Arrow-file writes to lower memory overhead?
You can control RAM usage in dataset scripts with the DEFAULT_WRITER_BATCH_SIZE attribute of GeneratorBasedBuilder (as we do in this script).
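For instance, a minimal sketch of what that looks like in a dataset script (the class name and the value 100 are just placeholders; tune the batch size to your frame sizes):

import datasets

class VideoFramesDataset(datasets.GeneratorBasedBuilder):
    # Flush examples to the Arrow file every 100 rows instead of the
    # library default, so only a small window of decoded frames is in RAM.
    DEFAULT_WRITER_BATCH_SIZE = 100

    def _info(self):
        ...

    def _split_generators(self, dl_manager):
        ...

    def _generate_examples(self, **kwargs):
        ...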
The files themselves are a large set of videos. I load each video with pims and extract a set of frames as PIL Images. These are then stored in dataset rows with a few other bits of info, e.g. class labels and filenames.
In datasets 1.18.0, we added support for nested decoding of the Image feature, which is ideal for your use case. To use it, just define the features dict as:
from datasets import Features, Image, Sequence

features = Features({
    "frames": Sequence(Image(), length=10),
    "meta": ...
})
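For example, a rough sketch of a _generate_examples that pairs with these features, assuming pims frames convert cleanly through numpy (video_paths and the "meta" value are placeholders):

import numpy as np
import pims
from PIL import Image as PILImage

def _generate_examples(self, video_paths):
    for idx, path in enumerate(video_paths):
        video = pims.open(path)
        # Take the first 10 frames as PIL images; the nested Image()
        # feature handles encoding them when the rows are written.
        frames = [PILImage.fromarray(np.asarray(video[i])) for i in range(10)]
        yield idx, {"frames": frames, "meta": path}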
Thanks for the fast response. I'll need a day or two to get the full stack trace; I'll have to go back to an earlier version of the code. I found a hacky way around it by hardcoding a rather large number of BuilderConfigs, one for each frame, and then concatenating the individually 'configged' datasets.
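Roughly, the workaround looks like this (the script path and config names here are made up for illustration):

from datasets import concatenate_datasets, load_dataset

# One hardcoded BuilderConfig per frame index in the dataset script
frame_configs = [f"frame_{i}" for i in range(10)]

per_frame = [load_dataset("my_video_script.py", name)["train"] for name in frame_configs]
train = concatenate_datasets(per_frame)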
It's also possible that this was fixed in 1.18; I had no idea there was a new version.
With regards to the new features in 1.18, they look perfect; I shall investigate!