Hi,
I'm currently working on a project that downloads datasets from open-source hosted files (zip or H5); my goal is to make them available on HF Datasets in chunked form.
My workflow:
original dataset → download to temp → process → chunk → upload chunks to HF
The problem I'm facing: HF Jobs only allows 50 GB of ephemeral disk space,
but my datasets need around 70 GB on average, and holding both the downloaded archive and the processed copy roughly doubles the disk usage, so I blow past the limit.
I tried streaming the files with fsspec (to avoid downloading them entirely) and using zarr for chunking.
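To make the idea concrete, here is a minimal sketch of the chunk-at-a-time pattern I'm going for: read the source stream in fixed-size pieces, write each piece to a temp file, upload it with `huggingface_hub`, and delete it before reading the next one, so peak disk use stays around one chunk instead of the full 70 GB. The URL and repo id are placeholders, and the helper names are my own:

```python
import io
import os
import tempfile

# Placeholders -- swap in your real source and repo.
SRC_URL = "https://example.org/data/archive.zip"
REPO_ID = "my-org/my-dataset"
CHUNK_SIZE = 1 << 30  # 1 GiB per chunk; peak disk use ~ one chunk

def iter_chunks(fileobj, chunk_size):
    """Yield successive byte chunks from a file-like object."""
    while True:
        buf = fileobj.read(chunk_size)
        if not buf:
            break
        yield buf

def stream_to_hub(fileobj, repo_id, chunk_size=CHUNK_SIZE):
    """Write each chunk to a temp file, upload it, then delete it,
    so the ephemeral disk never holds more than one chunk."""
    from huggingface_hub import HfApi  # pip install huggingface_hub
    api = HfApi()
    for i, buf in enumerate(iter_chunks(fileobj, chunk_size)):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(buf)
            path = tmp.name
        api.upload_file(
            path_or_fileobj=path,
            path_in_repo=f"chunks/part-{i:05d}.bin",
            repo_id=repo_id,
            repo_type="dataset",
        )
        os.remove(path)  # free the disk before the next chunk

# The chunking itself, shown on an in-memory stream (no network needed):
chunks = list(iter_chunks(io.BytesIO(b"abcdefgh"), 3))
# chunks == [b"abc", b"def", b"gh"]
```

The source `fileobj` could come from `fsspec.open(SRC_URL, "rb")`, but the processing step in between (unzipping, H5 parsing) is where I get stuck, since those formats often need random access rather than a forward-only stream.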
Does anyone have a solution for this?
Thank you!