I have an issue with using datasets properly. If you have a better solution than my super dumb one, please share.
Context:
The task I am solving is panoptic segmentation.
I have a directory with the following structure:
I create the dataset from the DatasetDict, and then push_to_hub. The data gets magically converted to Arrow+Parquet and uploaded to HF Hub.
Everything is nice until the dataset grows to several million images; then the upload crashes at a random moment. It is also a very slow process.
So, I was advised to use HF CLI tool, specifically this PR.
It works amazingly well: fast and reliable.
However, now I have to convert the dataset to Arrow+Parquet myself.
Moreover, I cannot just shard the dataset and write each shard with to_parquet, because then the images are not embedded in the Parquet files, only their filenames.
So the solution below does not work:
    for k, v in dataset.items():
        print(f"Saving {k}")
        shard_size = 5000
        num_shards = len(v) // shard_size + 1
        for shard_idx in range(num_shards):
            shard = v.shard(index=shard_idx, num_shards=num_shards)
            shard.to_parquet(f"out/parquet/data/{k}-{str(shard_idx).zfill(5)}-of-{str(num_shards).zfill(5)}.parquet")  # 00000.parquet to 01023.parquet
What I have to do instead is first save the dataset to disk in Arrow format, then load it back, and only then save it as Parquet.
    import os
    import datasets

    # Round trip through Arrow on disk; after this, to_parquet writes the images themselves.
    dataset.save_to_disk("arrowdir", num_proc=48)
    dataset = datasets.load_from_disk("arrowdir")

    os.makedirs("out/parquet/data", exist_ok=True)
    for k, v in dataset.items():
        print(f"Saving {k}")
        shard_size = 5000
        # Number of files to save (e.g. try to keep files smaller than 5GB).
        num_shards = max(1, (len(v) + shard_size - 1) // shard_size)  # ceiling division
        for shard_idx in range(num_shards):
            shard = v.shard(index=shard_idx, num_shards=num_shards)
            shard.to_parquet(f"out/parquet/data/{k}-{shard_idx:05d}-of-{num_shards:05d}.parquet")  # 00000.parquet to 01023.parquet
This way I need disk space for 3x the size of the dataset: for the original images, for Arrow, and for Parquet.
I also don’t like that sharding is done manually, because that is error-prone.
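To make the manual sharding a bit less fragile, the shard count and file names could be computed in one place. This is only a minimal sketch: the helper name shard_paths and its parameters are hypothetical (not part of datasets), and it only builds paths; it does not write anything.

```python
import math
from pathlib import Path


def shard_paths(num_rows, shard_size, split, out_dir="out/parquet/data"):
    """Return one output path per shard: {split}-{idx:05d}-of-{total:05d}.parquet.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    # Ceiling division: no extra shard when num_rows is an exact multiple
    # of shard_size, and always at least one shard.
    num_shards = max(1, math.ceil(num_rows / shard_size))
    return [
        Path(out_dir) / f"{split}-{idx:05d}-of-{num_shards:05d}.parquet"
        for idx in range(num_shards)
    ]


paths = shard_paths(num_rows=12_000, shard_size=5_000, split="train")
for p in paths:
    print(p.name)
```

In the loop above, one would then iterate with enumerate(paths) and call v.shard(index=idx, num_shards=len(paths)).to_parquet(str(path)), so the shard count and the file name can no longer drift apart.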
Another way would be probably to use webdataset, and to do this I have to move my original files from