I am importing an image dataset from an external source that is several terabytes in size. In the future, I will need to update this dataset by adding new files.
I found that I can achieve this simply by placing the new Parquet files in the same folder as the existing ones while keeping the column names consistent.
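For reference, the manual append I'm describing looks roughly like this with `huggingface_hub`; the repo id, folder layout, and file names below are just placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload a new Parquet file next to the existing shards in the dataset repo.
# "user/my-image-dataset" and the "data/" folder are placeholders; the new file
# just needs the same columns/schema as the shards already in that folder.
api.upload_file(
    path_or_fileobj="new_batch.parquet",
    path_in_repo="data/new_batch-00000.parquet",
    repo_id="user/my-image-dataset",
    repo_type="dataset",
)
```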
Is there a way to do append-only uploads using the `datasets` library?
Features like incremental upload may still be in the works. There is an open feature request about this on the `datasets` repo (opened 10 Oct 2023, labeled `enhancement`):
### Feature request
Have the possibility to do `ds.push_to_hub(..., append=True)`.
### Motivation
Requested in this [comment](https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3#65252597c4edc168202a5eaa) and
this [comment](https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4#6524f675c9607bdffb208d8f). Discussed internally on [slack](https://huggingface.slack.com/archives/C02EMARJ65P/p1696950642610639?thread_ts=1690554266.830949&cid=C02EMARJ65P).
### Your contribution
What I suggest for Parquet datasets is to use `CommitOperationCopy` + `CommitOperationDelete` from `huggingface_hub`:
1. list files
2. copy files from parquet-0001-of-0004 to parquet-0001-of-0005
3. delete files like parquet-0001-of-0004
4. generate + add last parquet file parquet-0005-of-0005
=> make a single commit with all commit operations at once
I think it should be quite straightforward to implement. Happy to review a PR (maybe conflicting with the ongoing "1 commit push_to_hub" PR https://github.com/huggingface/datasets/pull/6269)
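For illustration, here is a rough sketch of the commit operations described in that issue, using the `huggingface_hub` API. The repo id, the `data/train-*` shard naming scheme, and the local file name are assumptions for the example, not something specified in the issue:

```python
from huggingface_hub import (
    HfApi,
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
)

api = HfApi()
repo_id = "user/my-dataset"  # placeholder repo

# 1. list the existing parquet shards (folder layout is an assumption)
shards = sorted(
    f for f in api.list_repo_files(repo_id, repo_type="dataset")
    if f.startswith("data/train-") and f.endswith(".parquet")
)
old_total = len(shards)    # e.g. 4 shards named ...-of-00004.parquet
new_total = old_total + 1  # one extra shard after the append

operations = []
# 2. + 3. rename each existing shard to reflect the new total:
#         copy it to its new name, then delete the old name
#         (CommitOperationCopy works on LFS-tracked files, which parquet shards normally are)
for i, old_path in enumerate(shards):
    new_path = f"data/train-{i:05d}-of-{new_total:05d}.parquet"
    operations.append(CommitOperationCopy(src_path_in_repo=old_path, path_in_repo=new_path))
    operations.append(CommitOperationDelete(path_in_repo=old_path))

# 4. add the new data as the last shard
operations.append(
    CommitOperationAdd(
        path_in_repo=f"data/train-{old_total:05d}-of-{new_total:05d}.parquet",
        path_or_fileobj="new_rows.parquet",  # placeholder local file with the new rows
    )
)

# => a single commit containing all operations at once
api.create_commit(
    repo_id=repo_id,
    repo_type="dataset",
    operations=operations,
    commit_message="Append a new parquet shard",
)
```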
By any chance, was this ever made into a feature request, or perhaps even implemented? I agree that this would be a great feature.