Any workaround for push_to_hub() limits?

Hi all,

What I am trying to do is push a dataset I created locally (around 1.2 TB) to the Hugging Face Hub. It contains large images paired with some textual data.

What I do:
Pass a generator to Dataset.from_generator(); it reads image files (as bytes, via datasets.Image.encode_example(value=some_pil_image)) and the textual info from local files:
dataset = Dataset.from_generator(dataset_gen, features=features)
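For context, features and dataset_gen look roughly like this (the column names, paths and sample list below are placeholders, not my exact code):

from datasets import Features, Image, Value
from PIL import Image as PILImage

# Hypothetical schema: one image column plus one text column.
features = Features({"image": Image(), "text": Value("string")})

# Placeholder list of (image_path, text) pairs read from my local files.
samples = [("/data/img_0001.png", "some caption"), ("/data/img_0002.png", "another caption")]

def dataset_gen():
    for image_path, text in samples:
        pil_image = PILImage.open(image_path)
        yield {
            # encode_example() turns the PIL image into a {"bytes": ..., "path": ...} dict
            "image": Image().encode_example(value=pil_image),
            "text": text,
        }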
After that I call:
dataset.push_to_hub(dataset_id)
But it limits each shard to around 500 MB (about 15 image files or so), which results in ~2400 parquet files to upload. This exceeds the rate limit for Hub commits.
If I push it like:
dataset.push_to_hub(dataset_id, num_shards=5)
It throws an error:

"
Map: 0%| | 0/1334 [00:06<?, ? examples/s]
Traceback (most recent call last):
  File "/home/ubuntu/hf_push.py", line 74, in <module>
    dataset.push_to_hub(dataset_id,
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5422, in push_to_hub
    repo_id, split, uploaded_size, dataset_nbytes, repo_files, deleted_size = self._push_parquet_shards_to_hub(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5289, in _push_parquet_shards_to_hub
    first_shard = next(shards_iter)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5271, in shards_with_embedded_external_files
    shard = shard.map(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2306, in embed_table_storage
    arrays = [
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2307, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2176, in embed_array_storage
    return feature.embed_storage(array)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/features/image.py", line 276, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2850, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
"
Any idea how I can overcome this?

Hi! Amazing to have an image-text dataset like that. Do you plan to share it with the community?

This exceeds the rate limit for Hub commits.

Do you mean that you got an error while uploading? Could you share the error message?

If I push it like:
dataset.push_to_hub(dataset_id, num_shards=5)
It throws an error:
Any idea how I can overcome this?

You can try increasing the max shard size instead:

dataset.push_to_hub(dataset_id, max_shard_size="2GB")

This should upload far fewer files than the default (roughly 600 shards for 1.2 TB instead of ~2400) while still keeping shards at a reasonable size.

Hi! Amazing to have an image-text dataset like that. Do you plan to share it with the community?

Hmm, I don't think I can for now 🙂

Do you mean that you got an error while uploading? Could you share the error message?

Yes, during the dataset.push_to_hub() call. The error message is:

Creating parquet from Arrow format: 100%|████████████████████████████████████████| 2/2 [00:03<00:00,  1.56s/ba]
Pushing dataset shards to the dataset hub:   3%|██▌                                | 62/2460 [10:15<6:36:28,  9.92s/it]
Traceback (most recent call last):
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 269, in hf_raise_for_status
    response.raise_for_status()
  File "/home/ubuntu/venv/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/ocg2347/xxx/commit/main

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/hf_push.py", line 76, in <module>
    dataset.push_to_hub(dataset_id)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5422, in push_to_hub
    repo_id, split, uploaded_size, dataset_nbytes, repo_files, deleted_size = self._push_parquet_shards_to_hub(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5308, in _push_parquet_shards_to_hub
    _retry(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 293, in _retry
    raise err
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 290, in _retry
    return func(*func_args, **func_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 969, in _inner
    return fn(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3761, in upload_file
    commit_info = self.create_commit(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 969, in _inner
    return fn(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3148, in create_commit
    hf_raise_for_status(commit_resp, endpoint_name="commit")
  File "/home/ubuntu/venv/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 320, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/ocg2347/xxxxx/commit/main (Request ID: Root=1-6533d846-25b5d4455696cd6462bc1422;585ab139-54e7-4fe8-ac3f-ce8288152d5d)

You have exceeded our hourly quotas for action: commit. We invite you to retry later.

You can try increasing the max shard size instead:

This time it fails with the same error as with num_shards=5:

Map:   0%|                                                                                                                                              | 0/2001 [00:18<?, ? examples/s]
Traceback (most recent call last):
  File "/home/ubuntu/hf_push.py", line 76, in <module>
    dataset.push_to_hub(dataset_id,
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5422, in push_to_hub
    repo_id, split, uploaded_size, dataset_nbytes, repo_files, deleted_size = self._push_parquet_shards_to_hub(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5289, in _push_parquet_shards_to_hub
    first_shard = next(shards_iter)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5271, in shards_with_embedded_external_files
    shard = shard.map(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2306, in embed_table_storage
    arrays = [
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2307, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/table.py", line 2176, in embed_array_storage
    return feature.embed_storage(array)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/datasets/features/image.py", line 276, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2850, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

Bonus:
If I save the dataset generated with the generator to local disk, then load it back and upload it like:

dataset = Dataset.from_generator(dataset_gen, features=features)
dataset.save_to_disk("/path/to/db")
dataset = datasets.load_from_disk("/path/to/db")
dataset.push_to_hub(dataset_id)

It uploads shards of ~1.2 GB, but again I hit the Hub's commit quota.

Good news, we improved how push_to_hub works: it now does one commit every 50 files instead of one commit per file. The improvement is available if you install datasets from source, and it will be available in 2.15 as well.

Feel free to try it out

This is great news! Thank you all for that. It seems I managed to solve this 🙂
Feedback for further improvement:

  • adaptive upload frequency, so that the commit quota is never exceeded (e.g. batching more files per commit; see the sketch after this list);
  • and of course, a better sharding implementation, so that users can upload large shards without problems.
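For the first point, this is roughly the kind of batching I have in mind, done manually with huggingface_hub's commit API (the repo id and file paths are placeholders, and this is just an illustration of the idea, not how push_to_hub is implemented):

from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()

# Hypothetical: 50 parquet shards already written locally, uploaded in a single
# commit instead of one commit per file. The batch size (here 50) could be
# adapted to stay under the hourly commit quota.
operations = [
    CommitOperationAdd(
        path_in_repo=f"data/train-{i:05d}.parquet",
        path_or_fileobj=f"/tmp/shards/train-{i:05d}.parquet",
    )
    for i in range(50)
]

api.create_commit(
    repo_id="username/my-dataset",  # placeholder
    repo_type="dataset",
    operations=operations,
    commit_message="Upload 50 shards in one commit",
)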

Again, thanks a lot 🙂

I have the same problem. One commit per 50 files is still too frequent for me, and I can't customize the shard size or number of shards due to the error above. Using datasets version '2.15.0'.

Most likely it is caused by around 1% of the entries at the beginning of the dataset being much larger than the rest. As a result, at the beginning of the push the commits are relatively rare and the shard size is reasonably large (around 400-500 MB), but for most of the entries the commits are too frequent and the shard size is extremely small (around 500 KB - 2 MB).

Apparently, the pusher infers the number of entries per shard once at the beginning and doesn't adjust it along the way, which is a rather strong assumption in my case.
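If I read the source correctly, the sharding is decided up front, roughly along these lines (a simplified sketch of my understanding, not the actual implementation):

from datasets import Dataset

def plan_shards(dataset: Dataset, dataset_nbytes: int, max_shard_size: int):
    # The number of shards is derived once from the total byte size...
    num_shards = int(dataset_nbytes / max_shard_size) + 1
    # ...and each shard then gets an (almost) equal number of *examples*, not bytes,
    # so a few huge examples at the start make the remaining shards tiny.
    return [
        dataset.shard(num_shards=num_shards, index=i, contiguous=True)
        for i in range(num_shards)
    ]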

Solved for now by setting max shard size to 700 MB:

batch.push_to_hub('zeio/auto-batch', config_name = 'spoken', max_shard_size = '700MB')

That's correct. And indeed, in this case having extremely small (or big) examples at the beginning of the dataset can cause the shards to end up with very different sizes. This can indeed be improved though.