Problem "Bad request" when using datasets.Dataset.push_to_hub()

I am trying to push my audio dataset to the Hugging Face Hub using Dataset.push_to_hub(), but I am running into the following error:

huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-66e44a34-2265d8dd5f0712e1239094bc;329d29d8-f6cd-4121-9cef-3848a280d540)

Bad request:
Your proposed upload is smaller than the minimum allowed size

Some context, in case it is helpful: before pushing my audio dataset to the Hub, I need to do some processing on it. Because of the dataset's size, I cannot process the whole thing at once with Dataset.map(), so what I did is as follows:

  • First, I split my dataset into 25 smaller parts.
  • Next, I applied Dataset.map() to each of these 25 parts in sequence (one part at a time), and saved each part to disk with Dataset.save_to_disk() once it was processed.
  • After all 25 parts were processed and saved, I loaded each of them with datasets.load_from_disk(), concatenated them into one dataset with datasets.concatenate_datasets(), and finally called datasets.Dataset.push_to_hub() on the whole dataset to push it to the Hub (a rough sketch of this workflow is shown below).
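
Roughly, the workflow looks like the sketch below. It is only an illustration of the steps above, not my real code: process_fn, the directory paths, the repo id, and TOKEN are placeholders.

from datasets import load_from_disk, concatenate_datasets

NUM_PARTS = 25
TOKEN = "hf_..."  # placeholder for my Hub token

def process_fn(example):
    # placeholder for my actual audio preprocessing
    return example

# Steps 1 and 2: process each of the 25 parts sequentially and save it to disk
for i in range(NUM_PARTS):
    part = load_from_disk(f"raw_parts/part_{i}")        # placeholder path
    part = part.map(process_fn)
    part.save_to_disk(f"processed_parts/part_{i}")      # placeholder path

# Step 3: reload the processed parts, concatenate them, and push the result
parts = [load_from_disk(f"processed_parts/part_{i}") for i in range(NUM_PARTS)]
dataset = concatenate_datasets(parts)
dataset.push_to_hub("my-username/my-audio-dataset", token=TOKEN)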

Below is my full error.

HTTP Error 500 thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/5f/7d/5f7dbc96f79ad1a3b092e972838e58d6cec745f81d1bd787f85
c37b48b90c8c2/576e4005b5869a14535512c18922e5b14c5a9eb9ce14e328136abc1cb7eb7807?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credent
ial=AKIA2JU7TKAQLC2QXPN7%2F20240913%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240913T154625Z&X-Amz-Expires=86400&X-Amz-Signature=d484963473b954473d6d730a82d7738
b0df651743bda363dcac3202e3110a827&X-Amz-SignedHeaders=host&partNumber=2&uploadId=knKPR0sUudlf6QzR8gPORZB0WwDRC.L_JMSGlFU6N1TVdHz2Om9VHwbQYCECNPhQ0Qhs4VLgSKT9qKvvNE
8ozxIJLmNvXdcuEFevTghmA5Tlbk.0T2XXdQWy_6.cJXug&x-id=UploadPart
Retrying in 1s [Retry 1/5].
Uploading the dataset shards:  59%|███████████████████████████████████████████████████████                                       | 253/432 [58:46<41:35, 13.94s/it]
Traceback (most recent call last):  
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/haons/.local/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/complete_multipart?uploadId=knKPR0sUudlf6QzR8gPORZB0WwDRC.L_JMSGlF
U6N1TVdHz2Om9VHwbQYCECNPhQ0Qhs4VLgSKT9qKvvNE8ozxIJLmNvXdcuEFevTghmA5Tlbk.0T2XXdQWy_6.cJXug&bucket=hf-hub-lfs-us-east-1&prefix=repos%2F5f%2F7d%2F5f7dbc96f79ad1a3b09
2e972838e58d6cec745f81d1bd787f85c37b48b90c8c2&expiration=Sat%2C+14+Sep+2024+15%3A46%3A25+GMT&signature=0a7a2ab7291b25f07e26d920f27340762b86ade8b3ed0d46057da74f9b2a
e6e4

The above exception was the direct cause of the following exception:

Traceback (most recent call last):  
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/haons/.local/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/complete_multipart?uploadId=knKPR0sUudlf6QzR8gPORZB0WwDRC.L_JMSGlF
U6N1TVdHz2Om9VHwbQYCECNPhQ0Qhs4VLgSKT9qKvvNE8ozxIJLmNvXdcuEFevTghmA5Tlbk.0T2XXdQWy_6.cJXug&bucket=hf-hub-lfs-us-east-1&prefix=repos%2F5f%2F7d%2F5f7dbc96f79ad1a3b09
2e972838e58d6cec745f81d1bd787f85c37b48b90c8c2&expiration=Sat%2C+14+Sep+2024+15%3A46%3A25+GMT&signature=0a7a2ab7291b25f07e26d920f27340762b86ade8b3ed0d46057da74f9b2a
e6e4

The above exception was the direct cause of the following exception:

Traceback (most recent call last):  
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 431, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, headers=headers, endpoint=endpoint)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 246, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_url)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 355, in _upload_multi_part
    hf_raise_for_status(completion_res)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-66e45e67-50837d1e4ccce5d6497cf9bd;68743be0-430b-4375-803b-ac20d176835c)

Bad request:
Your proposed upload is smaller than the minimum allowed size

The above exception was the direct cause of the following exception:

Traceback (most recent call last):  
  File "/home4/haons/speaker-verification/push_data_to_huggingface/push_data_vsasv_to_huggingface/src/main.py", line 24, in <module>
    dataset.push_to_hub(huggingface_dataset, token= TOKEN)
  File "/home/haons/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5414, in push_to_hub
    additions, uploaded_size, dataset_nbytes = self._push_parquet_shards_to_hub(
  File "/home/haons/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5262, in _push_parquet_shards_to_hub
    api.preupload_lfs_files(

  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 4317, in preupload_lfs_files
    _upload_lfs_files(
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 441, in _upload_lfs_files
    _wrapped_lfs_upload(filtered_actions[0])
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 433, in _wrapped_lfs_upload
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'data/train-00253-of-00432.parquet' to the Hub.
Error in sys.excepthook:
Traceback (most recent call last):  
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 24, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
    import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'

Original exception was:
Traceback (most recent call last):  
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/haons/.local/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://huggingface.co/api/complete_multipart?uploadId=knKPR0sUudlf6QzR8gPORZB0WwDRC.L_JMSGlF
U6N1TVdHz2Om9VHwbQYCECNPhQ0Qhs4VLgSKT9qKvvNE8ozxIJLmNvXdcuEFevTghmA5Tlbk.0T2XXdQWy_6.cJXug&bucket=hf-hub-lfs-us-east-1&prefix=repos%2F5f%2F7d%2F5f7dbc96f79ad1a3b09
2e972838e58d6cec745f81d1bd787f85c37b48b90c8c2&expiration=Sat%2C+14+Sep+2024+15%3A46%3A25+GMT&signature=0a7a2ab7291b25f07e26d920f27340762b86ade8b3ed0d46057da74f9b2a
e6e4

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 431, in _wrapped_lfs_upload
    lfs_upload(operation=operation, lfs_batch_action=batch_action, headers=headers, endpoint=endpoint)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 246, in lfs_upload
    _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_url)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/lfs.py", line 355, in _upload_multi_part
    hf_raise_for_status(completion_res)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: Root=1-66e45e67-50837d1e4ccce5d6497cf9bd;68743be0-430b-4375-803b-ac20d176835c)

Bad request:
Your proposed upload is smaller than the minimum allowed size

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home4/haons/speaker-verification/push_data_to_huggingface/push_data_vsasv_to_huggingface/src/main.py", line 24, in <module>
    dataset.push_to_hub(huggingface_dataset, token= TOKEN)
  File "/home/haons/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5414, in push_to_hub
    additions, uploaded_size, dataset_nbytes = self._push_parquet_shards_to_hub(
  File "/home/haons/.local/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5262, in _push_parquet_shards_to_hub
    api.preupload_lfs_files(
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 4317, in preupload_lfs_files
    _upload_lfs_files(
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 441, in _upload_lfs_files
    _wrapped_lfs_upload(filtered_actions[0])
  File "/home/haons/.local/lib/python3.8/site-packages/huggingface_hub/_commit_api.py", line 433, in _wrapped_lfs_upload
    raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
RuntimeError: Error while uploading 'data/train-00253-of-00432.parquet' to the Hub.

Thanks in advance for your help. This is my first post on the Hugging Face discussion forum, so if I have made any mistakes, please let me know! Thanks again for your consideration!

I see Your proposed upload is smaller than the minimum allowed size in your logs, and I suspect it is related to how AWS S3 multipart uploads work (see the Stack Overflow question "Amazon S3 - Your proposed upload is smaller than the minimum allowed size"). Could you verify the size of the shard data/train-00253-of-00432.parquet?

Also, let me know which versions of the huggingface_hub and datasets libraries you are using.


Reading the traceback, I also saw ModuleNotFoundError: No module named 'apt_pkg'. Could you start by fixing that one first? If that does not help, please report your findings on the points I mentioned above.


First of all, thank you for your reply. I really appreciate it.

While I was searching for a solution, I re-ran my code without making any changes, just hoping it would work this time. And it did :slight_smile: I don't know, it feels somewhat random? Anyway, it worked, and I pushed my dataset to the Hub successfully.
About your recommendation to check the size of the shard data/train-00253-of-00432.parquet: I am not sure how to check it, because my dataset is originally saved as .arrow files, and when I run Dataset.push_to_hub() the library automatically creates .parquet files from the .arrow files and pushes those to the Hub. So I do not know how to inspect the size (and other information) of the .parquet files. I guess there may be a function that converts the .arrow files to .parquet? If so, I could use it first, save the .parquet files locally, and then check their sizes.
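
Something like the following is what I have in mind, assuming datasets exposes a Dataset.to_parquet() helper (I have not checked that this exists in version 3.0.0, and the path is a placeholder):

import os
from datasets import load_from_disk

# Load one processed part from disk and write it to a local Parquet file
part = load_from_disk("processed_parts/part_0")   # placeholder path
part.to_parquet("part_0.parquet")                 # assumed helper, not verified

# Check the size of the resulting file
size_mb = os.path.getsize("part_0.parquet") / (1024 * 1024)
print(f"part_0.parquet: {size_mb:.2f} MB")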

About the versions: I am currently using huggingface_hub version 0.24.7 and datasets version 3.0.0. I believe these are the latest versions, because I updated them just yesterday.

As for the ModuleNotFoundError: No module named 'apt_pkg' error, it comes from my system python3 installation, and I am still figuring out how to fix it. I am working on my company's server and I am not used to it yet, so it is taking me some time to understand the system.

One final thing: I have a hypothesis. While Dataset.push_to_hub() is running, it looks like each shard corresponds to one .parquet file. If that is true, then one shard could end up smaller than the minimum allowed size (as the Bad request: message says), and that would cause the error. The fix would then be to control either the size of each shard or the number of shards, and the documentation of Dataset.push_to_hub() lists two parameters for exactly that: max_shard_size and num_shards. So I think the error can probably be avoided by setting these parameters appropriately (in my code I did not set them, so the defaults were used). This is just a hypothesis, though; I have not tried it yet. If I hit the same error again in the future, I will try this solution.
If anyone knows the real solution for this error, please comment. And for anyone else looking for a fix: perhaps you can try simply re-running the code and hoping for the best :grin: or try my hypothesis above by setting the two parameters mentioned.
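
For example, something like this (untested; the repo id, token, and values are placeholders, and as far as I understand only one of the two parameters should be passed at a time):

# Option 1: cap the shard size so no shard ends up unexpectedly small
dataset.push_to_hub(
    "my-username/my-audio-dataset",   # placeholder repo id
    token=TOKEN,
    max_shard_size="500MB",
)

# Option 2: fix the number of shards directly (fewer shards means larger shards)
dataset.push_to_hub(
    "my-username/my-audio-dataset",
    token=TOKEN,
    num_shards=100,
)
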

@not-lain thanks again for your help. I hope you get to read my reply!


@shao2011
Thanks for following up. It might also have been a server error: I saw some 500 errors yesterday while browsing the website, and other members reported some on Discord as well.

Happy this worked for you, and I'm wishing you a good one :sparkles:


Thanks! @not-lain


Hi @shao2011, I've also encountered this issue, and I resolved it by setting a smaller num_shards.

