Problem pushing dataset to huggingface

I'm trying to push my dataset to the Hub using dataset.push_to_hub() but get the following error:

Pushing split train to the Hub.

Domain: work

Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [37], line 8
      6 dataset = dataset.train_test_split(test_size=0.2)
      7 domain_datasets[domain] = dataset.remove_columns(["domain", "index_level_0"])
----> 8 dataset.push_to_hub(f"fathyshalab/{domain}")
     10 domain_datasets["work"]

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/dataset_dict.py:1350, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, max_shard_size, shard_size, embed_external_files)
   1348 logger.warning(f"Pushing split {split} to the Hub.")
   1349 # The split=key needs to be removed before merging
-> 1350 repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
   1351     repo_id,
   1352     split=split,
   1353     private=private,
   1354     token=token,
   1355     branch=branch,
   1356     max_shard_size=max_shard_size,
   1357     embed_external_files=embed_external_files,
   1358 )
   1359 total_uploaded_size += uploaded_size
   1360 total_dataset_nbytes += dataset_nbytes

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/arrow_dataset.py:4195, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, max_shard_size, embed_external_files)
   4193     shard.to_parquet(buffer)

[...]

    121     fn_name=fn.__name__, has_token=has_token, kwargs=kwargs
    122 )
--> 124 return fn(*args, **kwargs)

TypeError: upload_file() got an unexpected keyword argument 'identical_ok'

From what I understand, arrow_dataset.py still passes an identical_ok argument that is no longer accepted by the upload_file() function, right?
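For illustration, the mismatch can be reproduced in isolation. This is just a sketch assuming a recent huggingface-hub release where the argument has been removed; the repo id and file name are placeholders and are never actually contacted, since the TypeError is raised before any request is made:

from huggingface_hub import HfApi

api = HfApi()
# Older datasets releases pass identical_ok=True internally. On huggingface_hub
# versions that have removed this parameter, the call fails immediately with
# TypeError: upload_file() got an unexpected keyword argument 'identical_ok'.
api.upload_file(
    path_or_fileobj=b"dummy bytes",
    path_in_repo="data.parquet",
    repo_id="someuser/some-dataset",  # placeholder, never reached
    repo_type="dataset",
    identical_ok=True,
)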



Can you try updating datasets and huggingface-hub?

pip install -U datasets huggingface-hub
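Afterwards you can quickly check which versions actually ended up installed, e.g.:

pip show datasets huggingface-hub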

I had the same issue. Did you resolve it? I am pushing an image dataset with captions.

The same issue is happening to me when pushing the dataset; the same code uploads my tokenizer correctly. Kindly help, details below:

tokenizer.push_to_hub(repo_name)
timit['train'].push_to_hub(repo_name)

  • Name: huggingface-hub Version: 0.11.1
  • Name: transformers Version: 4.25.1
  • Name: datasets Version: 2.8.0

Can you share the full stack trace, please?

Hello @fathyshalab,

I had the same issue when working with an older Common Voice dataset version. After updating to the following versions and switching to a newer Common Voice dataset, the problem was solved. Maybe try this out:

huggingface-hub version 0.11.1
transformers version 4.25.1
datasets version 2.8.0

(and, as in the Common Voice case, not "common_voice" but "mozilla-foundation/common_voice_11_0")
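For example, a minimal sketch of loading the maintained release (the language config and split here are just placeholders; the dataset is gated on the Hub, so you may need to accept its terms and be logged in with a token):

from datasets import load_dataset

# "en" is only an example language config; pick the one you need.
# The dataset is gated, so accepting its terms on the Hub and logging in
# (e.g. via `huggingface-cli login`) may be required.
cv = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train")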

Hope this helps.

@lhoestq
I am facing the same issue and the stack trace is as below

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 medium_datasets.push_to_hub("Kamaljp/medium_articles")

File /opt/conda/lib/python3.10/site-packages/datasets/dataset_dict.py:899, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, shard_size, embed_external_files)
    897 logger.warning(f"Pushing split {split} to the Hub.")
    898 # The split=key needs to be removed before merging
--> 899 repo_id, split, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
    900     repo_id,
    901     split=split,
    902     private=private,
    903     token=token,
    904     branch=branch,
    905     shard_size=shard_size,
    906     embed_external_files=embed_external_files,
    907 )
    908 total_uploaded_size += uploaded_size
    909 total_dataset_nbytes += dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py:3474, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, shard_size, embed_external_files)
   3472     shard.to_parquet(buffer)
   3473     uploaded_size += buffer.tell()
-> 3474     _retry(
   3475         api.upload_file,
   3476         func_kwargs=dict(
   3477             path_or_fileobj=buffer.getvalue(),
   3478             path_in_repo=path_in_repo(index),
   3479             repo_id=repo_id,
   3480             token=token,
   3481             repo_type="dataset",
   3482             revision=branch,
   3483             identical_ok=True,
   3484         ),
   3485         exceptions=HTTPError,
   3486         status_codes=[504],
   3487         base_wait_time=2.0,
   3488         max_retries=5,
   3489         max_wait_time=20.0,
   3490     )
   3491 return repo_id, split, uploaded_size, dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/utils/file_utils.py:330, in _retry(func, func_args, func_kwargs, exceptions, status_codes, max_retries, base_wait_time, max_wait_time)
    328 while True:
    329     try:
--> 330         return func(*func_args, **func_kwargs)
    331     except exceptions as err:
    332         if retry >= max_retries or (status_codes and err.response.status_code not in status_codes):

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:120, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    117 if check_use_auth_token:
    118     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 120 return fn(*args, **kwargs)

TypeError: HfApi.upload_file() got an unexpected keyword argument 'identical_ok'

I am trying to push the data from a Kaggle notebook to the Hugging Face Hub, and the above error occurs. Internet access is enabled in the notebook.
When I push from Colab notebooks, it works without any issue.

Any direction would be appreciated. Thanks

Have you tried updating datasets and huggingface-hub as suggested earlier in the thread?
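If the upgrade has to happen inside the Kaggle notebook itself, something along these lines should work (a sketch; the key point is restarting the kernel afterwards so the upgraded packages are actually imported):

%pip install -U datasets huggingface-hub
# Then restart the notebook kernel before calling push_to_hub again;
# otherwise the old, already-imported versions keep being used.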

I am facing the same issue. Here are the details

Environment: Kaggle Notebook

Libraries: pip install datasets transformers==4.28.1

Stack Trace:

TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 scraped_ds.push_to_hub("my_data_repo")

File /opt/conda/lib/python3.10/site-packages/datasets/dataset_dict.py:899, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, shard_size, embed_external_files)
    897 logger.warning(f"Pushing split {split} to the Hub.")
    898 # The split=key needs to be removed before merging
--> 899 repo_id, split, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
    900     repo_id,
    901     split=split,
    902     private=private,
    903     token=token,
    904     branch=branch,
    905     shard_size=shard_size,
    906     embed_external_files=embed_external_files,
    907 )
    908 total_uploaded_size += uploaded_size
    909 total_dataset_nbytes += dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py:3474, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, shard_size, embed_external_files)
   3472     shard.to_parquet(buffer)
   3473     uploaded_size += buffer.tell()
-> 3474     _retry(
   3475         api.upload_file,
   3476         func_kwargs=dict(
   3477             path_or_fileobj=buffer.getvalue(),
   3478             path_in_repo=path_in_repo(index),
   3479             repo_id=repo_id,
   3480             token=token,
   3481             repo_type="dataset",
   3482             revision=branch,
   3483             identical_ok=True,
   3484         ),
   3485         exceptions=HTTPError,
   3486         status_codes=[504],
   3487         base_wait_time=2.0,
   3488         max_retries=5,
   3489         max_wait_time=20.0,
   3490     )
   3491 return repo_id, split, uploaded_size, dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/utils/file_utils.py:330, in _retry(func, func_args, func_kwargs, exceptions, status_codes, max_retries, base_wait_time, max_wait_time)
    328 while True:
    329     try:
--> 330         return func(*func_args, **func_kwargs)
    331     except exceptions as err:
    332         if retry >= max_retries or (status_codes and err.response.status_code not in status_codes):

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:118, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    115 if check_use_auth_token:
    116     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 118 return fn(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py:826, in future_compatible.<locals>._inner(self, *args, **kwargs)
    823     return self.run_as_future(fn, self, *args, **kwargs)
    825 # Otherwise, call the function normally
--> 826 return fn(self, *args, **kwargs)

TypeError: HfApi.upload_file() got an unexpected keyword argument 'identical_ok'

Background: I have a repo with the same name, and I have already pushed the raw data to it.

I thought the above error occurs when pushing data to a repo that already contains data identical to the new data being pushed.

I then changed the repo_name. Even after that, the error persists.

Kindly let me know what other details you need to help resolve the issue.

You can fix this by updating the datasets installation in the environment to the newest version (2.13).

The issue still persists even after updating to version 2.13.

If you still get the same error, restart the runtime for the version update to take effect.
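A quick way to confirm that the restarted runtime really picked up the update (just a sanity check):

import datasets
import huggingface_hub

# After the restart this should report at least datasets 2.13 and a recent
# huggingface_hub release.
print(datasets.__version__, huggingface_hub.__version__)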