Problem pushing dataset to huggingface

I'm trying to push my dataset to the Hub using dataset.push_to_hub() but get the following error:

Pushing split train to the Hub.

Domain: work

Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [37], line 8
      6 dataset = dataset.train_test_split(test_size=0.2)
      7 domain_datasets[domain] = dataset.remove_columns(["domain", "index_level_0"])
----> 8 dataset.push_to_hub(f"fathyshalab/{domain}")
     10 domain_datasets["work"]

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/dataset_dict.py:1350, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, max_shard_size, shard_size, embed_external_files)
   1348 logger.warning(f"Pushing split {split} to the Hub.")
   1349 # The split=key needs to be removed before merging
-> 1350 repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
   1351     repo_id,
   1352     split=split,
   1353     private=private,
   1354     token=token,
   1355     branch=branch,
   1356     max_shard_size=max_shard_size,
   1357     embed_external_files=embed_external_files,
   1358 )
   1359 total_uploaded_size += uploaded_size
   1360 total_dataset_nbytes += dataset_nbytes

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/arrow_dataset.py:4195, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, max_shard_size, embed_external_files)
   4193     shard.to_parquet(buffer)

[...]

    121     fn_name=fn.__name__, has_token=has_token, kwargs=kwargs
    122 )
--> 124 return fn(*args, **kwargs)

TypeError: upload_file() got an unexpected keyword argument 'identical_ok'

From what I understand, arrow_dataset.py still passes an identical_ok argument that is no longer accepted by the upload_file() function, right?
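For illustration, the mismatch can be reproduced in isolation. This is just a sketch assuming a recent huggingface-hub release where the argument has been removed; the repo id and file name are placeholders and are never actually contacted, since the TypeError is raised before any request is made:

from huggingface_hub import HfApi

api = HfApi()
# Older datasets releases pass identical_ok=True internally. On huggingface_hub
# versions that have removed this parameter, the call fails immediately with
# TypeError: upload_file() got an unexpected keyword argument 'identical_ok'.
api.upload_file(
    path_or_fileobj=b"dummy bytes",
    path_in_repo="data.parquet",
    repo_id="someuser/some-dataset",  # placeholder, never reached
    repo_type="dataset",
    identical_ok=True,
)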



Can you try updating datasets and huggingface-hub?

pip install -U datasets huggingface-hub
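Afterwards you can quickly check which versions actually ended up installed, e.g.:

pip show datasets huggingface-hub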

I had the same issue. Did you resolve it? I am pushing an image dataset with captions.

The same issue is happening to me when pushing the dataset; the same code uploads my tokenizer correctly. Kindly help, details below:

tokenizer.push_to_hub(repo_name)
timit['train'].push_to_hub(repo_name)

  • Name: huggingface-hub Version: 0.11.1
  • Name: transformers Version: 4.25.1
  • Name: datasets Version: 2.8.0

Can you share the full stack trace, please?

Hello @fathyshalab,

I had the same issue when working with an older Common Voice dataset version. After updating to the following versions and switching to a newer Common Voice dataset, the problem was solved. Maybe try this out:

huggingface-hub version 0.11.1
transformers version 4.25.1
datasets version 2.8.0

(and, as in the Common Voice case, not "common_voice" but "mozilla-foundation/common_voice_11_0")
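For example, a minimal sketch of loading the maintained release (the language config and split here are just placeholders; the dataset is gated on the Hub, so you may need to accept its terms and be logged in with a token):

from datasets import load_dataset

# "en" is only an example language config; pick the one you need.
# The dataset is gated, so accepting its terms on the Hub and logging in
# (e.g. via `huggingface-cli login`) may be required.
cv = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train")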

Hope this helps.

@lhoestq
I am facing the same issue and the stack trace is as below

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 medium_datasets.push_to_hub("Kamaljp/medium_articles")

File /opt/conda/lib/python3.10/site-packages/datasets/dataset_dict.py:899, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, shard_size, embed_external_files)
    897 logger.warning(f"Pushing split {split} to the Hub.")
    898 # The split=key needs to be removed before merging
--> 899 repo_id, split, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
    900     repo_id,
    901     split=split,
    902     private=private,
    903     token=token,
    904     branch=branch,
    905     shard_size=shard_size,
    906     embed_external_files=embed_external_files,
    907 )
    908 total_uploaded_size += uploaded_size
    909 total_dataset_nbytes += dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py:3474, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, shard_size, embed_external_files)
   3472     shard.to_parquet(buffer)
   3473     uploaded_size += buffer.tell()
-> 3474     _retry(
   3475         api.upload_file,
   3476         func_kwargs=dict(
   3477             path_or_fileobj=buffer.getvalue(),
   3478             path_in_repo=path_in_repo(index),
   3479             repo_id=repo_id,
   3480             token=token,
   3481             repo_type="dataset",
   3482             revision=branch,
   3483             identical_ok=True,
   3484         ),
   3485         exceptions=HTTPError,
   3486         status_codes=[504],
   3487         base_wait_time=2.0,
   3488         max_retries=5,
   3489         max_wait_time=20.0,
   3490     )
   3491 return repo_id, split, uploaded_size, dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/utils/file_utils.py:330, in _retry(func, func_args, func_kwargs, exceptions, status_codes, max_retries, base_wait_time, max_wait_time)
    328 while True:
    329     try:
--> 330         return func(*func_args, **func_kwargs)
    331     except exceptions as err:
    332         if retry >= max_retries or (status_codes and err.response.status_code not in status_codes):

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:120, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    117 if check_use_auth_token:
    118     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 120 return fn(*args, **kwargs)

TypeError: HfApi.upload_file() got an unexpected keyword argument 'identical_ok'

I am trying to push the data from a Kaggle notebook to the Hugging Face Hub, and the above error occurs. Internet access is enabled in the notebook.
When I push from Colab notebooks, it works without any issue.

Any direction would be appreciated. Thanks

Have you tried updating datasets and huggingface-hub as suggested earlier in the thread?
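If the upgrade has to happen inside the Kaggle notebook itself, something along these lines should work (a sketch; the key point is restarting the kernel afterwards so the upgraded packages are actually imported):

%pip install -U datasets huggingface-hub
# Then restart the notebook kernel before calling push_to_hub again;
# otherwise the old, already-imported versions keep being used.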

I am facing the same issue. Here are the details

Environment: Kaggle Notebook

Libraries: pip install datasets transformers==4.28.1

Stack Trace:

TypeError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 scraped_ds.push_to_hub("my_data_repo")

File /opt/conda/lib/python3.10/site-packages/datasets/dataset_dict.py:899, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, shard_size, embed_external_files)
    897 logger.warning(f"Pushing split {split} to the Hub.")
    898 # The split=key needs to be removed before merging
--> 899 repo_id, split, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
    900     repo_id,
    901     split=split,
    902     private=private,
    903     token=token,
    904     branch=branch,
    905     shard_size=shard_size,
    906     embed_external_files=embed_external_files,
    907 )
    908 total_uploaded_size += uploaded_size
    909 total_dataset_nbytes += dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py:3474, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, shard_size, embed_external_files)
   3472     shard.to_parquet(buffer)
   3473     uploaded_size += buffer.tell()
-> 3474     _retry(
   3475         api.upload_file,
   3476         func_kwargs=dict(
   3477             path_or_fileobj=buffer.getvalue(),
   3478             path_in_repo=path_in_repo(index),
   3479             repo_id=repo_id,
   3480             token=token,
   3481             repo_type="dataset",
   3482             revision=branch,
   3483             identical_ok=True,
   3484         ),
   3485         exceptions=HTTPError,
   3486         status_codes=[504],
   3487         base_wait_time=2.0,
   3488         max_retries=5,
   3489         max_wait_time=20.0,
   3490     )
   3491 return repo_id, split, uploaded_size, dataset_nbytes

File /opt/conda/lib/python3.10/site-packages/datasets/utils/file_utils.py:330, in _retry(func, func_args, func_kwargs, exceptions, status_codes, max_retries, base_wait_time, max_wait_time)
    328 while True:
    329     try:
--> 330         return func(*func_args, **func_kwargs)
    331     except exceptions as err:
    332         if retry >= max_retries or (status_codes and err.response.status_code not in status_codes):

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:118, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    115 if check_use_auth_token:
    116     kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
--> 118 return fn(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py:826, in future_compatible.<locals>._inner(self, *args, **kwargs)
    823     return self.run_as_future(fn, self, *args, **kwargs)
    825 # Otherwise, call the function normally
--> 826 return fn(self, *args, **kwargs)

TypeError: HfApi.upload_file() got an unexpected keyword argument 'identical_ok'

Background: I have a repo with the same name, and I have already pushed the raw data to it.

I thought the above error occurs when pushing data to a repo that already contains data identical to the new data being pushed.

I then changed the repo_name. Even after that, the error persists.

Kindly let me know what other details you need to help resolve the issue.

You can fix this by updating the datasets installation in the environment to the newest version (2.13).

The issue still persists even after updating to version 2.13.

If you still get the same error, restart the runtime for the version update to take effect.
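A quick way to confirm that the restarted runtime really picked up the update (just a sanity check):

import datasets
import huggingface_hub

# After the restart this should report at least datasets 2.13 and a recent
# huggingface_hub release.
print(datasets.__version__, huggingface_hub.__version__)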