Problem pushing dataset to huggingface

I’m trying to push my dataset to the Hub using dataset.push_to_hub() but get the following error:

```
Pushing split train to the Hub.
Domain: work
Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [37], line 8
      6 dataset = dataset.train_test_split(test_size=0.2)
      7 domain_datasets[domain] = dataset.remove_columns(["domain", "__index_level_0__"])
----> 8 dataset.push_to_hub(f"fathyshalab/{domain}")
     10 domain_datasets["work"]

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/dataset_dict.py:1350, in DatasetDict.push_to_hub(self, repo_id, private, token, branch, max_shard_size, shard_size, embed_external_files)
   1348 logger.warning(f"Pushing split {split} to the Hub.")
   1349 # The split=key needs to be removed before merging
-> 1350 repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
   1351     repo_id,
   1352     split=split,
   1353     private=private,
   1354     token=token,
   1355     branch=branch,
   1356     max_shard_size=max_shard_size,
   1357     embed_external_files=embed_external_files,
   1358 )
   1359 total_uploaded_size += uploaded_size
   1360 total_dataset_nbytes += dataset_nbytes

File ~/.conda/envs/baselines-transformers/lib/python3.9/site-packages/datasets/arrow_dataset.py:4195, in Dataset._push_parquet_shards_to_hub(self, repo_id, split, private, token, branch, max_shard_size, embed_external_files)
   4193 shard.to_parquet(buffer)
...
    121     fn_name=fn.__name__, has_token=has_token, kwargs=kwargs
    122 )
--> 124 return fn(*args, **kwargs)

TypeError: upload_file() got an unexpected keyword argument 'identical_ok'
```

What I understand is that arrow_dataset.py still passes an identical_ok argument that is no longer accepted by the upload_file() function, right?
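For what it’s worth, here is a quick diagnostic sketch I used to confirm the mismatch against the installed huggingface_hub (not part of my pipeline):

```python
import inspect
from huggingface_hub import HfApi

# Check whether the installed huggingface_hub's upload_file still accepts
# the identical_ok keyword that datasets is passing; on recent
# huggingface_hub releases the parameter has been removed, which matches
# the TypeError above.
print("identical_ok" in inspect.signature(HfApi.upload_file).parameters)
```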



Can you try updating datasets and huggingface-hub?

pip install -U datasets huggingface-hub
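After the upgrade, you can double-check the installed versions before retrying the push, for example:

```python
from importlib.metadata import version

# The identical_ok error typically comes from an old datasets release
# combined with a newer huggingface-hub, so verify both are up to date.
print("datasets:", version("datasets"))
print("huggingface-hub:", version("huggingface-hub"))
```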

I had the same issue. Did you resolve it? I am pushing an image dataset with captions.

The same issue is happening to me when pushing the dataset; the same code uploads my tokenizer correctly. Kindly help, details below:

tokenizer.push_to_hub(repo_name)
timit['train'].push_to_hub(repo_name)

  • huggingface-hub: 0.11.1
  • transformers: 4.25.1
  • datasets: 2.8.0

Can you share the full stack trace, please?

Hello @fathyshalab,

I had the same issue when working with an older Common Voice dataset version. After updating to the following versions and switching to a newer Common Voice dataset, the problem was solved. Maybe try this out:

huggingface-hub version 0.11.1
transformers version 4.25.1
datasets version 2.8.0

(and, in the Common Voice case, use "mozilla-foundation/common_voice_11_0" rather than "common_voice")
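As a rough sketch of what worked for me after the updates (the "en" config and the target repo name below are placeholders):

```python
from datasets import load_dataset

# Load the newer Common Voice release from the mozilla-foundation namespace.
# This dataset is gated, so you need to accept its terms on the Hub and be
# logged in (huggingface-cli login).
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="train", use_auth_token=True
)

# Push the split to the Hub; the repo name is a placeholder.
common_voice.push_to_hub("your-username/common_voice_11_0_en")
```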

Hope this helps.