I uploaded a dataset through the Hugging Face web interface, but I can't load it!

raygx · May 13, 2023, 8:08am

Dataset: raygx/NepaliTextCorpus
All packages and modules are up to date.

ERROR Trace Below
Downloading and preparing dataset json/raygx--NepaliTextCorpus to /root/.cache/huggingface/datasets/raygx___json/raygx--NepaliTextCorpus-172878a4edc47604/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...

Downloading data files: 100%
1/1 [00:00<00:00, 69.76it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 58.73it/s]

ValueError Traceback (most recent call last)
File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:1875, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1868 writer = writer_class(
1869 features=writer._features,
1870 path=fpath.replace("SSSSS", f"{shard_id:05d}").replace("JJJJJ", f"{job_id:05d}"),
(…)
1873 embed_local_files=embed_local_files,
1874 )
→ 1875 writer.write_table(table)
1876 num_examples_progress_update += len(table)

File /usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py:568, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
567 pa_table = pa_table.combine_chunks()
→ 568 pa_table = table_cast(pa_table, self._schema)
569 if self.embed_local_files:

File /usr/local/lib/python3.8/dist-packages/datasets/table.py:2290, in table_cast(table, schema)
2289 if table.schema != schema:
→ 2290 return cast_table_to_schema(table, schema)
2291 elif table.schema.metadata != schema.metadata:

File /usr/local/lib/python3.8/dist-packages/datasets/table.py:2248, in cast_table_to_schema(table, schema)
2247 if sorted(table.column_names) != sorted(features):
→ 2248 raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
2249 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]

ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
child 0, item: struct<filename: string>
child 0, filename: string
_fingerprint: string
_format_columns: null
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_output_all_columns: bool
_split: null
to
{'builder_name': Value(dtype='null', id=None), 'citation': Value(dtype='string', id=None), 'config_name': Value(dtype='null', id=None), 'dataset_size': Value(dtype='null', id=None), 'description': Value(dtype='string', id=None), 'download_checksums': Value(dtype='null', id=None), 'download_size': Value(dtype='null', id=None), 'features': {'text': {'dtype': Value(dtype='string', id=None), 'id': Value(dtype='null', id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'post_processed': Value(dtype='null', id=None), 'post_processing_size': Value(dtype='null', id=None), 'size_in_bytes': Value(dtype='null', id=None), 'splits': Value(dtype='null', id=None), 'supervised_keys': Value(dtype='null', id=None), 'task_templates': Value(dtype='null', id=None), 'version': Value(dtype='null', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

DatasetGenerationError Traceback (most recent call last)
File :4

File /usr/local/lib/python3.8/dist-packages/datasets/load.py:1791, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1788 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
1790 # Download and prepare data
→ 1791 builder_instance.download_and_prepare(
1792 download_config=download_config,
1793 download_mode=download_mode,
1794 verification_mode=verification_mode,
1795 try_from_hf_gcs=try_from_hf_gcs,
1796 num_proc=num_proc,
1797 storage_options=storage_options,
1798 )
1800 # Build dataset for splits
1801 keep_in_memory = (
1802 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1803 )

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:891, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
889 if num_proc is not None:
890 prepare_split_kwargs["num_proc"] = num_proc
→ 891 self._download_and_prepare(
892 dl_manager=dl_manager,
893 verification_mode=verification_mode,
894 **prepare_split_kwargs,
895 **download_and_prepare_kwargs,
896 )
897 # Sync info
898 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:986, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
982 split_dict.add(split_generator.split_info)
984 try:
985 # Prepare split will record examples associated to the split
→ 986 self._prepare_split(split_generator, **prepare_split_kwargs)
987 except OSError as e:
988 raise OSError(
989 "Cannot find data file. "
990 + (self.manual_download_instructions or "")
991 + "\nOriginal error:\n"
992 + str(e)
993 ) from None

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:1748, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
1746 job_id = 0
1747 with pbar:
→ 1748 for job_id, done, content in self._prepare_split_single(
1749 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
1750 ):
1751 if done:
1752 result = content

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:1893, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1891 if isinstance(e, SchemaInferenceError) and e.context is not None:
1892 e = e.context
→ 1893 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1895 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

mariosasko · May 13, 2023, 1:34pm

Hi! You should use push_to_hub to upload the dataset, not the files created by save_to_disk.
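
For context, here is a minimal sketch of that workflow. The column names in the error above (_data_files, _fingerprint, _split, and so on) look like the keys of the state.json file that save_to_disk writes, which suggests the loader was parsing the saved metadata files as if they were data. The helper name looks_like_save_to_disk_dir and the local folder path are assumptions made for illustration and are not part of the datasets library; load_from_disk and push_to_hub are the library calls.

```python
from pathlib import Path

# Metadata files that Dataset.save_to_disk writes next to the Arrow shards.
# (A DatasetDict additionally has one such folder per split.)
SAVE_TO_DISK_MARKERS = {"state.json", "dataset_info.json"}


def looks_like_save_to_disk_dir(folder):
    """Return True if `folder` holds the metadata files written by save_to_disk.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return False
    names = {p.name for p in folder.iterdir() if p.is_file()}
    return SAVE_TO_DISK_MARKERS <= names


def push_saved_dataset(folder, repo_id):
    """Reload a save_to_disk folder and upload it with push_to_hub."""
    # Imported lazily so the check above works without datasets installed.
    from datasets import load_from_disk

    dataset = load_from_disk(folder)
    # Needs network access and a logged-in account with write permission.
    dataset.push_to_hub(repo_id)


# Example call (not executed here; the local path is an assumption):
# push_saved_dataset("path/to/saved_dataset", "raygx/NepaliTextCorpus")
```

After a successful push, load_dataset("raygx/NepaliTextCorpus") should pick up the files that push_to_hub uploaded. It may also be worth removing the save_to_disk files that were uploaded through the web interface, so the loader does not see them.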

raygx · May 14, 2023, 1:15am

Thank you for the response.
I tried to use the push_to_hub() API, but it failed.
I tried from a Kaggle notebook and hit an issue, something regarding the version, even though I ran !pip install transformers, datasets

Then I tried from my local machine and got this error:

HfHubHTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/datasets/raygx/NepaliCorpus/commit/main (Request ID: Root=1-6460363c-68ba62ac562ca61a5260d4d4)
Forbidden: pass create_pr=1 as a query parameter to create a Pull Request

Any idea?

mariosasko · May 14, 2023, 1:05pm

You need to log in to fix this error.
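
One possible way to log in from Python is sketched below, using the login function from huggingface_hub. The helper names resolve_token and login_to_hub, and the choice of reading the token from an HF_TOKEN environment variable, are assumptions for this example. The token needs write permission for push_to_hub to succeed.

```python
import os


def resolve_token(explicit_token=None, environ=None):
    """Pick a token: an explicit argument wins, then the HF_TOKEN variable.

    Hypothetical helper; HF_TOKEN is the variable name assumed in this sketch.
    """
    environ = os.environ if environ is None else environ
    return explicit_token or environ.get("HF_TOKEN")


def login_to_hub(explicit_token=None):
    """Authenticate the current session with the Hugging Face Hub."""
    # Imported lazily so resolve_token can be used without huggingface_hub.
    from huggingface_hub import login

    token = resolve_token(explicit_token)
    if token:
        login(token=token)
    else:
        login()  # falls back to an interactive prompt for the token


# Example (not executed here; needs network access):
# login_to_hub()
# dataset.push_to_hub("raygx/NepaliTextCorpus")
```

From a terminal, running huggingface-cli login does the same thing; in a notebook, huggingface_hub also provides notebook_login().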
