DatasetGenerationError: An error occurred while generating the dataset

I am trying to load the dataset, but it gives me this error.

The code I am using is below:

from datasets import load_dataset
yhavi = load_dataset("yhavinga/ccmatrix", "ur-en")

One thing… at first the dataset was downloading, but Chrome crashed during the download, and now I am not able to download the dataset.

Would it be a problem to delete the cache file?

@lhoestq @mariosasko

Hi! Yes, feel free to delete the cache file (and any .lock files you may find in the cache directory) and try again.


How do I delete the cache file? I have tried running the following code in my Jupyter notebook:

from datasets import Dataset
Dataset.cleanup_cache_files

It does not delete any files, and it still doesn't even if I pass an integer value. How do I delete the cache?


The cache is by default at ~/.cache/huggingface/datasets, you can delete it. (Note that cleanup_cache_files is an instance method: it has to be called with parentheses on a loaded dataset object, e.g. my_dataset.cleanup_cache_files(); referencing Dataset.cleanup_cache_files on the class does nothing.)
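For reference, a minimal sketch of removing the cache directory from Python (this assumes the default location; if the HF_DATASETS_CACHE environment variable is set, the cache lives elsewhere):

```python
import shutil
from pathlib import Path

# Default datasets cache location (may differ if HF_DATASETS_CACHE is set)
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"

# Remove the whole cache directory if it exists; the library will
# re-download everything on the next load_dataset call
if cache_dir.exists():
    shutil.rmtree(cache_dir)
```

After this, retry load_dataset and the download will start from scratch.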


hello,

I am still getting the 'DatasetGenerationError: An error occurred while generating the dataset' error.

I got this while running the following:

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

If the root cause is the cache at

~/.cache/huggingface/datasets

can you please advise how one locates and deletes it?

Regards

Hi! Our error message is misleading, but the problem is that this pile URL is not reachable. The next release of datasets will raise: FileNotFoundError: Unable to find 'https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst'

I think the only solution is to use the Parquet export, as suggested in How to download data from hugging face that is visible on the data viewer but the files are not available?.

Hi @mariosasko, the datasets.builder.DatasetGenerationError: An error occurred while generating the dataset also occurs while creating issues_dataset as per the HF tutorial Creating your own dataset (Hugging Face NLP Course). The only thing that worked for me was either using streaming or reading the JSONL with pd.read_json and the lines=True argument.

How can we load issues_dataset using the datasets API?

After a bit of experimenting, the fix that worked for me was loading the *.jsonl file with pd.read_json and then converting it into a Dataset using the datasets API.

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file (one JSON object per line) into a DataFrame
df = pd.read_json("datasets-issues.jsonl", lines=True)
df.head()

# Convert the DataFrame into a Dataset
issues_dataset = Dataset.from_pandas(df)
issues_dataset

# Sanity check: shuffle and inspect a few examples
sample = issues_dataset.shuffle(seed=666).select(range(3))
sample[0]