DatasetGenerationError: An error occurred while generating the dataset

I am trying to load the dataset, but it gives me this error.

The code I am using is below:

from datasets import load_dataset
yhavi = load_dataset("yhavinga/ccmatrix", "ur-en")

One thing… at first the dataset was downloading, but Chrome crashed during the download, and now I am not able to download the dataset.

Would it be a problem to delete the cache file?

@lhoestq @mariosasko

Hi! Yes, feel free to delete the cache file (and any .lock files you may find in the cache directory) and try again.


How do I delete the cache file? I have tried running the following code in my Jupyter notebook:

from datasets import Dataset
Dataset.cleanup_cache_files

It does not delete any files, and it still doesn't even if I pass an integer value. How do I delete the cache?


The cache is by default at ~/.cache/huggingface/datasets, you can delete it. (Note that cleanup_cache_files is an instance method: it has to be called with parentheses on a loaded dataset object, e.g. my_dataset.cleanup_cache_files(); referencing Dataset.cleanup_cache_files on the class does nothing.)
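For reference, a minimal sketch of removing the cache directory from Python (this assumes the default location; if the HF_DATASETS_CACHE environment variable is set, the cache lives elsewhere):

```python
import shutil
from pathlib import Path

# Default datasets cache location (may differ if HF_DATASETS_CACHE is set)
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"

# Remove the whole cache directory if it exists; the library will
# re-download everything on the next load_dataset call
if cache_dir.exists():
    shutil.rmtree(cache_dir)
```

After this, retry load_dataset and the download will start from scratch.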


hello,

I am still getting the 'DatasetGenerationError: An error occurred while generating the dataset' error.

I got this while running the following:

data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

If the root cause is the cache at

~/.cache/huggingface/datasets

can you please advise how one locates and deletes it?

Regards

Hi! Our error message is misleading, but the problem is that this pile URL is not reachable. The next release of datasets will raise: FileNotFoundError: Unable to find 'https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst'

I think the only solution is to use the Parquet export, as suggested in How to download data from hugging face that is visible on the data viewer but the files are not available?.

Hi @mariosasko, the datasets.builder.DatasetGenerationError: An error occurred while generating the dataset also occurs while creating issues_dataset as per the HF tutorial Creating your own dataset (Hugging Face NLP Course). The only thing that worked for me was either using streaming or reading the JSONL with pd.read_json and the lines=True argument.

How can we load issues_dataset using the datasets API?

After a bit of experimenting, the fix that worked for me was loading the *.jsonl file with pd.read_json and then converting it into a Dataset using the datasets API.

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file (one JSON object per line) into a DataFrame
df = pd.read_json("datasets-issues.jsonl", lines=True)
df.head()

# Convert the DataFrame into a Dataset
issues_dataset = Dataset.from_pandas(df)
issues_dataset

# Sanity check: shuffle and inspect a few examples
sample = issues_dataset.shuffle(seed=666).select(range(3))
sample[0]