Get_dataset_config_names not getting desired output (and DatasetGenerationError)

SleepTight · December 10, 2024, 6:28pm

Hi, referring to the Hugging Face documentation, I’m trying to load the “opus_books” dataset, and I see from its webpage that there is an ‘en-fr’ subset. However, when I run

from datasets import get_dataset_config_names

get_dataset_config_names("opus_books")

I only get

['ca-de']

I don’t understand why ‘en-fr’ and all the other options are not showing here.

And when I tried

load_dataset("opus_books", "en-fr", split='train')

the program gives me output as below:

Downloading and preparing dataset None/ca-de to C:/Users/UserName/.cache/huggingface/datasets/parquet/ca-de-8239290e5e0370f8/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...

And later I get:

TypeError: Couldn't cast array of type
struct<ca: string, en: string>
to
struct<ca: string, de: string>

from

  File "train.py", line 154, in <module>
    train_model(config)
  File "train.py", line 87, in train_model
    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
  File "train.py", line 44, in get_ds
    ds_raw =load_dataset("opus_books", "en-fr", split='train')
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\load.py", line 1815, in load_dataset
    storage_options=storage_options,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 913, in download_and_prepare
    **download_and_prepare_kwargs,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1004, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1768, in _prepare_split
    gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1912, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I don’t quite understand where I went wrong. Could you please help me with this?

SleepTight · December 11, 2024, 3:23am

I’ve also tried

get_dataset_config_names("glue")

and it gives only one item from the complete list

['ax']

I checked, and the installed package version is datasets-2.13.2, which is the highest I can get in my Python 3.7 environment.
I also tried the same code in Google Colab with Python 3.10 and datasets-3.2.0. Everything works fine there.
Can I conclude that it’s a problem with the package version? What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?
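
To compare the two environments directly, a small sketch like the one below can be run in each of them. The helper name describe_environment is made up for illustration; it only reads sys.version_info and datasets.__version__, and it avoids importlib.metadata so that it also runs on Python 3.7.

```python
import sys


def describe_environment():
    """Return the Python version and the installed datasets version (or None)."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        import datasets  # may be missing in a bare environment
        info["datasets"] = datasets.__version__
    except ImportError:
        info["datasets"] = None
    return info


print(describe_environment())
```

If the two printed dictionaries differ only in their versions, that supports the idea that the version gap is the cause, though it does not prove it.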

John6666 · December 11, 2024, 5:06am

What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?

I’m lazy, so I don’t use it, but I think the general method is to use a virtual environment such as venv.
It eats up extra HDD space for Python and libraries, but that’s the only drawback, and the benefits are huge, especially when it comes to running AI. You don’t have to worry about library dependencies anymore.
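
As a rough sketch of that suggestion, the standard-library venv module can create an isolated environment from Python itself. The helper name make_env and the directory name in the usage comment are hypothetical. One caveat: venv reuses the interpreter that runs it, so an environment created from Python 3.7 is still Python 3.7. To get a newer datasets release, this would need to be run with a separately installed newer Python.

```python
import os
import venv


def make_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir using the running interpreter.

    Returns the directory holding the environment's executables.
    """
    venv.create(env_dir, with_pip=with_pip)
    # Executables live in Scripts/ on Windows and bin/ elsewhere.
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    return os.path.join(env_dir, bin_dir)


# Example (hypothetical directory name), run with the newer interpreter:
# bin_dir = make_env("datasets-env")
# Then install packages with the pip executable found in bin_dir.
```

The existing Python 3.7 environment stays untouched, so the other packages that depend on it keep working.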

lhoestq · December 11, 2024, 3:35pm

Hi! You should use a more recent version of datasets with a more recent version of Python.

Old versions of datasets may not support multi-subset datasets.
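
One way to make this failure easier to spot is to check the listed subsets before loading. The sketch below is only an illustration: require_config and load_subset are made-up helper names, while get_dataset_config_names and load_dataset are the same calls used earlier in this thread.

```python
def require_config(available, wanted):
    """Raise a readable error if `wanted` is not among the listed config names."""
    if wanted not in available:
        raise ValueError(
            "Subset {!r} not listed; got {!r}. The installed datasets "
            "version may be too old for this dataset.".format(wanted, available)
        )
    return wanted


def load_subset(name, subset, split="train"):
    """Load one subset, failing early if the library does not list it."""
    # Imported lazily so require_config works without datasets installed.
    from datasets import get_dataset_config_names, load_dataset

    require_config(get_dataset_config_names(name), subset)
    return load_dataset(name, subset, split=split)


# Example (needs network access and a recent datasets release):
# ds_raw = load_subset("opus_books", "en-fr")
```

With an old release that lists only ['ca-de'], this stops with a ValueError naming the missing subset instead of failing later with the cast error.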

SleepTight · December 11, 2024, 5:43pm

Thank you! I tried to run the program on Colab with a newer version of datasets and everything works fine.
