Get_dataset_config_names not getting desired output (and DatasetGenerationError)

Hi, referring to the HuggingFace documentation, I’m trying to load the “opus_books” dataset, and I see from its webpage that there is an ‘en-fr’ subset. However, when I run

from datasets import get_dataset_config_names
get_dataset_config_names("opus_books")

I only get

['ca-de']

I don’t understand why ‘en-fr’ and all the other options are not showing here.

And when I tried

load_dataset("opus_books", "en-fr", split='train')

the program gives me output as below:

Downloading and preparing dataset None/ca-de to C:/Users/UserName/.cache/huggingface/datasets/parquet/ca-de-8239290e5e0370f8/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...

And later I get:

TypeError: Couldn't cast array of type
struct<ca: string, en: string>
to
struct<ca: string, de: string>

from

  File "train.py", line 154, in <module>
    train_model(config)
  File "train.py", line 87, in train_model
    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
  File "train.py", line 44, in get_ds
    ds_raw =load_dataset("opus_books", "en-fr", split='train')
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\load.py", line 1815, in load_dataset
    storage_options=storage_options,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 913, in download_and_prepare
    **download_and_prepare_kwargs,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1004, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1768, in _prepare_split
    gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
  File "C:\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1912, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I don’t quite understand where I went wrong. Could you please help me with this?

I’ve also tried

get_dataset_config_names("glue")

and it gives only one item from the complete list

['ax']

The installed version of the package is datasets-2.13.2, which is the highest I can get under my Python 3.7 environment.
I also tried the same code in Google Colab with Python 3.10 and datasets-3.2.0. Everything works fine there.
Can I conclude that it’s a problem with the package version? What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?

What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?

I’m lazy, so I don’t use one myself :sweat_smile:, but I think the general approach is to use a virtual environment such as venv.
It takes up extra disk space for Python and the libraries, but that’s the only drawback, and the benefits are huge, especially when it comes to running AI. You don’t have to worry about library dependencies anymore.
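
In case it helps, here is a minimal sketch of creating such an environment with the standard-library venv module. The function name create_env and the directory name datasets-env are just placeholders for illustration. One caveat: an environment created this way reuses the interpreter that created it, so running this under Python 3.7 still gives a Python 3.7 environment. To get a newer datasets release you would need to run it with a newer interpreter, or create a separate conda environment with a newer Python, since the traceback shows miniconda is already in use.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create a virtual environment at env_dir and return the path to its pyvenv.cfg."""
    # with_pip=False keeps this sketch quick and offline; use with_pip=True
    # when you intend to install packages into the environment.
    venv.create(str(env_dir), with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    # Hypothetical location; any empty directory works.
    target = Path(tempfile.mkdtemp()) / "datasets-env"
    cfg = create_env(target)
    # pyvenv.cfg records which base interpreter the environment was built from.
    print(cfg.read_text())
```

The existing Python 3.7 installation and its packages stay untouched; the new environment only holds whatever is installed into it after activation.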

Hi! You should use a more recent version of datasets together with a more recent version of Python.

Old versions of datasets may not support multi-subset datasets.
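
A quick way to see which side of this an environment falls on is to print the interpreter and library versions. The sketch below is only a diagnostic under stated assumptions: the reference version 3.2.0 is simply the one reported as working earlier in this thread, not a documented minimum, and the helper names version_tuple and is_older_than are made up for illustration.

```python
import re
import sys

# Taken from this thread: 3.2.0 returned the full list of subsets.
# This is NOT a documented minimum, only the version reported to work.
KNOWN_GOOD = "3.2.0"


def version_tuple(version):
    """Turn a release string such as '2.13.2' into (2, 13, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_older_than(installed, reference=KNOWN_GOOD):
    """Return True if the installed version sorts before the reference version."""
    return version_tuple(installed) < version_tuple(reference)


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    try:
        import datasets
    except ImportError:
        print("datasets is not installed in this environment")
    else:
        print("datasets:", datasets.__version__)
        if is_older_than(datasets.__version__):
            print("Older than", KNOWN_GOOD, "- consider upgrading.")
```

With the versions mentioned above, is_older_than("2.13.2") is True, which matches the behaviour described: the older install returned a single config name while the newer one worked.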

Thank you! I tried to run the program on Colab with a newer version of datasets and everything works fine.
