Get_dataset_config_names not getting desired output (and DatasetGenerationError)

Hi, referring to the HuggingFace documentation, I’m trying to load the “opus_books” dataset, and I see from its webpage that there is an ‘en-fr’ subset. However, when I run

get_dataset_config_names("opus_books")

I only get

['ca-de']

I don’t understand why ‘en-fr’ and all the other options are not showing here.

And when I tried

load_dataset("opus_books", "en-fr", split='train')

the program gives me output as below:

Downloading and preparing dataset None/ca-de to C:/Users/UserName/.cache/huggingface/datasets/parquet/ca-de-8239290e5e0370f8/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...

And later I get:

TypeError: Couldn't cast array of type
struct<ca: string, en: string>
to
struct<ca: string, de: string>

from

  File "train.py", line 154, in <module>
    train_model(config)
  File "train.py", line 87, in train_model
    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
  File "train.py", line 44, in get_ds
    ds_raw =load_dataset("opus_books", "en-fr", split='train')
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\load.py", line 1815, in load_dataset
    storage_options=storage_options,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 913, in download_and_prepare
    **download_and_prepare_kwargs,
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1004, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1768, in _prepare_split
    gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
  File "C:\UserName\miniconda3\envs\base\lib\site-packages\datasets\builder.py", line 1912, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I don’t quite understand where I went wrong. Could you please help me with this?

I’ve also tried

get_dataset_config_names("glue")

and it gives only one item from the complete list

['ax']

I checked that the installed package version is datasets-2.13.2, which is the highest I can get in my Python 3.7 environment.
I also tried the same code in Google Colab with Python 3.10 and datasets-3.2.0. Everything works fine there.
Can I conclude that it’s a problem with the package version? What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?

What should I do if I want to use the package while keeping my Python 3.7 environment for compatibility with other packages I have?

I’m lazy, so I don’t use one myself :sweat_smile:, but I think the general method is to use a virtual environment such as venv.
It takes up extra disk space for Python and the libraries, but that’s the only drawback, and the benefits are huge, especially when it comes to running AI. You no longer have to worry about library dependencies conflicting between projects.
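
As a rough illustration of the venv suggestion, here is a minimal sketch using Python's standard `venv` module. The helper name `create_env` and the directory name are made up for this example. A virtual environment uses the interpreter that creates it, so this assumes a newer Python is installed alongside the 3.7 one and is the one running the script.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment at env_dir and return its path.

    The environment inherits the Python version of the interpreter
    running this function, so call it from the newer Python rather
    than from the 3.7 installation.
    """
    env_path = Path(env_dir)
    # venv.create builds the directory layout and, if requested, bootstraps pip.
    venv.create(str(env_path), with_pip=with_pip)
    return env_path
```

Calling `create_env("datasets-env")` (a placeholder name) should leave the existing 3.7 environment untouched. A newer `datasets` can then be installed with the new environment's own pip.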

Hi! You should use a more recent version of datasets with a more recent version of Python.

Old versions of datasets may not support multi-subset datasets.
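
A quick way to see which side of the line an environment is on is to compare the installed version against one that is known to work. The sketch below is only an illustration: the helper names are made up, and `3.2.0` is simply the version reported as working in this thread, not a verified minimum (the exact cutoff is not stated here).

```python
# Version reported as working in this thread; an assumption, not a verified minimum.
KNOWN_WORKING = "3.2.0"


def version_tuple(version):
    """Turn a string like '2.13.2' into (2, 13, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '3.2' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def older_than_known_working(installed, known=KNOWN_WORKING):
    """Return True if the installed version is below the known-working one."""
    return version_tuple(installed) < version_tuple(known)


# Typical use, assuming datasets is installed:
#   import datasets
#   print(older_than_known_working(datasets.__version__))
```

For the versions mentioned in this thread, `older_than_known_working("2.13.2")` is true and `older_than_known_working("3.2.0")` is false.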

Thank you! I tried to run the program on Colab with a newer version of datasets and everything works fine.
