Load_dataset split='test' not working again

denis-kazakov · April 19, 2025, 8:11am

I know this question was already asked in "Load_dataset split='test' not working", but the answers there do not help.
I successfully downloaded the test split of the Bulgarian Common Voice dataset less than a month ago. Now I am trying to do the same for other languages, but it starts downloading the train split. Upgrading the datasets library did not help.
I run it in a Jupyter notebook with this command:

from datasets import load_dataset

cv_17 = load_dataset(
  "mozilla-foundation/common_voice_17_0",
  "pl",
  split="test",
  cache_dir='/media/denis/D/Datasets/Audio/',
  token=''
)

Environment data:
datasets-cli env

Copy-and-paste the text below in your GitHub issue.

datasets version: 3.5.0
Platform: Linux-6.8.0-58-generic-x86_64-with-glibc2.39
Python version: 3.11.11
huggingface_hub version: 0.29.3
PyArrow version: 19.0.0
Pandas version: 2.2.3
fsspec version: 2024.12.0

denis-kazakov · April 19, 2025, 8:28am

P.S. I just did it for Lithuanian. Apparently, it downloaded all splits and then threw away the train and dev splits. This is the output of cv_17:
Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
    num_rows: 4753
})

denis-kazakov · April 19, 2025, 8:30am

They are not thrown away; they are hidden in the cache. But I would still like to load only the split I need, to save bandwidth and disk space with larger datasets.
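To see how much space the cached files actually take, something like the following rough sketch could help. It only uses the standard library and makes no assumptions about how `datasets` lays out its cache; the helper names `dir_size_bytes` and `sizes_by_subdir` are made up for illustration.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def sizes_by_subdir(cache_dir):
    """Map each immediate subdirectory of `cache_dir` to its size in bytes."""
    sizes = {}
    for entry in os.scandir(cache_dir):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size_bytes(entry.path)
    return sizes
```

Calling `sizes_by_subdir('/media/denis/D/Datasets/Audio/')` would give a dictionary of subdirectory names and byte counts, which should show where the extra data ended up.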

John6666 · April 19, 2025, 8:46am

Hmm… It seems to work…

from datasets import load_dataset

HF_TOKEN = "hf_my_valid_read_token_***"

cv_17 = load_dataset(
  "mozilla-foundation/common_voice_17_0",
  "pl",
  split="test",
  #cache_dir='/media/denis/D/Datasets/Audio/',
  #token='',
  token=HF_TOKEN,
  trust_remote_code=True # added
)

print(cv_17)
# Dataset({
#     features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
#     num_rows: 9230
# })

"""
Copy-and-paste the text below in your GitHub issue.

- `datasets` version: 3.2.0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.9.13
- `huggingface_hub` version: 0.30.2
- PyArrow version: 18.1.0
- Pandas version: 2.2.2
- `fsspec` version: 2024.5.0
"""
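If the goal is to avoid pulling every split to disk, streaming mode might be worth a try. The sketch below assumes a `datasets` 3.x install like the ones in this thread (where `trust_remote_code` is still accepted) and a token with access to the gated dataset. With `streaming=True`, `load_dataset` returns an iterable dataset that fetches data as you iterate instead of downloading everything up front. I have not measured how much of each archive actually gets read for Common Voice, so treat it as something to test rather than a confirmed fix. The helper names `load_split_streaming`, `take_examples`, and `demo` are made up for illustration.

```python
from itertools import islice


def load_split_streaming(repo_id, config, split, token=None):
    """Open one split in streaming mode (nothing is written to the dataset cache up front)."""
    # Imported lazily so take_examples() below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        config,
        split=split,
        streaming=True,          # returns an iterable dataset, read lazily
        token=token,
        trust_remote_code=True,  # assumption: datasets 3.x, as in the call above
    )


def take_examples(examples, n):
    """Return the first `n` items of any iterable as a list."""
    return list(islice(examples, n))


def demo(token):
    """Not called automatically: needs network access and a valid token."""
    cv_17_test = load_split_streaming(
        "mozilla-foundation/common_voice_17_0", "pl", "test", token=token
    )
    for example in take_examples(cv_17_test, 3):
        print(example["sentence"])
```

Calling `demo(HF_TOKEN)` should print the first three test sentences. The trade-off is that a streamed dataset has no random access or `len()`, so it suits one-pass evaluation better than repeated shuffled training.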
