Load_dataset split='test' not working again

I know this question was already asked here: Load_dataset split='test' not working, but it does not help.
I successfully downloaded the test split of the Bulgarian Common Voice dataset less than a month ago. Now, when I try to do the same with other languages, it starts downloading the train split instead. Upgrading the datasets library did not help.
I run it in a Jupyter notebook with this command:

from datasets import load_dataset

cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "pl",
    split="test",
    cache_dir='/media/denis/D/Datasets/Audio/',
    token='',
)

Environment data (output of datasets-cli env):

Copy-and-paste the text below in your GitHub issue.

  • datasets version: 3.5.0
  • Platform: Linux-6.8.0-58-generic-x86_64-with-glibc2.39
  • Python version: 3.11.11
  • huggingface_hub version: 0.29.3
  • PyArrow version: 19.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0

P.S. I just did it for Lithuanian. Apparently, it downloaded all splits and then threw away the train and dev splits. This is the output of cv_17:
Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
    num_rows: 4753
})
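To confirm where those splits actually went, a quick check of the cache directory from the original call (path as above; this is just an illustrative sketch using the standard library) could look like this:

import os

# Walk the cache_dir used in the load_dataset call and sum what was downloaded;
# all splits end up here, not only the "test" split that was requested.
cache_dir = '/media/denis/D/Datasets/Audio/'
total_bytes = 0
for root, _dirs, files in os.walk(cache_dir):
    for name in files:
        total_bytes += os.path.getsize(os.path.join(root, name))
print(f"{total_bytes / 1e9:.2f} GB under {cache_dir}")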


They are not thrown away; they are hidden in the cache. But I would still like to load only the split that I need, to save bandwidth and disk space with larger datasets.
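If the goal is to avoid downloading the other splits at all, streaming mode may help, since it fetches records lazily instead of writing the full dataset to the cache. A minimal sketch, assuming the dataset works in streaming mode and that you pass a valid read token (the hf_... value is a placeholder):

from datasets import load_dataset

# streaming=True returns an IterableDataset; records are fetched on the fly
# rather than being downloaded and cached up front.
cv_17_stream = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "pl",
    split="test",
    streaming=True,
    token="hf_...",  # placeholder: use your own read token
)

# Peek at a few examples without materializing the whole split on disk.
for i, example in enumerate(cv_17_stream):
    print(example["sentence"])
    if i >= 2:
        break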


Hmm… It seems to work…

from datasets import load_dataset

HF_TOKEN = "hf_my_valid_read_token_***"

cv_17 = load_dataset(
  "mozilla-foundation/common_voice_17_0",
  "pl",
  split="test",
  #cache_dir='/media/denis/D/Datasets/Audio/',
  #token='',
  token=HF_TOKEN,
  trust_remote_code=True # added
)

print(cv_17)
# Dataset({
#     features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
#     num_rows: 9230
# })

"""
Copy-and-paste the text below in your GitHub issue.

- `datasets` version: 3.2.0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.9.13
- `huggingface_hub` version: 0.30.2
- PyArrow version: 18.1.0
- Pandas version: 2.2.2
- `fsspec` version: 2024.5.0
"""