I know this question was already asked here ("Load_dataset split='test' not working"), but that thread did not help.
Less than a month ago I successfully downloaded only the test split of the Bulgarian Common Voice dataset. Now I am trying to do the same for other languages, but it starts downloading the train split. Upgrading the datasets library did not help.
I run it in a Jupyter notebook with this command:
from datasets import load_dataset

cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "pl",
    split="test",
    cache_dir='/media/denis/D/Datasets/Audio/',
    token=''
)
Environment data:
datasets-cli env
Copy-and-paste the text below in your GitHub issue.
- `datasets` version: 3.5.0
- Platform: Linux-6.8.0-58-generic-x86_64-with-glibc2.39
- Python version: 3.11.11
- `huggingface_hub` version: 0.29.3
- PyArrow version: 19.0.0
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
P.S. I just did it for Lithuanian. Apparently, it downloaded all splits and then threw away the train and dev splits. This is the output of cv_17:
Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
    num_rows: 4753
})
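One way to check what actually ended up on disk is to sum the file sizes under the cache directory. The following is only a rough sketch: `dir_sizes` is a hypothetical helper, not part of `datasets`, and it assumes the `cache_dir` from the call above holds both the downloaded archives and the prepared files.

```python
import os


def dir_sizes(root):
    """Return {immediate subdirectory name: total bytes of regular files beneath it}.

    Hypothetical helper for inspecting a cache directory.
    """
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                file_path = os.path.join(dirpath, name)
                # Skip symlinks so a file that is linked from elsewhere is not counted twice.
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
        sizes[entry.name] = total
    return sizes


if __name__ == "__main__":
    cache_dir = "/media/denis/D/Datasets/Audio/"  # path taken from the call above
    if os.path.isdir(cache_dir):
        for name, size in sorted(dir_sizes(cache_dir).items()):
            print(f"{name}: {size / 1e9:.2f} GB")
```

If the total is much larger than the test split alone should be, the other splits are most likely still sitting in the cache.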
They are not thrown away; they are kept in the cache. But I would still like to load only the split I need, to save bandwidth and disk space with larger datasets.
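One option that might help here is streaming mode, where `load_dataset(..., streaming=True)` returns an iterable dataset and reads data lazily instead of preparing every split in the cache. This is a sketch, not a confirmed fix: it assumes `datasets` 3.x (where this script-based dataset needs `trust_remote_code=True`), a read token in the `HF_TOKEN` environment variable, and a hypothetical helper `single_split_kwargs` that only assembles the arguments. Whether the loading script then fetches only the test archives is not verified in this thread.

```python
import os


def single_split_kwargs(repo_id, config, split, token=None):
    """Build load_dataset keyword arguments for a streamed, single-split load.

    Hypothetical helper, not part of the datasets library.
    """
    kwargs = {
        "path": repo_id,
        "name": config,
        "split": split,
        "streaming": True,          # iterate lazily instead of preparing Arrow files
        "trust_remote_code": True,  # assumption: datasets 3.x with a script-based dataset
    }
    if token:
        kwargs["token"] = token
    return kwargs


# Only touch the network when a token is available in the environment.
if __name__ == "__main__" and os.environ.get("HF_TOKEN"):
    from datasets import load_dataset

    cv_17_test = load_dataset(
        **single_split_kwargs(
            "mozilla-foundation/common_voice_17_0",
            "pl",
            "test",
            token=os.environ["HF_TOKEN"],
        )
    )
    # Fetching an example also decodes the audio column,
    # which requires an audio backend to be installed.
    first = next(iter(cv_17_test))
    print(first["sentence"])
```

The trade-off is that a streamed dataset has no random access and no `num_rows`; it has to be iterated from the start each time.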
Hmm… It seems to work…
from datasets import load_dataset

HF_TOKEN = "hf_my_valid_read_token_***"

cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "pl",
    split="test",
    #cache_dir='/media/denis/D/Datasets/Audio/',
    #token='',
    token=HF_TOKEN,
    trust_remote_code=True  # added
)
print(cv_17)
# Dataset({
#     features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
#     num_rows: 9230
# })
"""
Copy-and-paste the text below in your GitHub issue.
- `datasets` version: 3.2.0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.9.13
- `huggingface_hub` version: 0.30.2
- PyArrow version: 18.1.0
- Pandas version: 2.2.2
- `fsspec` version: 2024.5.0
"""