How to download a subset of a scripted dataset

Rong-Tao · December 6, 2023, 1:05pm

Title: Help with Downloading a Specific Subset (Dutch) from OSCAR-2109 Dataset

Hi Hugging Face Community,

I’m new to using the Hugging Face datasets library and I’m a bit confused about how to download a specific subset of a dataset. I’ve seen in many tutorials that it is usually straightforward, something like:

dataset = load_dataset('glue', 'mrpc', split='train')

However, my use case is a bit different. I’m interested in downloading only the Dutch (nl) subset from the OSCAR-2109 dataset available at this link.
I understand that OSCAR is listed among the available datasets and there are multiple versions of it. The OSCAR-2109.py class seems to be provided for handling the dataset, but I’m not sure how to use it to download only the Dutch subset.

Could someone guide me on how to proceed with downloading only a specific part of a dataset, or point me to the resources or examples that could help me figure this out?

Thank you in advance for your help!

mikehemberger · December 6, 2023, 4:05pm

Hi there @Rong-Tao,
I found this in the docs under loading text data:

from datasets import load_dataset
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

In your case, check which files you want in the dataset repository, then use the * pattern to gather the files you want to work with.

A bit of caution: I haven’t tried this myself yet and it might not work but I hope it provides a hint for you to solve your problem.
Best,
M
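To make the * pattern from the post above concrete, here is a minimal offline sketch of which shard names it would select. It assumes the C4 English training shards are named c4-train.NNNNN-of-01024.json.gz, as the pattern suggests, and uses Python's fnmatch as a stand-in for the glob matching that data_files performs; the matching inside the datasets library may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Assumed shard naming, inferred from the pattern quoted in the thread.
shard_names = [f"en/c4-train.{i:05d}-of-01024.json.gz" for i in range(1024)]

pattern = "en/c4-train.0000*-of-01024.json.gz"

# Keep only the shard names the wildcard pattern selects.
matched = [name for name in shard_names if fnmatchcase(name, pattern)]

print(len(matched))
print(matched[0], "...", matched[-1])
```

Under that naming assumption, the pattern picks up only the shards whose index starts with 0000, so a small slice of the corpus is downloaded instead of all 1024 files.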

mariosasko · December 6, 2023, 4:18pm

Hi! You can load this version of the dataset with dataset = load_dataset("oscar-corpus/OSCAR-2109", "original_nl", split="train"). Also, this dataset is gated, so you need to log in locally using the huggingface-cli login command before loading the dataset.
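As a rough sketch of the suggestion above, the config name can be built from the language code and passed to load_dataset. The helper names oscar_config_name and load_oscar_subset are hypothetical, and the assumption that other languages follow the same original_<code> naming as original_nl is not confirmed in this thread. The loader needs network access and a prior huggingface-cli login, so it is defined here but not called.

```python
def oscar_config_name(lang_code: str) -> str:
    """Build a config name like 'original_nl' (hypothetical helper).

    Only 'original_nl' is confirmed in the thread; applying the same
    scheme to other language codes is an assumption.
    """
    return f"original_{lang_code}"


def load_oscar_subset(lang_code: str, split: str = "train"):
    """Load one language subset of OSCAR-2109 (hypothetical helper).

    The dataset is gated: run `huggingface-cli login` first.
    """
    # Imported lazily so the naming helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar-corpus/OSCAR-2109",
        oscar_config_name(lang_code),
        split=split,
    )


print(oscar_config_name("nl"))
```

Calling load_oscar_subset("nl") would then be equivalent to the one-liner in the post above.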

mikehemberger · December 6, 2023, 4:51pm

Hi @mariosasko,
Thanks for your answer. I was thinking in this direction:

ds = load_dataset("oscar-corpus/OSCAR-2109", data_files="packaged/nl/nl_part_*.txt.gz")

Would that work too?

mariosasko · December 6, 2023, 5:33pm

No, because this dataset has a loading script, but this would:

# use the HfFileSystem to specify the glob path (alternatively, a list of the files' HTTP URLs can be used instead)
ds = load_dataset("text", data_files="hf://datasets/oscar-corpus/OSCAR-2109/packaged/nl/nl_part_*.txt.gz")

However, the loading script applies additional preprocessing to the data, so the result wouldn’t be the same.
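The raw-file route above could be wrapped as follows. This is only a sketch: oscar_text_glob and load_oscar_raw_text are hypothetical helpers, and the packaged/<code>/<code>_part_*.txt.gz layout is confirmed in this thread for nl only, so other language codes may need a different pattern. The loader is defined but not called, since it needs network access and a logged-in account for the gated repository.

```python
REPO_ROOT = "hf://datasets/oscar-corpus/OSCAR-2109"


def oscar_text_glob(lang_code: str) -> str:
    """Build the hf:// glob for one language's raw text shards.

    Assumption: every language uses the same layout as 'nl'.
    """
    return f"{REPO_ROOT}/packaged/{lang_code}/{lang_code}_part_*.txt.gz"


def load_oscar_raw_text(lang_code: str, streaming: bool = True):
    """Read the raw shards with the generic 'text' builder.

    This bypasses the loading script, so its extra preprocessing
    is not applied and the result differs from the scripted config.
    """
    # Imported lazily so the path helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "text",
        data_files=oscar_text_glob(lang_code),
        split="train",
        streaming=streaming,  # iterate lazily instead of downloading everything first
    )


print(oscar_text_glob("nl"))
```

Streaming is used here as one option for inspecting a few lines before committing to a full download; passing streaming=False gives the same behavior as the call in the post above.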

mikehemberger · December 6, 2023, 5:45pm

Great. Thank you

Rong-Tao · December 7, 2023, 1:31am

Holy, you guys are fast and nice. Thanks a lot!
