How to download a subset of a script-based dataset

Title: Help with Downloading a Specific Subset (Dutch) from OSCAR-2109 Dataset


Hi Hugging Face Community,

I’m new to using the Hugging Face datasets library and I’m a bit confused about how to download a specific subset of a dataset. I’ve seen in many tutorials that it is usually straightforward, something like:

dataset = load_dataset('glue', 'mrpc', split='train')

However, my use case is a bit different. I’m interested in downloading only the Dutch (nl) subset from the OSCAR-2109 dataset available at this link.
I understand that OSCAR is listed among the available datasets and there are multiple versions of it. The OSCAR-2109.py class seems to be provided for handling the dataset, but I’m not sure how to use it to download only the Dutch subset.

Could someone guide me on how to proceed with downloading only a specific part of a dataset, or point me to the resources or examples that could help me figure this out?

Thank you in advance for your help!

Hi there @Rong-Tao,
I found this in the docs, under loading text data:

from datasets import load_dataset
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

In your case, check which files you want here:

Then use the * pattern to match the files you want to work with.

A bit of caution: I haven’t tried this myself yet and it might not work but I hope it provides a hint for you to solve your problem.
Best,
M

Hi! You can load this version of the dataset with dataset = load_dataset("oscar-corpus/OSCAR-2109", "original_nl", split="train"). Also, this dataset is gated, so you need to log in locally using the huggingface-cli login command before loading the dataset.
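
As a minimal sketch, the call above can be wrapped in a small helper. It assumes that other languages follow the same "original_<language code>" naming as the "original_nl" example, which this thread does not confirm. The helper names oscar_config and load_oscar_language are hypothetical, and the dataset is gated, so huggingface-cli login needs to have been run beforehand.

```python
def oscar_config(lang: str) -> str:
    """Build a config name such as "original_nl" from a language code.

    Assumption: other languages follow the same "original_<code>" pattern
    as the "original_nl" example given in this thread.
    """
    lang = lang.strip().lower()
    if not lang:
        raise ValueError("language code must not be empty")
    return f"original_{lang}"


def load_oscar_language(lang: str = "nl"):
    """Load the train split of one OSCAR-2109 language (needs a prior login)."""
    # Imported lazily so oscar_config() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("oscar-corpus/OSCAR-2109", oscar_config(lang), split="train")
```

Calling load_oscar_language("nl") is then equivalent to the one-liner in the reply above.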

Hi @mariosasko,
Thanks for your answer. I was thinking in this direction:

ds = load_dataset("oscar-corpus/OSCAR-2109", data_files="packaged/nl/nl_part_*.txt.gz")

Would that work too?

No, because this dataset has a loading script, but this would:

# use the HfFileSystem to specify the glob path (alternatively, a list of the files' HTTP URLs can be used instead)
ds = load_dataset("text", data_files="hf://datasets/oscar-corpus/OSCAR-2109/packaged/nl/nl_part_*.txt.gz")

However, the loading script applies additional preprocessing to the data, so the result wouldn’t be the same.
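
The comment in the snippet above mentions passing a list of HTTP URLs instead of an hf:// glob. A rough sketch of that alternative follows. It assumes the usual https://huggingface.co/datasets/<repo>/resolve/main/<path> URL layout and uses list_repo_files from huggingface_hub to enumerate the repository; the helper names select_urls and load_text_subset are hypothetical. Whether plain URLs authenticate against a gated repository may depend on your setup, so the hf:// form is probably the safer default.

```python
from fnmatch import fnmatchcase

# Assumed URL layout for files on the main branch of a dataset repository.
HUB_URL = "https://huggingface.co/datasets/{repo_id}/resolve/main/{path}"


def select_urls(repo_id, filenames, pattern):
    """Return download URLs for the file names matching a glob pattern.

    Note: fnmatch lets "*" match across "/", which is looser than a
    filesystem-style glob.
    """
    matched = sorted(name for name in filenames if fnmatchcase(name, pattern))
    return [HUB_URL.format(repo_id=repo_id, path=name) for name in matched]


def load_text_subset(repo_id, pattern):
    """List the repo's files, keep those matching `pattern`, load them as text."""
    # Imported lazily so select_urls() works without these packages installed.
    from datasets import load_dataset
    from huggingface_hub import list_repo_files

    filenames = list_repo_files(repo_id, repo_type="dataset")
    urls = select_urls(repo_id, filenames, pattern)
    return load_dataset("text", data_files=urls)
```

For example, load_text_subset("oscar-corpus/OSCAR-2109", "packaged/nl/nl_part_*.txt.gz") would read the same raw files as the hf:// call, again without the loading script's preprocessing.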

Great. Thank you

Holy, you guys are fast and nice. Thanks a lot! :hugs: :hugs: :hugs: