Loading part of a dataset by a specified feature value

Hi there!

I want to load just the ‘tweets’ part of this dataset: https://huggingface.co/datasets/jorgeortizfuentes/chilean-spanish-corpus/viewer/default/train?p=100000

In other words, I want to know if there is an option to tell Hugging Face that I only want the rows where ‘source’ = ‘twitter’. I don’t know if there is a way to do this from the load_dataset() method. Any guidance would be super helpful. I am trying to create word embeddings for a word that only occurs in informal speech, so I do not need the rest of the dataset. I want to load just the tweets in the quickest way possible.

Thanks,

Joe


Hi! Yes, it’s possible to pass filters= to load_dataset for Parquet datasets since… yesterday 😛

See the datasets 3.2 release notes: Release 3.2.0 · huggingface/datasets · GitHub

Example:

from datasets import load_dataset
filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
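
Applied to the dataset in the question, the same idea might look like the sketch below. It assumes the repo is served as Parquet, that the column is named 'source' and the tweet rows carry the value 'twitter' (both taken from the question, so check them in the dataset viewer), and that datasets >= 3.2.0 is installed. The helper names source_filter, load_tweets and filter_after_load are made up for illustration. The last function is a fallback for older versions that uses the regular .filter() method after loading.

```python
def source_filter(value, column="source"):
    """Build a filters= list of (column, operator, value) tuples."""
    return [(column, "==", value)]


def load_tweets(streaming=True):
    """Load only the rows whose 'source' column equals 'twitter'.

    Assumptions: the repo is Parquet-backed and datasets >= 3.2.0 is installed.
    """
    # Imported here so the helpers above stay usable without the library.
    from datasets import load_dataset

    return load_dataset(
        "jorgeortizfuentes/chilean-spanish-corpus",
        split="train",
        streaming=streaming,
        filters=source_filter("twitter"),
    )


def filter_after_load(ds, value, column="source"):
    """Fallback for older versions: filter rows after the dataset is loaded."""
    return ds.filter(lambda row: row[column] == value)
```

With streaming=True, load_tweets() returns an iterable dataset, so rows are only fetched as you loop over them. The fallback gives the same rows but still has to read the whole dataset, so filters= should be the quicker route when it is available.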