Loading a part of a dataset from a specified feature value

jelarson · December 7, 2024, 8:54pm

Hi there!

I want to load just the ‘tweets’ part of this dataset: https://huggingface.co/datasets/jorgeortizfuentes/chilean-spanish-corpus/viewer/default/train?p=100000

In other words, is there an option to tell Hugging Face that I just want the rows where ‘source’ = ‘twitter’? I didn’t know if there was a way to do this from the load_dataset() method. Any guidance would be super helpful. I am trying to create word embeddings for a word that only occurs in informal speech, so I do not need the rest of the dataset. I want to load just the tweets in the quickest way possible.

Thanks,

Joe

lhoestq · December 11, 2024, 3:46pm

Hi! Yes, it’s possible to pass filters= to load_dataset for Parquet datasets since… yesterday.

See the datasets 3.2.0 release notes (huggingface/datasets on GitHub).

Example:

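from datasets import load_dataset
# Each filter is a (column, operator, value) tuple
filters = [('date', '>=', '2023')]
# streaming=True reads rows lazily instead of downloading the whole dataset first
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)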
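Applied to the dataset from the question, the same idea might look roughly like the sketch below. It assumes the repo is served as Parquet, that the column really is named "source", and that tweets carry the value "twitter"; none of that is confirmed in this thread, so check the dataset viewer first. The helper names build_source_filter and load_tweets are made up for illustration.

```python
# Sketch: apply the filters= argument to the dataset from the question.
# Assumptions (not confirmed in this thread): the repo is served as Parquet,
# the column is named "source", and tweets carry the value "twitter".

REPO_ID = "jorgeortizfuentes/chilean-spanish-corpus"


def build_source_filter(value, column="source"):
    """Return a filter list in the same (column, op, value) form used above."""
    return [(column, "==", value)]


def load_tweets(repo_id=REPO_ID, source_value="twitter"):
    """Stream only the rows whose source column equals source_value."""
    # Imported here so the helper above can be used without datasets installed.
    from datasets import load_dataset  # filters= needs datasets >= 3.2

    return load_dataset(
        repo_id,
        split="train",
        streaming=True,
        filters=build_source_filter(source_value),
    )


# Usage (downloads data, so left commented out):
# tweets = load_tweets()
# for row in tweets.take(3):
#     print(row)
```

On older versions of datasets, loading with streaming=True and then calling .filter(lambda row: row["source"] == "twitter") on the result is an alternative, though it still has to read every row.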
