Download only a subset of a split

morenolq · April 3, 2022, 9:22am

Hi,

I was wondering if there is a way to download only part of the data of a dataset.
In my specific case, I need to download only X samples from oscar English split (X~100K samples).
When I try to invoke the dataset builder it asks for >1TB of space so I think it will download the full set of data at the beginning.

merve · April 4, 2022, 10:36am

Hello

You can load part of a split by slicing:

train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]')

You can refer to more ways of slicing and loading here.

morenolq · April 4, 2022, 4:41pm

Thank you, can I also use streaming mode to achieve the same?

lhoestq · April 7, 2022, 9:27am

Hi, let me just complete this: split='train[10:20]' returns a slice of the data, but it still downloads everything.

If your dataset is too big, please use streaming mode. You can also slice your dataset in streaming mode, see the documentation here: Stream
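
A minimal sketch of what this could look like, assuming the datasets library is installed. The helper names (first_n, stream_first_n) are made up for illustration and are not part of the library; load_dataset(..., streaming=True) returns an iterable dataset, so only the examples you iterate over should be fetched.

```python
from itertools import islice


def first_n(examples, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(examples, n))


def stream_first_n(path, n, name=None, split="train"):
    """Stream a Hub dataset and keep only its first n examples.

    `path` and `name` are whatever dataset id and config you need;
    nothing here is specific to one dataset.
    """
    # Imported lazily so first_n can be used without datasets installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the full split up front.
    stream = load_dataset(path, name, split=split, streaming=True)
    return first_n(stream, n)
```

For the original question this would be something like stream_first_n("oscar", 100_000, name=...), with the config name taken from the dataset page. The streamed dataset also has its own take(n) and skip(n) methods if you prefer to stay inside the library.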

morenolq · April 7, 2022, 12:27pm

Thank you, that’s the case indeed.

vesuppi · April 19, 2023, 9:40pm

When I stream a dataset that’s too big, it always seems to get stuck after a certain point (the program hangs and makes no progress), for example after 10000 samples. After a long time it prints something like "reconnect to data host". Any reason for that? Thanks!

lhoestq · May 2, 2023, 12:38pm

It depends on the host. Some datasets are hosted on HF, but others have their data files hosted by the original dataset author or platform. You can check how the dataset is loaded by looking at its repository on HF. Which dataset did you try to load?

lucasjin · June 19, 2024, 2:00pm

Same issue here.

I just need 10-20 parquet files out of all 67069567…

I can't download them all.

lhoestq · June 21, 2024, 10:55am

Datasets are now generally hosted on HF. You can pass the data_files= argument to load_dataset to load only a subset of the data files with the datasets library.

srinjoyMukherjee · November 20, 2024, 8:42am

Thanks for the answer, but what if I just want to specify the number of files I want without actually listing their names? Is that possible? Let's say I just want to download 20 data files without listing their names in data_files.

Thanks in advance
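
One possible approach, as a sketch rather than an official recipe: list the repository's files with huggingface_hub, keep the first N parquet files, and pass that list to data_files. The helper names (pick_first_files, load_first_n_files) are hypothetical, and the code assumes both huggingface_hub and datasets are installed and that the repo stores its data as parquet files.

```python
def pick_first_files(filenames, n, suffix=".parquet"):
    """Keep the first n names ending in suffix, in sorted order."""
    matching = sorted(f for f in filenames if f.endswith(suffix))
    return matching[:n]


def load_first_n_files(repo_id, n, split="train"):
    """Load only the first n parquet files of a Hub dataset repo."""
    # Imported lazily so pick_first_files works without these packages.
    from datasets import load_dataset
    from huggingface_hub import list_repo_files

    # Lists file paths in the repo; no data files are downloaded here.
    all_files = list_repo_files(repo_id, repo_type="dataset")
    subset = pick_first_files(all_files, n)
    # With a plain list of data_files, everything lands in the "train" split.
    return load_dataset(repo_id, data_files=subset, split=split)
```

Note that the sort is purely lexicographic, so if the repo keeps several splits or configs in different folders you may want to filter the file list by folder first.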

neonwatty · February 25, 2025, 2:50pm

Faced this issue as well so wrote a short script that pulls a hub dataset, creates a small sample of it, and pushes the sample data to the hub as a new dataset.

import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to cache the downloaded dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build the name of the sample repo under the given user namespace
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{username}/{dataset_name}-sample-{sample_count}"

    # Load the full dataset (this still downloads all of it into cache_dir)
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample sample_count rows from the train and test splits (modify for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push to hub
    train_sample.push_to_hub(dataset_name_sample, split="train")
    print("INFO: Train split pushed to the hub successfully")

    test_sample.push_to_hub(dataset_name_sample, split="test")
    print("INFO: Test split pushed to the hub successfully")

The full gist is here.
