Some datasets are huge, which makes it impractical to load them entirely from the Hugging Face Hub with load_dataset() when debugging code. In that situation you only need part of the dataset, say the first 10k rows. But how?
I know it is possible to load part of a dataset into memory with "slice splitting", but it appears that this still downloads the whole dataset first if it is not already cached.
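For reference, this is roughly what the slice syntax looks like (the dataset name is just a placeholder); note that load_dataset() still downloads the full dataset files to the cache before applying the slice:

from datasets import load_dataset

# Only the first 10k rows of the train split are returned,
# but the full dataset is downloaded to the local cache first.
subset = load_dataset("some-org/some-large-dataset", split="train[:10000]")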
I faced this issue as well, so I wrote a short script that pulls a Hub dataset, creates a small sample of it, and pushes the sample to the Hub as a new dataset.
import os
import datasets

def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a local cache directory for the full dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build the name of the sampled dataset, e.g. "my-data-sample-100"
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"

    # Load the full dataset (this still downloads everything once)
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits (adjust for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push both samples to the Hub under your namespace
    train_sample.push_to_hub(f"{username}/{dataset_name_sample}", split="train")
    print("INFO: Train split pushed to the hub successfully")
    test_sample.push_to_hub(f"{username}/{dataset_name_sample}", split="test")
    print("INFO: Test split pushed to the hub successfully")
Once the sample is pushed, you have a smaller version of your dataset on the Hub to pull from.
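For example (the dataset and username below are placeholders), you would run the function once and then load the small copy directly in your debugging code:

from datasets import load_dataset

# One-off: create and push the sampled dataset
create_sample_dataset("some-org/big-dataset", sample_count=100, username="my-username")

# Afterwards, load only the small sample instead of the full dataset
sample = load_dataset("my-username/big-dataset-sample-100")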