Loading just part of a dataset

Some datasets are huge, so it is impractical to download all of them from the Hugging Face Hub with load_dataset() while debugging code. In that case you want to load just part of the dataset, say the first 10k rows. But how?

I know it is possible to load part of a dataset into memory with split slicing (e.g. split="train[:10000]"), but it appears that this still downloads the whole dataset first if it is not cached.
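For reference, the split-slice string follows ordinary Python slice semantics over row indices. A minimal offline sketch of what the different slice strings select (the real load_dataset call is shown only in a comment, since it would trigger a download; "imdb" is just an illustrative dataset name):

```python
# The real call would be (downloads the whole dataset if not cached):
# from datasets import load_dataset
# ds = load_dataset("imdb", split="train[:10000]")  # "imdb" is illustrative

# The slice string maps onto ordinary Python slicing over row indices:
rows = list(range(100))   # stand-in for a 100-row split
first_ten = rows[:10]     # what split="train[:10]" selects
last_ten = rows[-10:]     # what split="train[-10:]" selects
middle = rows[20:30]      # what split="train[20:30]" selects
```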

You can stream the dataset instead: streaming doesn’t download the full dataset up front, and lets you start reading rows instantly :slight_smile:
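A minimal sketch of the streaming pattern. The real datasets call is shown in a comment (it fetches shards lazily over the network; "imdb" is just an illustrative name), and the underlying lazy-consumption idea is demonstrated offline with a generator:

```python
from itertools import islice

# The real call would be (no full download; shards are fetched as you iterate):
# from datasets import load_dataset
# stream = load_dataset("imdb", split="train", streaming=True)
# first_10k = list(stream.take(10_000))

# The underlying idea: materialize only the first n items of a lazy iterator.
def lazy_rows(n_total):
    for i in range(n_total):
        yield {"id": i, "text": f"row {i}"}

# Only 3 rows are ever produced, even though a million are "available".
first_three = list(islice(lazy_rows(1_000_000), 3))
```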


Thank you very much, that’s it. Great!

Yes, but you still spend download bandwidth every time, and the slicing happens on your machine, in memory.

Faced this issue as well, so I wrote a short script that pulls a Hub dataset, creates a small sample of it, and pushes the sample back to the Hub as a new dataset.

import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to cache the downloaded dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build the repo id for the sampled dataset, e.g. "my-username/imdb-sample-100"
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{username}/{dataset_name}-sample-{sample_count}"

    # Load the full dataset (downloads it once into cache_dir)
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits
    # (modify for other split names)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push both samples to the same new repo on the Hub, one split each
    train_sample.push_to_hub(dataset_name_sample, split="train")
    print("INFO: Train split pushed to the hub successfully")

    test_sample.push_to_hub(dataset_name_sample, split="test")
    print("INFO: Test split pushed to the hub successfully")

Once sampled and pushed, you have a smaller version of your dataset on the Hub to pull from.
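Pulling the sample afterwards is then a normal load_dataset call on the new repo id. A small sketch, where the repo id "my-username/imdb-sample-100" is hypothetical and just reproduces the naming scheme the script uses (push_to_hub creates the repo under your namespace):

```python
# Real call (hypothetical repo id, matching the script's naming scheme):
# from datasets import load_dataset
# sample = load_dataset("my-username/imdb-sample-100")

# The repo id the script ends up with, reproduced here for clarity:
def sample_repo_id(full_dataset_name, sample_count, username):
    dataset_name = full_dataset_name.split("/")[-1]
    return f"{username}/{dataset_name}-sample-{sample_count}"

repo_id = sample_repo_id("stanfordnlp/imdb", 100, "my-username")
```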

The full gist is here.
