Small Python dataset

ikamensh · May 7, 2022, 8:54am

Greetings,
I’d like to do simple experiments on a programming language dataset, so I’m looking for something like code-parrot but way smaller.

I’m new to the datasets library, so I wanted to ask: is there a way to automatically get a small fraction of an existing dataset, e.g. by adding some flag or “small” to the dataset name?

Also, a dataset I found on the HF Hub appears to be broken. Should this be reported as an issue?

from datasets import load_dataset

dataset = load_dataset("formermagic/github_python_1m")  # Error: 
# FileNotFoundError: Couldn't find a dataset script at /Users/USERNAME/PycharmProjects/parrot/formermagic/github_python_1m/github_python_1m.py or any data file in the same directory. Couldn't find 'formermagic/github_python_1m' on the Hugging Face Hub either: FileNotFoundError: The dataset repository at 'formermagic/github_python_1m' doesn't contain any data file.

Cheers,
Ilya

marshmellow77 · May 7, 2022, 3:53pm

Hi Ilya

Yes, it’s possible to load only a fraction of a dataset. Check out this documentation for various options: Load
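To make that concrete, here is a minimal sketch of two common options: slicing a split by percentage, and streaming only the first few examples. The helper names (`percent_slice`, `load_fraction`, `stream_head`) are made up for illustration and are not part of the datasets library; only `load_dataset`, the `split="train[:1%]"` slicing syntax, `streaming=True` and `.take()` come from the library itself.

```python
def percent_slice(split: str, percent: int) -> str:
    """Return a split string such as 'train[:1%]' (the datasets slicing syntax)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_fraction(name: str, percent: int = 1, split: str = "train"):
    """Load the first `percent` percent of a split.

    Note: the full split is still downloaded and prepared before slicing.
    """
    # Lazy import (hypothetical helper): the library is only needed when loading.
    from datasets import load_dataset

    return load_dataset(name, split=percent_slice(split, percent))


def stream_head(name: str, n: int = 1000, split: str = "train"):
    """Stream a split and keep only the first `n` examples, avoiding a full download."""
    from datasets import load_dataset

    stream = load_dataset(name, split=split, streaming=True)
    return list(stream.take(n))
```

For example, `stream_head("some_user/some_python_dataset", n=500)` would return the first 500 examples as a list of dicts; the dataset name here is a placeholder, so substitute one that actually exists on the Hub. Streaming is usually the better fit for a very large code dataset, since percentage slicing only reduces what you keep, not what you download.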

Regarding the broken dataset: the repo does indeed seem to contain neither a data file nor a loading script, see formermagic/github_python_1m at main

A valid dataset should have one or the other (or both), see here: Share
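If you want to check this yourself before trying to load a repo, a rough sketch could look like the following. It assumes the huggingface_hub package is installed and uses its `list_repo_files` function; `looks_loadable`, `check_repo` and the extension list are hypothetical and only for illustration, so the real rules the library applies may well be broader.

```python
# Assumed, non-exhaustive set of data file extensions, for illustration only.
DATA_EXTENSIONS = (".csv", ".json", ".jsonl", ".parquet", ".txt")


def looks_loadable(repo_id: str, filenames: list[str]) -> bool:
    """True if the listing has a data file or a loading script named after the repo."""
    # A loading script is expected at the repo root as <repo_name>.py.
    script = repo_id.split("/")[-1] + ".py"
    for name in filenames:
        if name == script or name.lower().endswith(DATA_EXTENSIONS):
            return True
    return False


def check_repo(repo_id: str) -> bool:
    """Fetch the file listing of a dataset repo from the Hub and inspect it."""
    # Lazy import (hypothetical helper): only needed for the network call.
    from huggingface_hub import list_repo_files

    return looks_loadable(repo_id, list_repo_files(repo_id, repo_type="dataset"))
```

A repo that only holds something like `.gitattributes` and a `README.md` would come back as `False` here, which matches the FileNotFoundError above.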

In this case you could reach out to the publisher and see if they could rectify this.

Hope this helps!

Cheers
Heiko
