Small python dataset

Greetings,
I'd like to do simple experiments on a programming language dataset, so I'm looking for something like code-parrot but way smaller.

I'm new to the datasets library, so I wanted to ask: is there a way to automatically get a small fraction of an existing dataset, e.g. by adding some flag or "small" to the dataset name?

Also, a dataset I found on the HF Hub appears to be broken. Should this be reported as an issue?

from datasets import load_dataset

dataset = load_dataset("formermagic/github_python_1m")  # Error: 
# FileNotFoundError: Couldn't find a dataset script at /Users/USERNAME/PycharmProjects/parrot/formermagic/github_python_1m/github_python_1m.py or any data file in the same directory. Couldn't find 'formermagic/github_python_1m' on the Hugging Face Hub either: FileNotFoundError: The dataset repository at 'formermagic/github_python_1m' doesn't contain any data file.

Cheers,
Ilya

Hi Ilya

Yes, it's possible to load only a fraction of a dataset. Check out the "Load" page of the datasets documentation for the various options.
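As a rough sketch of one of those options: the `split` argument of `load_dataset` accepts slicing strings such as `"train[:1%]"`. The helper names `fraction_split` and `load_small` below are made up for illustration, and the dataset name in the usage comment is a placeholder, not a real repository. Note that, as far as I know, slicing a split still downloads the full dataset first; if that is a problem, `streaming=True` may be worth a look.

```python
def fraction_split(split: str = "train", percent: int = 1) -> str:
    """Build a split-slicing string such as 'train[:1%]'."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_small(name: str, percent: int = 1, split: str = "train"):
    """Load only the first `percent` percent of one split of a Hub dataset.

    Hypothetical convenience wrapper; `name` is whatever dataset you pick.
    """
    # Imported lazily so fraction_split works without the library installed.
    from datasets import load_dataset

    return load_dataset(name, split=fraction_split(split, percent))


# Usage (placeholder dataset name, replace with a real one):
# small = load_small("some_user/some_python_dataset", percent=1)
```
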

Regarding the broken dataset: it does indeed seem that there is neither a data file nor a loading script in the repo (formermagic/github_python_1m at main).

A valid dataset should have one or the other (or both); see the "Share" page of the datasets documentation.
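If you want to check a repo before trying to load it, something along these lines might help. It is only a sketch: `looks_loadable` and `repo_looks_loadable` are hypothetical helpers, and the set of file extensions is my assumption about common data formats, not the library's actual detection logic (compressed files such as `.json.gz` would be missed, for example). The script name check follows the pattern from the error message above, where the loader looked for `github_python_1m.py`.

```python
from pathlib import PurePosixPath

# Assumed set of common data-file extensions; not the library's exact list.
DATA_SUFFIXES = {".csv", ".json", ".jsonl", ".parquet", ".txt"}


def looks_loadable(files, repo_name):
    """Return True if the file list has a data file or a '<repo_name>.py' script."""
    script = f"{repo_name}.py"
    for f in files:
        path = PurePosixPath(f)
        if path.name == script or path.suffix.lower() in DATA_SUFFIXES:
            return True
    return False


def repo_looks_loadable(repo_id):
    """Fetch the file list of a dataset repo from the Hub and apply the check."""
    # Imported lazily so looks_loadable works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    files = list_repo_files(repo_id, repo_type="dataset")
    return looks_loadable(files, repo_id.split("/")[-1])
```

A repo containing only `README.md` and `.gitattributes` would fail this check, which matches what the error message reports for this dataset.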

In this case you could reach out to the publisher and see if they could rectify this.

Hope this helps!

Cheers
Heiko
