I want to run some experiments using data from the Pile, but I don't have nearly enough disk space for the full dataset. Is there an easy way to download only a small portion of it?
Try:
from datasets import load_dataset

num_samples_to_take = 1000
dataset_name = "EleutherAI/pile"

# Replace "subset_name" with one of: 'all', 'enron_emails', 'europarl', 'free_law', 'hacker_news', 'nih_exporter', 'pubmed', 'pubmed_central', 'ubuntu_irc', 'uspto', 'github'
# streaming=True reads records lazily instead of downloading the whole dataset first.
ds = load_dataset(dataset_name, "subset_name", split="train", streaming=True)
ds = ds.take(num_samples_to_take)  # keep only the first num_samples_to_take records
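If you want to keep the sampled records on disk so you don't have to stream them again each run, one option is to write them to a small JSONL file. This is only a rough sketch: `save_samples` and the `pile_sample.jsonl` file name are made up for illustration, it assumes each record is a JSON-serializable dict, and a generator stands in for the streamed dataset so it runs offline.

```python
import itertools
import json
import os
import tempfile


def save_samples(records, path, limit):
    """Write at most `limit` records from an iterable of dicts to a JSONL file."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        # islice stops after `limit` items, so the rest of the stream is never read
        for record in itertools.islice(records, limit):
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in for the streamed dataset (assumption: records are dicts).
fake_stream = ({"text": f"document {i}"} for i in range(10_000))

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "pile_sample.jsonl")
written = save_samples(fake_stream, out_path, limit=1000)
```

With the snippet above you would pass `ds` in place of `fake_stream`.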