I’m working with Hugging Face datasets and I need to split a dataset into training and validation sets. My main requirement is that the dataset should be processed in streaming mode, as I don’t want to load the entire dataset into memory.
from datasets import load_dataset
# Load a dataset from Hugging Face
dataset = load_dataset('squad', split='train')
# Split the dataset into training and validation sets;
# train_test_split returns a DatasetDict with 'train' and 'test' keys
train_val_split = dataset.train_test_split(test_size=0.1, seed=42)
# Extract the training and validation datasets
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']
# Print the size of the datasets
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")
# Save the datasets if needed
# train_dataset.save_to_disk('path/to/train_dataset')
# val_dataset.save_to_disk('path/to/val_dataset')
Is there a way to split a Hugging Face dataset in streaming mode? Any suggestions or improvements to my code would be greatly appreciated.
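From what I've read, passing streaming=True to load_dataset returns an IterableDataset, which has no train_test_split method and no known length, so a fractional test_size can't apply. The closest pattern I've found uses take() and skip() on a shuffled stream. Here is a minimal sketch of what I have in mind (the VAL_SIZE of 1,000 is just a placeholder I chose, since the split has to be an absolute count rather than a fraction):

from datasets import load_dataset

# Load the same dataset lazily; this returns an IterableDataset
streamed = load_dataset('squad', split='train', streaming=True)

# Approximate shuffle with a fixed seed and buffer, so both splits
# iterate over the stream in the same deterministic order
shuffled = streamed.shuffle(seed=42, buffer_size=10_000)

VAL_SIZE = 1_000  # placeholder: streaming gives no len(), so use a count
val_dataset = shuffled.take(VAL_SIZE)    # first VAL_SIZE examples
train_dataset = shuffled.skip(VAL_SIZE)  # everything after them

# IterableDataset has no len(); examples are consumed lazily
for example in val_dataset:
    pass  # e.g. feed into a training loop

My understanding is that shuffle() on an IterableDataset only shuffles within its buffer, so this split is only approximately random, which would be acceptable for my use case.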