Splitting dataset via length

sachin · September 1, 2022, 9:38am

So there are a few parts to this question.

How do I find the length of entire dataset?
How do I shuffle the dataset and assign the first N to valid and the rest to train?

Here is a code snippet with what I am trying to do:

import datasets

# Stream the dataset and shuffle it through a 10,000-example buffer.
data = datasets.load_dataset(
    "liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train"
).shuffle(seed=42, buffer_size=10_000)
c4_train = data.skip(10000)
# c4_valid = how do I assign the first 10000 that I skipped?
def group_batch(batch):
    # Wrap each column's list of values so every 32 examples become one row.
    return {k: [v] for k, v in batch.items()}
train_dl = c4_train.map(group_batch, batched=True, batch_size=32)

len() isn’t working on any of these objects.

rwheel · September 1, 2022, 9:59am

Hi @sachin ,

Using your code as example:

data.shape
(Doc: Main classes)

data.train_test_split(test_size=0.1)

For this last question, take a look at this documentation: Process

lhoestq · September 1, 2022, 1:36pm

Hi! In streaming mode you don’t get a Dataset object but an IterableDataset. The length of an iterable dataset can’t be known in advance (e.g. to know the number of examples in a CSV file, you have to download it completely).
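
If the count is really needed, the only general option is to iterate over the stream once and count, which reads the whole dataset. A minimal sketch of that idea, using a plain generator (`fake_stream`, a hypothetical stand-in) rather than a real IterableDataset so that it runs offline:

```python
from typing import Iterable, Iterator


def fake_stream(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streaming dataset: yields examples lazily."""
    for i in range(n):
        yield {"id": i}


def count_examples(stream: Iterable[dict]) -> int:
    """Count examples by consuming the stream once (cost is linear in its size)."""
    return sum(1 for _ in stream)


# A generator defines no __len__, so len() raises TypeError.
try:
    len(fake_stream(5))
    has_len = True
except TypeError:
    has_len = False

n_examples = count_examples(fake_stream(1000))
```

For a dataset the size of the one in the question, a full pass like this is slow, so it is probably only worth doing once and storing the result.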

You can set the first examples to be your validation split this way:

c4_train = data.skip(10000)
c4_valid = data.take(10000)
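
The split above relies on both calls replaying the same seeded stream: with a fixed seed the shuffled order is reproducible, so the first N examples and everything after them never overlap. The sketch below illustrates that reasoning without the library. `make_stream`, `take` and `skip` are hypothetical helpers that only mimic a seeded buffer shuffle over a small range of integers; they are not the library's implementation.

```python
import random
from itertools import islice
from typing import Callable, Iterator, List


def make_stream(n: int = 100, seed: int = 42, buffer_size: int = 20) -> Iterator[int]:
    """Hypothetical seeded buffer shuffle over range(n); same seed, same order."""
    rng = random.Random(seed)
    buffer: List[int] = []
    for item in range(n):
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before yielding anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and replace it with the incoming one
    rng.shuffle(buffer)          # flush what is left once the source is exhausted
    yield from buffer


def take(factory: Callable[[], Iterator[int]], n: int) -> Iterator[int]:
    """First n items of a fresh pass over the stream."""
    return islice(factory(), n)


def skip(factory: Callable[[], Iterator[int]], n: int) -> Iterator[int]:
    """Everything after the first n items of a fresh pass over the stream."""
    return islice(factory(), n, None)


valid = list(take(make_stream, 10))
train = list(skip(make_stream, 10))
```

In this sketch the validation items all come from the first `buffer_size + 10` source items, because a buffer shuffle only mixes examples that are close together in the stream. Whether that matters depends on how the underlying data is ordered.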

rwheel · September 1, 2022, 4:06pm

You’re right! I didn’t realize that he was loading the dataset in streaming mode. Sorry!
