`buffer_size` argument and train/valid splits

sachin · September 3, 2022, 9:39am

I have the following two datasets. I am wondering whether the train and valid datasets are mutually exclusive in this case, considering that buffer_size is smaller than the skip/take amounts.

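import datasets

data = datasets.load_dataset(
    "liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train"
).shuffle(seed=42, buffer_size=10_000)  # approximate shuffle through a 10,000-example buffer
c4_train = data.skip(100_000)  # everything after the first 100,000 shuffled examples
c4_valid = data.take(100_000)  # the first 100,000 shuffled examples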

lhoestq · September 5, 2022, 8:55am

Yes, they are exclusive.

The buffer is only used to shuffle the data locally.
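
To see why the buffer size does not affect exclusivity, here is a small pure-Python simulation. It is only a sketch and not the actual `datasets` implementation: `buffer_shuffle` and `shuffled_stream` are hypothetical helpers, and `itertools.islice` stands in for `.take()` and `.skip()`. It assumes both splits read the same shuffled stream with the same seed, as in the snippet in the question.

```python
import random
from itertools import islice


def buffer_shuffle(source, buffer_size, seed):
    """Hypothetical stand-in for a buffer-based (local) shuffle."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything.
            buffer.append(item)
            continue
        # Yield a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def shuffled_stream():
    # Same seed on every call, so the order is identical each time.
    return buffer_shuffle(range(1_000), buffer_size=50, seed=42)


n_valid = 100  # deliberately larger than buffer_size, as in the question
valid = list(islice(shuffled_stream(), n_valid))        # like .take(n_valid)
train = list(islice(shuffled_stream(), n_valid, None))  # like .skip(n_valid)

overlap = set(valid) & set(train)
print(len(valid), len(train), len(overlap))
```

Each source example is yielded exactly once, so the first `n_valid` items and the remaining items cannot overlap, whatever the buffer size is. The buffer only limits how far an example can move from its original position: in this sketch, `valid` is drawn entirely from the first `n_valid + buffer_size` source rows.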
