`buffer_size` argument and train/valid splits

I have the following two datasets. I am wondering whether the train and valid datasets are mutually exclusive in this case, considering that `buffer_size` is smaller than the skip/take amounts.

import datasets

data = datasets.load_dataset("liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train")\
        .shuffle(seed=42, buffer_size=10_000)
c4_train = data.skip(100_000)
c4_valid = data.take(100_000)

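Yes, they are mutually exclusive :slight_smile:

The buffer is only used to locally shuffle the data. Both splits are derived from the same seeded shuffled stream: `take(100_000)` returns its first 100,000 examples and `skip(100_000)` returns everything after them, so the size of the buffer has no effect on whether the two splits overlap.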
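
To see why the buffer size does not matter, here is a minimal pure-Python sketch of a shuffle buffer. It is a simplified model and not the actual `datasets` implementation: `buffered_shuffle` and `shuffled_stream` are hypothetical helpers, and the sizes are scaled down (1,000 items, buffer of 10, split at 100) so that it runs instantly.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed):
    """Simplified model of a streaming shuffle buffer (assumption: not the library's code)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        # Buffer is full: emit a random element and store the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: flush the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


def shuffled_stream():
    # Same seed on every call, so every pass over the data yields the same order.
    return buffered_shuffle(range(1_000), buffer_size=10, seed=42)


valid = list(islice(shuffled_stream(), 100))        # plays the role of .take(100)
train = list(islice(shuffled_stream(), 100, None))  # plays the role of .skip(100)

print(len(valid), len(train), set(valid) & set(train))  # 100 900 set()
```

Every item leaves the buffer exactly once, so the first 100 emitted items and the remaining 900 can never share an element, whatever the buffer size is. What the buffer size does change is how far an item can move from its original position: in this model the validation split can only contain items from the first `100 + buffer_size` positions of the source.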