I have the following two datasets. Are the train and validation datasets mutually exclusive in this case, given that buffer_size is smaller than the skip/take amounts?
import datasets

data = datasets.load_dataset(
    "liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train"
).shuffle(seed=42, buffer_size=10_000)
c4_train = data.skip(100_000)
c4_valid = data.take(100_000)
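
To make the question concrete, here is a minimal pure-Python sketch of how I understand buffered shuffling to behave. It is not the actual `datasets` implementation: `buffered_shuffle`, `shuffled_stream`, and the small sizes `N`, `BUFFER`, and `SPLIT` are hypothetical stand-ins, and it assumes that a fixed seed makes both iterations replay the shuffled stream in exactly the same order.

```python
import random
from itertools import islice


def buffered_shuffle(items, buffer_size, seed):
    """Hypothetical stand-in for a streaming shuffle with a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


N, BUFFER, SPLIT = 1_000, 10, 100  # scaled-down, made-up sizes


def shuffled_stream():
    # Same seed on every call, so every iteration sees the same order (assumption).
    return buffered_shuffle(range(N), BUFFER, seed=42)


valid = set(islice(shuffled_stream(), SPLIT))        # analogous to take(SPLIT)
train = set(islice(shuffled_stream(), SPLIT, None))  # analogous to skip(SPLIT)

print(valid.isdisjoint(train), len(valid), len(train))
```

In this sketch the two sets never overlap, because each source element is emitted exactly once and both passes see the same order; the buffer size only limits how far an element can move from its original position. Whether the real streaming dataset replays the same order for both `skip` and `take` is exactly the part I am unsure about.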