How do I shuffle the dataset and assign the first N examples to validation and the rest to train?
Here is a code snippet with what I am trying to do:
import datasets

data = datasets.load_dataset("liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train")\
    .shuffle(seed=42, buffer_size=10_000)
c4_train = data.skip(10000)
# c4_valid = how do I assign the first 10000 that I skipped?
def group_batch(batch):
    return {k: [v] for k, v in batch.items()}
train_dl = c4_train.map(group_batch, batched=True, batch_size=32)
Hi! In streaming mode you don’t get a Dataset object but an IterableDataset. The length of an iterable dataset can’t be known in advance (e.g. to know the number of examples in a CSV file you have to download it completely).
You can set the first examples to be your validation split this way:
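A minimal sketch of that split, assuming the streamed dataset exposes `take` and `skip` the way `IterableDataset` does. The helper name `split_first_n` is made up for illustration and is not part of the `datasets` library:

```python
def split_first_n(stream, n_valid):
    """Split a streamed dataset into (valid, train).

    `stream` is assumed to be an object with `take(n)` and `skip(n)`
    methods, such as a `datasets.IterableDataset`. `split_first_n` is a
    hypothetical helper, not a library function.
    """
    if n_valid < 0:
        raise ValueError("n_valid must be non-negative")
    # take(n) yields only the first n examples of the stream.
    valid = stream.take(n_valid)
    # skip(n) yields everything after the first n examples.
    train = stream.skip(n_valid)
    return valid, train
```

With the shuffled stream from the question this would be `c4_valid, c4_train = split_first_n(data, 10_000)`. Both calls are lazy, so nothing is downloaded until you iterate. Because `shuffle` is applied with a fixed seed before the split, the two splits should be cut from the same shuffled order.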