Splitting dataset via length

So there are a few parts to this question.

  1. How do I find the length of entire dataset?
  2. How do I shuffle the dataset and assign the first N to valid and the rest to train?

Here is a code snippet with what I am trying to do:

import datasets

data = datasets.load_dataset("liweili/c4_200m", cache_dir="/kaggle/working/", streaming=True, split="train")\
        .shuffle(seed=42, buffer_size=10_000)
c4_train = data.skip(10000)
# c4_valid = how do I assign the first 10000 that I skipped?

def group_batch(batch):
    # Wrap each column's list of values in an outer list so a batch becomes one example
    return {k: [v] for k, v in batch.items()}

train_dl = c4_train.map(group_batch, batched=True, batch_size=32)

len() isn’t working on any of these objects.

Hi @sachin ,

Using your code as example:

data.shape
(Doc: Main classes)

data.train_test_split(test_size=0.1)

For this last question, take a look at this documentation: Process

Hi! In streaming mode you don’t get a Dataset object but an IterableDataset. The length of an iterable dataset can’t be known in advance (e.g. to know the number of examples in a CSV file you have to download it completely), which is why len() fails on these objects.
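If you really do need the count, one general option is a full pass over the stream. Here is a minimal sketch in plain Python; fake_stream and count_examples are hypothetical names, and the generator only stands in for a streamed dataset, it is not part of the datasets API:

```python
from typing import Iterable, Iterator


def fake_stream(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset: yields examples lazily."""
    for i in range(n):
        yield {"text": f"example {i}"}


def count_examples(stream: Iterable) -> int:
    """Count examples by consuming the stream once (costs a full pass)."""
    return sum(1 for _ in stream)


# A lazy stream has no length up front, so counting means iterating over it.
n_examples = count_examples(fake_stream(1_000))
```

On a dataset as large as C4 this means reading everything once, so it is probably only worth doing if the number is needed later, for example to set a step budget.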

You can set the first examples to be your validation split this way:

c4_train = data.skip(10000)
c4_valid = data.take(10000)

You’re right! I didn’t realize that he was loading the dataset in streaming mode. Sorry!