I’m exploring using streaming datasets with a function that preprocesses the text, tokenizes it into training samples, and then applies some noise to the input_ids (à la BART pretraining). It seems to be working really well, and saves a huge amount of disk space compared to downloading a dataset like OSCAR locally.
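For context, the setup looks something like this (the OSCAR config name here is just an example):

```python
from datasets import load_dataset

# streaming=True fetches examples lazily instead of downloading
# the whole corpus to disk first.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en",
                       split="train", streaming=True)
```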
Since a lot of the examples in OSCAR are much longer than my model’s max size, I’ve been truncating each example to the final whitespace at the end of the first model-size chunk, and throwing away a ton of data. Not the end of the world, but it feels… wasteful.
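Concretely, the truncation I’m doing is something like this rough sketch, where `tokenizer` and `max_length` are whatever the model uses:

```python
def truncate_to_whitespace(text, tokenizer, max_length):
    # Keep only the first model-size chunk of tokens...
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    chunk = tokenizer.decode(ids, skip_special_tokens=True)
    # ...then back off to the final whitespace so a word isn't cut in half.
    # Everything after that point in the original document gets discarded.
    cut = chunk.rfind(" ")
    return chunk[:cut] if cut > 0 else chunk
```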
I took a look at how `MappedExamplesIterable` handles batching, and I had a realization. Since `__iter__` fetches a batch from the dataset and then just yields each output of the mapped function, there’s no reason the number of processed results needs to be the same as the batch size, right?
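If I’m reading it right, the contract is just that the mapped function takes a dict of lists and returns a dict of lists, and the rows get yielded one by one. So a toy function like this (not the library code, just an illustration) would be perfectly legal:

```python
def explode(batch):
    # {"text": [a, b]} in -> {"text": [a, a, b, b]} out: twice as many
    # rows as the input batch, and downstream code just sees a longer
    # stream of individual examples.
    return {"text": [t for t in batch["text"] for _ in range(2)]}
```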
The preprocessing function could split the longer examples into smaller chunks, and each batch could yield any number of processed examples. It looks like the only thing `batch_size` is used for is pulling chunks of data from the cloud, and nothing downstream cares how many examples are returned, because they’re yielded one at a time. So a batch with `batch_size=100` could have 100, or 110, or 3000, or however many examples.
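The map I have in mind would look roughly like this; `tokenizer` and `max_length` stand in for my model’s tokenizer and context size, I’ve left out the BART-style noising, and this assumes a `datasets` version where `IterableDataset.map` accepts `remove_columns`:

```python
def chunk_examples(batch):
    # One long document in -> several model-size training samples out.
    out = {"input_ids": []}
    for text in batch["text"]:
        ids = tokenizer(text)["input_ids"]
        for start in range(0, len(ids), max_length):
            out["input_ids"].append(ids[start:start + max_length])
    return out

# batch_size only controls how much raw data is pulled per fetch;
# the number of examples coming out of each batch can be anything.
chunked = dataset.map(chunk_examples, batched=True, batch_size=100,
                      remove_columns=["text"])
```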
The only downside I see is not knowing how many total examples I’ll have to work with. But with a streaming dataset, I have to train with a predefined `max_steps` anyway, so that doesn’t seem so bad.
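For reference, I mean something like this on the Trainer side (the numbers are just placeholders):

```python
from transformers import TrainingArguments

# An IterableDataset has no __len__, so the Trainer can't derive the
# number of steps from num_train_epochs; max_steps must be set explicitly.
args = TrainingArguments(
    output_dir="bart-pretrain",
    max_steps=100_000,
    per_device_train_batch_size=8,
)
```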
Am I understanding this correctly?