I have a large dataset of long documents that I would like to stream for training. The problem is that my model accepts at most 512 tokens at a time, so I need to partition each document into pieces of 512 tokens.
Is there a way to transform the rows of a HF Dataset on the fly such that one document may be augmented/transformed into several samples at once?
.set_transform() seems to support only one-to-one transformations.
Hi! You can use .map in batched mode to transform a single example into multiple ones. You can find examples of such transforms in our docs and in the course.
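A minimal sketch of such a batched map function (the column name `input_ids` and the assumption that documents are already tokenized are illustrative, not from the original post) — in batched mode, the function may return more rows than it receives:

```python
def chunk_examples(batch, max_len=512):
    """Batched map function: splits each document into windows of at most
    max_len tokens, so one input row can yield several output rows."""
    chunked = []
    for ids in batch["input_ids"]:
        chunked.extend(ids[i : i + max_len] for i in range(0, len(ids), max_len))
    return {"input_ids": chunked}

# With a datasets.Dataset `ds`, this would be applied roughly as:
# ds = ds.map(chunk_examples, batched=True, remove_columns=ds.column_names)
```

Note that `remove_columns` is needed when the output has more rows than the input, since any column not returned by the function would otherwise have a mismatched length.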
.map() does so ahead of time and not on the fly, correct?
Since the number of input examples needs to match the number of output examples in set_transform, doing this on the fly would require partitioning each document into a list of strings/tokens inside the transform, and then doing some postprocessing while iterating over the transformed dataset to turn those lists of strings/tokens into individual examples before finally passing them to the model.
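A sketch of that two-step pattern (function and column names are illustrative, not part of the datasets API): the transform keeps a one-to-one row count by nesting the chunks, and a small generator flattens them while iterating:

```python
def chunking_transform(batch, max_len=512):
    # One row in, one row out: each document becomes a *list* of chunks,
    # so the row count matches what set_transform expects.
    return {
        "chunks": [
            [ids[i : i + max_len] for i in range(0, len(ids), max_len)]
            for ids in batch["input_ids"]
        ]
    }

def flatten_chunks(rows):
    # Postprocessing while iterating: turn each nested list of chunks
    # back into individual examples before they reach the model.
    for row in rows:
        for chunk in row["chunks"]:
            yield {"input_ids": chunk}
```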
Okay, thanks for the response. So, in short, there is no way to do this with the Hugging Face datasets library. It feels like it should not be too hard, however, to write a batched extension to .set_transform(), and I could perhaps give it a go. Any starting points, or reasons not to try?
You can define a transform that outputs more examples than it gets, but then indexing a single example (dataset[i]) will not behave as expected, since we don't know the "example offsets". This shouldn't be a problem if you only read examples in a batched manner.
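To illustrate the offsets problem with a plain-Python stand-in (no datasets API involved): once a transform expands a batch, row i of the output no longer corresponds to row i of the underlying storage, whereas transforming a fetched slice stays well defined:

```python
docs = [list(range(40)), list(range(10)), list(range(25))]

def expanding_transform(batch, max_len=16):
    # Expands a batch of documents into a variable number of chunk rows.
    out = []
    for ids in batch:
        out.extend(ids[i : i + max_len] for i in range(0, len(ids), max_len))
    return out

# Batched access is fine: transform whatever slice was fetched.
batch = expanding_transform(docs[0:2])  # doc 0 -> 3 chunks, doc 1 -> 1 chunk
# But "row 2 of the transformed dataset" cannot be resolved from index 2 of
# storage alone: without precomputed offsets there is no way to know which
# document (and which chunk within it) that row comes from.
```

Note also that the number of rows produced varies from slice to slice, which is exactly the uneven-batch-size issue raised below.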
But then I end up with uneven batch sizes, right? That would make the memory requirements of my training loop unpredictable.