I have a large dataset of long documents that I would like to stream for training. The problem is that my model accepts at most 512 tokens at a time, so I need to partition each document into pieces of 512 tokens.
Is there a way to transform the rows of a HF Dataset on the fly such that one document may be augmented/transformed into several samples at once?
.set_transform() seems to support only one-to-one transformations.
Hi! You can use .map in batched mode to transform a single example into multiple ones. You can find examples of such transforms in our docs and in the course.
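A minimal sketch of such a batched map function (the column name `input_ids` and the assumption that documents are already tokenized are illustrative, not from the original post) — in batched mode, the function may return more rows than it receives:

```python
def chunk_examples(batch, max_len=512):
    """Batched map function: splits each document into windows of at most
    max_len tokens, so one input row can yield several output rows."""
    chunked = []
    for ids in batch["input_ids"]:
        chunked.extend(ids[i : i + max_len] for i in range(0, len(ids), max_len))
    return {"input_ids": chunked}

# With a datasets.Dataset `ds`, this would be applied roughly as:
# ds = ds.map(chunk_examples, batched=True, remove_columns=ds.column_names)
```

Note that `remove_columns` is needed when the output has more rows than the input, since any column not returned by the function would otherwise have a mismatched length.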
.map() does so ahead of time and not on the fly, correct?
Since the number of input examples needs to match the number of output examples in set_transform, doing this on the fly would require partitioning each document into a list of strings/tokens inside the transform, and then doing some postprocessing while iterating over the transformed dataset to turn those lists of strings/tokens into individual examples before finally passing them to the model.
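A sketch of that two-step pattern (function and column names are illustrative, not part of the datasets API): the transform keeps a one-to-one row count by nesting the chunks, and a small generator flattens them while iterating:

```python
def chunking_transform(batch, max_len=512):
    # One row in, one row out: each document becomes a *list* of chunks,
    # so the row count matches what set_transform expects.
    return {
        "chunks": [
            [ids[i : i + max_len] for i in range(0, len(ids), max_len)]
            for ids in batch["input_ids"]
        ]
    }

def flatten_chunks(rows):
    # Postprocessing while iterating: turn each nested list of chunks
    # back into individual examples before they reach the model.
    for row in rows:
        for chunk in row["chunks"]:
            yield {"input_ids": chunk}
```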
Okay, thanks for the response. So, in short, there is no way to do this with the Hugging Face datasets library. It feels like it should not be too hard, however, to write a batched extension to .set_transform(), and I could perhaps give it a go. Any starting points, or reasons not to try?
You can define a transform that outputs more examples than it gets, but then indexing a single example (dataset[i]) will not behave as expected, since we don't know the "example offsets". This shouldn't be a problem if you only read examples in a batched manner.
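To illustrate the offsets problem with a plain-Python stand-in (no datasets API involved): once a transform expands a batch, row i of the output no longer corresponds to row i of the underlying storage, whereas transforming a fetched slice stays well defined:

```python
docs = [list(range(40)), list(range(10)), list(range(25))]

def expanding_transform(batch, max_len=16):
    # Expands a batch of documents into a variable number of chunk rows.
    out = []
    for ids in batch:
        out.extend(ids[i : i + max_len] for i in range(0, len(ids), max_len))
    return out

# Batched access is fine: transform whatever slice was fetched.
batch = expanding_transform(docs[0:2])  # doc 0 -> 3 chunks, doc 1 -> 1 chunk
# But "row 2 of the transformed dataset" cannot be resolved from index 2 of
# storage alone: without precomputed offsets there is no way to know which
# document (and which chunk within it) that row comes from.
```

Note also that the number of rows produced varies from slice to slice, which is exactly the uneven-batch-size issue raised below.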
But then I end up with uneven batch sizes, right? That would make the memory requirements of my training loop unpredictable.