Packing multiple samples into context window

Rong-Tao · December 25, 2023, 7:01am

Dear Hugging Face community,
I am looking for a way to pack multiple samples into a single sequence in a more dynamic way.
As stated in the GPT-3 paper:

During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
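For illustration, the scheme in the quote could be sketched roughly as follows. This is not OpenAI's code, just a minimal pure-Python sketch under my own assumptions: `pack_with_eos`, `eos_id` and `n_ctx` are hypothetical names, documents are already tokenized into lists of ids, and the leftover tail shorter than `n_ctx` is simply dropped.

```python
from typing import Iterable, List


def pack_with_eos(docs: Iterable[List[int]], eos_id: int, n_ctx: int) -> List[List[int]]:
    """Concatenate tokenized documents, delimit them with eos_id,
    and cut the resulting stream into sequences of exactly n_ctx tokens."""
    sequences: List[List[int]] = []
    buffer: List[int] = []
    for doc in docs:
        # Append the document followed by the end-of-text delimiter.
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit full windows as soon as enough tokens are buffered.
        while len(buffer) >= n_ctx:
            sequences.append(buffer[:n_ctx])
            buffer = buffer[n_ctx:]
    # Assumption: the final partial window is discarded rather than padded.
    return sequences


example = pack_with_eos([[1, 2, 3], [4, 5], [6]], eos_id=0, n_ctx=4)
```

Note that a document can be split across two sequences here, and no attention masking between documents is applied, which matches what the quote describes.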

Currently I just brute-force truncate or pad every sample to max_length:

def tokenize_function(examples):
    text_column_name = 'text'
    return tokenizer(
        examples[text_column_name],
        padding = 'max_length',
        truncation = True,
        max_length = 512
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched = True,
    num_proc = 24
)

I would like to improve on that.
I wonder if there is an 'elegant' solution in Hugging Face?
I found one approach in group_texts() in run_mlm.py: it basically concatenates the samples in a batch and treats them as one long sequence. However, I think this is a bit data-inefficient, since we could be throwing away a lot of data this way. Is there an already implemented policy for this? For example, pack until the next sample does not fit, or keep a pool and assemble sequences from the pool to get even better results?
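To make the two policies concrete, here is a rough sketch of what I have in mind. It is plain Python on lists of token ids, not an existing Hugging Face API: `pack_greedy` and `pack_from_pool` are hypothetical names, `eos_id` is whatever end-of-text token the tokenizer uses, and truncating over-long samples to `max_len - 1` tokens is my own assumption. The pool variant uses a simple first-fit-decreasing heuristic, which is only one of several possible choices.

```python
from typing import List


def _with_eos(sample: List[int], max_len: int, eos_id: int) -> List[int]:
    # Assumption: samples longer than max_len - 1 are truncated so the
    # delimiter always fits in one sequence.
    return sample[: max_len - 1] + [eos_id]


def pack_greedy(samples: List[List[int]], max_len: int, eos_id: int) -> List[List[int]]:
    """Pack samples in order; start a new sequence when the next one does not fit."""
    packed: List[List[int]] = []
    current: List[int] = []
    for sample in samples:
        piece = _with_eos(sample, max_len, eos_id)
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current = current + piece
    if current:
        packed.append(current)
    return packed


def pack_from_pool(samples: List[List[int]], max_len: int, eos_id: int) -> List[List[int]]:
    """Treat the samples as a pool: sort by length (longest first) and put
    each one into the first sequence that still has room."""
    pieces = sorted((_with_eos(s, max_len, eos_id) for s in samples), key=len, reverse=True)
    bins: List[List[int]] = []
    for piece in pieces:
        for b in bins:
            if len(b) + len(piece) <= max_len:
                b.extend(piece)
                break
        else:
            # No existing sequence has room, so open a new one.
            bins.append(list(piece))
    return bins


samples = [[1, 1, 1], [2, 2, 2], [3], [4]]
greedy = pack_greedy(samples, max_len=6, eos_id=0)
pooled = pack_from_pool(samples, max_len=6, eos_id=0)
```

On this toy input the in-order policy needs three sequences while the pool policy needs two. In both cases no sample is split, so the resulting sequences would still need padding up to max_len, just far less than padding each sample individually.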

lhoestq · January 12, 2024, 2:46pm

I know this was asked in a transformers issue, but it hasn't received any responses so far: Input Packing · Issue #17726 · huggingface/transformers · GitHub
