How to properly tokenize and pack sequences with EOS token handling for GPT-2 fine-tuning in Hugging Face Transformers?

I’m working on fine-tuning GPT-2 using the Hugging Face Transformers library. I need to tokenize a set of input sequences, append the EOS token, and pack these sequences into batches without exceeding a specified max_length. Additionally, I need to ensure the EOS tokens are correctly handled to avoid padding issues or misinterpretation by the model.

Here’s what I’m trying to achieve:

    • Tokenization: Tokenize input sequences, appending the EOS token at the end of each sequence.
    • Sequence Packing: Pack these tokenized sequences into batches, ensuring the total length doesn’t exceed the max_length.
    • EOS Handling: Properly handle EOS tokens to ensure they are correctly recognized and don’t cause padding issues.
    • I’ve also read that some tokenizers might prepend an underscore or spaces, or handle special tokens differently, so I want to make sure I’m handling these edge cases correctly (see the quick comparison sketch below).
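
For context on that last point, here’s the quick check I’ve been using to compare appending the EOS as a string versus appending its id directly (a minimal sketch; the variable names are just illustrative). For GPT-2’s byte-level BPE both should agree, but I understand SentencePiece-style tokenizers may behave differently:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "This is the first sentence."

# Option A: append the EOS token as a string and let the tokenizer split it out.
ids_a = tokenizer.encode(text + tokenizer.eos_token)

# Option B: tokenize the raw text, then append the EOS id directly.
ids_b = tokenizer.encode(text) + [tokenizer.eos_token_id]

print(ids_a == ids_b)                       # expected True for GPT-2, since <|endoftext|> is a special token
print(ids_a[-1] == tokenizer.eos_token_id)  # EOS should be the final id either way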

Full Working Example

Below is a self-contained example using GPT-2:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from itertools import chain

# Initialize GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Example dataset
dataset = [
    "This is the first sentence.",
    "This is the second sentence, which is a bit longer.",
    "And here is the third sentence."
]

max_length = 20  # Max sequence length for each batch

def group_texts(examples, block_size):
    """
    Concatenate texts, split into chunks of block_size, and ensure EOS token handling.
    """
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size  # round down to block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

def tokenize_and_pack(dataset, tokenizer, max_length):
    """
    Tokenize sequences, append EOS token, and pack into batches.
    """
    tokenized_sequences = []
    current_batch = []
    current_length = 0

    for string in dataset:
        # Tokenize and append EOS token
        tokenized_string = tokenizer.encode(string + tokenizer.eos_token)
        token_length = len(tokenized_string)

        if current_length + token_length > max_length:
            # Start a new batch if adding this sequence would exceed max_length
            if current_batch:  # guard against appending an empty first batch
                tokenized_sequences.append(current_batch)
            current_batch = tokenized_string
            current_length = token_length
        else:
            # Add to the current batch
            current_batch += tokenized_string
            current_length += token_length

    # Add the last batch
    if current_batch:
        tokenized_sequences.append(current_batch)

    return tokenized_sequences

# Tokenize and pack the dataset
tokenized_batches = tokenize_and_pack(dataset, tokenizer, max_length)

# Convert to tensors for model input
tokenized_batches_tensors = [torch.tensor(batch) for batch in tokenized_batches]

# Example output
for i, batch in enumerate(tokenized_batches_tensors):
    print(f"Batch {i + 1}: {batch.tolist()}")

Questions:

  1. Is this the correct approach to tokenizing and packing sequences for fine-tuning GPT-2?
  2. How can I ensure that EOS tokens are correctly handled, especially when sequences are split across batches?
  3. Are there any best practices in the Hugging Face ecosystem for dealing with sequence packing and EOS token placement, particularly when fine-tuning on large datasets?

I want to make sure that I’m not inadvertently causing issues with the EOS token placement or sequence packing, which could lead to suboptimal fine-tuning results.

Any advice or improvements would be greatly appreciated!

I’m trying to improve on this solution by packing more sequences and putting the EOS in the right place: machine learning - How does one set the pad token correctly (not to eos) during fine-tuning to avoid model not predicting EOS? - Stack Overflow

Cross-posted: huggingface - How to properly tokenize and pack sequences with EOS token handling for GPT-2 fine-tuning in Hugging Face Transformers? - Stack Overflow

Hey! You can also use TRL’s SFTTrainer, which handles packing for you 🙂

For more details, refer to the docs: Supervised Fine-tuning Trainer
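
If it helps, here’s roughly what that looks like (a minimal sketch, not a full recipe: the output_dir and the tiny in-memory dataset are placeholders, and the exact argument names, packing / max_seq_length / dataset_text_field, have moved between SFTTrainer and SFTConfig across TRL versions, so check the SFT docs for the version you have installed):

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Placeholder text dataset with a "text" column.
train_dataset = Dataset.from_dict({"text": [
    "This is the first sentence.",
    "This is the second sentence, which is a bit longer.",
    "And here is the third sentence.",
]})

config = SFTConfig(
    output_dir="gpt2-sft",        # placeholder output path
    max_seq_length=1024,
    packing=True,                 # pack examples into fixed-length chunks
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model="gpt2",                 # SFTTrainer also accepts an already-loaded model
    train_dataset=train_dataset,
    args=config,
)
trainer.train()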


I will check it out! (I was also told about it yesterday!) But how does it handle the edge cases where the <eos> might not be tokenized correctly when it’s next to a space, for example? I was anecdotally told this can be a severe issue.

I’d personally add an assert to the tokenization function passed to ds.map, checking that the eos_token_id does appear in the “right” place.
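
Something along these lines (a minimal sketch, assuming a datasets Dataset with a "text" column and the GPT-2 tokenizer; the column name and the exact checks are illustrative, and you’d relax the uniqueness check if your documents can legitimately contain the EOS string):

from datasets import Dataset
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Placeholder dataset with a "text" column.
ds = Dataset.from_dict({"text": [
    "This is the first sentence.",
    "And here is the third sentence.",
]})

def tokenize_with_eos_check(example):
    # Append EOS as an id rather than as text, so surrounding whitespace
    # can't change how the special token gets tokenized.
    ids = tokenizer(example["text"])["input_ids"] + [tokenizer.eos_token_id]
    # Sanity checks: EOS is the last id and appears exactly once.
    assert ids[-1] == tokenizer.eos_token_id
    assert ids.count(tokenizer.eos_token_id) == 1
    return {"input_ids": ids}

ds = ds.map(tokenize_with_eos_check, remove_columns=["text"])
print(ds[0]["input_ids"][-1])  # should equal tokenizer.eos_token_id (50256 for GPT-2)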