I’m working on fine-tuning GPT-2 using the Hugging Face Transformers library. I need to tokenize a set of input sequences, append the EOS token, and pack these sequences into batches without exceeding a specified max_length. Additionally, I need to ensure the EOS tokens are correctly handled to avoid padding issues or misinterpretation by the model.
Here’s what I’m trying to achieve:
- Tokenization: Tokenize input sequences, appending the EOS token at the end of each sequence.
- Sequence Packing: Pack these tokenized sequences into batches, ensuring the total length doesn’t exceed the specified max_length.
- EOS Handling: Properly handle EOS tokens to ensure they are correctly recognized and don’t cause padding issues.
I’ve also read that some tokenizers prepend an underscore or spaces, or handle special tokens differently, so I want to make sure I’m handling these edge cases correctly; a quick sanity check like the sketch below shows what the GPT-2 tokenizer actually does.
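This is just a throwaway check, not part of the training code: it compares appending tokenizer.eos_token as text against appending tokenizer.eos_token_id directly, and shows how GPT-2 marks leading spaces.

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
text = "This is the first sentence."

with_eos_string = tok.encode(text + tok.eos_token)   # EOS appended as a string
with_eos_id = tok.encode(text) + [tok.eos_token_id]  # EOS appended as an id

print(tok.convert_ids_to_tokens(with_eos_string))
print(with_eos_string == with_eos_id)                # should be True for GPT-2
print(tok.tokenize(" leading space"))                # GPT-2 marks spaces with "Ġ"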
Full Working Example
Below is a self-contained example using GPT-2:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from itertools import chain
# Initialize GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Example dataset
dataset = [
"This is the first sentence.",
"This is the second sentence, which is a bit longer.",
"And here is the third sentence."
]
max_length = 20 # Max sequence length for each batch
def group_texts(examples, block_size):
    """
    Concatenate texts, split into chunks of block_size, and ensure EOS token handling.
    """
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size  # round down to block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
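# Hypothetical usage sketch (not part of the original snippet): group_texts expects a
# dict of already-tokenized, batched examples, e.g. as produced by
# datasets.Dataset.map(tokenize_fn, batched=True). A minimal stand-in would be:
#   tokenized = {"input_ids": [tokenizer.encode(s + tokenizer.eos_token) for s in dataset]}
#   packed = group_texts(tokenized, block_size=max_length)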
def tokenize_and_pack(dataset, tokenizer, max_length):
    """
    Tokenize sequences, append EOS token, and pack into batches.
    """
    tokenized_sequences = []
    current_batch = []
    current_length = 0
    for string in dataset:
        # Tokenize and append EOS token
        tokenized_string = tokenizer.encode(string + tokenizer.eos_token)
        token_length = len(tokenized_string)
        if current_length + token_length > max_length:
            # Start a new batch if adding the string would exceed max_length
            if current_batch:  # avoid emitting an empty batch on the first iteration
                tokenized_sequences.append(current_batch)
            current_batch = tokenized_string
            current_length = token_length
        else:
            # Add to the current batch
            current_batch += tokenized_string
            current_length += token_length
    # Add the last batch
    if current_batch:
        tokenized_sequences.append(current_batch)
    return tokenized_sequences
# Tokenize and pack the dataset
tokenized_batches = tokenize_and_pack(dataset, tokenizer, max_length)
# Convert to tensors for model input
tokenized_batches_tensors = [torch.tensor(batch) for batch in tokenized_batches]
# Example output
for i, batch in enumerate(tokenized_batches_tensors):
    print(f"Batch {i + 1}: {batch.tolist()}")
Questions:
- Is this the correct approach to tokenizing and packing sequences for fine-tuning GPT-2?
- How can I ensure that EOS tokens are correctly handled, especially when sequences are split across batches?
- Are there any best practices in the Hugging Face ecosystem for dealing with sequence packing and EOS token placement, particularly when fine-tuning on large datasets?
I want to make sure that I’m not inadvertently causing issues with the EOS token placement or sequence packing, which could lead to suboptimal fine-tuning results.
Any advice or improvements would be greatly appreciated!
I’m trying to improve on the approach from this related thread by packing more sequences and putting the EOS token in the right place: “machine learning - How does one set the pad token correctly (not to eos) during fine-tuning to avoid model not predicting EOS? - Stack Overflow”.
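For context, the approach that thread’s title refers to is usually sketched along these lines (the pad token string here is an arbitrary choice of mine, not taken from that answer):

# Give GPT-2 a dedicated pad token instead of reusing EOS, so the model still
# learns to predict EOS at sequence ends; then resize the embeddings to match.
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id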