How to properly tokenize and pack sequences with EOS token handling for GPT-2 fine-tuning in Hugging Face Transformers?

I will check it out! (and was told about it yesterday!) But how does it handle the edge cases where the <eos> might not be tokenized correctly when it's next to a space, for example? I was anecdotally told this can be a severe issue.
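
One quick way to see whether this bites you is to tokenize a few variants directly and check whether the EOS id survives. A minimal sketch for GPT-2's tokenizer (the example strings are just made up for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
eos = tok.eos_token        # "<|endoftext|>" for GPT-2
eos_id = tok.eos_token_id  # 50256

# Compare EOS glued to the text, separated by a space, and mid-string.
for text in [f"Hello world{eos}", f"Hello world {eos}", f"Hello world{eos} Next doc"]:
    ids = tok(text)["input_ids"]
    print(repr(text), "->", ids, "| eos id present:", eos_id in ids)
```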

I'd personally add an assert to the tokenization function passed to ds.map, checking that the eos_id actually appears in the "right" place.
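
Something along these lines, assuming a plain text column and GPT-2's tokenizer (the dataset name and column are only placeholders for the example):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
eos_id = tok.eos_token_id

def tokenize(batch):
    # Append EOS to every document before tokenizing.
    out = tok([t + tok.eos_token for t in batch["text"]])
    # Sanity check: each example must end with the real EOS id;
    # if the EOS string got split/merged, this fails loudly.
    for ids in out["input_ids"]:
        assert ids and ids[-1] == eos_id, f"EOS not tokenized correctly: {ids[-10:]}"
    return out

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
```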