GPT-2 special tokens

theodp · February 7, 2024, 4:29pm

Hello,

I am currently working with GPT-2, fine-tuning a model on a next-token prediction task so that it can generate text from an image at inference time.

During training, I manually add a special token at the beginning of each sentence (BOS) and another at the end (EOS). At inference, I start with the BOS token and let the model generate.

input_ids = [gpt2.bos_token_id] + tokens['input_ids'] + [gpt2.eos_token_id]

However, I realized that gpt2.bos_token_id and gpt2.eos_token_id have the same ID (0: '<|endoftext|>'), so they are the same token. Why is it done this way, rather than with different tokens for BOS and EOS as in BERT?

Moreover, is this a problem for generation at inference?

Thank you!
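For readers following along, here is a minimal sketch of the wrapping step described above and of the next-token pairs it produces. It is an illustration only: `wrap_with_special_tokens`, `ENDOFTEXT_ID`, and the caption IDs are hypothetical names and values. The stock `gpt2` tokenizer reports 50256 for `<|endoftext|>`; the post mentions 0, so the checkpoint used there may differ.

```python
from typing import List

# Assumption: ID of '<|endoftext|>' in the stock `gpt2` vocabulary.
# Other checkpoints may use a different value.
ENDOFTEXT_ID = 50256


def wrap_with_special_tokens(
    token_ids: List[int],
    bos_id: int = ENDOFTEXT_ID,
    eos_id: int = ENDOFTEXT_ID,
) -> List[int]:
    """Prepend BOS and append EOS, mirroring the line in the post."""
    return [bos_id] + list(token_ids) + [eos_id]


# Made-up token IDs standing in for a tokenized caption.
caption_ids = [64, 3797, 319, 257, 2603]
input_ids = wrap_with_special_tokens(caption_ids)

# Next-token prediction: the model sees input_ids[:-1] and is trained
# to predict input_ids[1:], one position to the right.
contexts = input_ids[:-1]
targets = input_ids[1:]

print(input_ids[0] == input_ids[-1])  # BOS and EOS share one ID
print(contexts[0], targets[-1])       # the shared ID opens the input and closes the targets
```

In this sketch the shared ID appears as the first input (where it acts as BOS) and as the last target (where it acts as EOS), so the two roles are separated by position rather than by ID.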

theodp · February 20, 2024, 1:39pm

Update: It is not a problem!
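One way to see why a shared ID can work at inference, sketched under assumptions: GPT-2's pretraining text uses `<|endoftext|>` as a separator between documents, so the same token naturally marks both "a text starts here" and "the text has ended". In a generation loop, the first token is the prompt's BOS and any later occurrence is treated as EOS. The `generate` function and the scripted stand-in model below are hypothetical and are not the Hugging Face API.

```python
from typing import Callable, List

# Assumption: ID of '<|endoftext|>' in the stock `gpt2` vocabulary.
ENDOFTEXT_ID = 50256


def generate(
    next_token: Callable[[List[int]], int],
    bos_id: int,
    eos_id: int,
    max_new_tokens: int = 20,
) -> List[int]:
    """Greedy loop: start from BOS, stop at the first EOS the model emits."""
    sequence = [bos_id]
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == eos_id:
            break  # a later occurrence of the shared ID means "stop"
        sequence.append(token)
    return sequence[1:]  # drop the leading BOS


# Stand-in for a fine-tuned model: replays a fixed script of token IDs.
scripted_output = [64, 3797, 319, ENDOFTEXT_ID, 999]


def scripted_model(sequence: List[int]) -> int:
    return scripted_output[len(sequence) - 1]


generated = generate(scripted_model, bos_id=ENDOFTEXT_ID, eos_id=ENDOFTEXT_ID)
print(generated)
```

The loop never needs to tell BOS and EOS apart by value: position 0 is always the start marker, and the check for EOS only applies to tokens the model produces afterwards.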

