BPE tokenizers and spaces before words

@boris, please, I need your help with the special token <|endoftext|>.
I think my question, at least the part about the token, is relevant to what has been described above.
I would like someone to clarify or disprove the following. I have found a pretrained GPT-2 model for the Greek language on Hugging Face, named nikokons/gpt2-greek, and I want to fine-tune it on my custom dataset. My dataset consists of samples of mathematical definitions with related questions, written in Greek. Let me give some translated examples:

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What are supplementary angles?

Definition: Two angles are called supplementary angles when they sum up to 180 degrees.
Question: What do we call two angles which sum up to 180 degrees?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What is an isosceles triangle?

Definition: A triangle is called isosceles when it has two sides of equal length.
Question: What do we call a triangle which has two sides of equal length?

Notice that a single Definition may have multiple Questions in my dataset. I want to fine-tune the model so that it learns to answer the user’s question by replying with the entire Definition related to that question.

What are the steps I should follow?
Should I first fine-tune the model on the raw dataset (I mean the dataset without special tokens) so that it learns the new terminology, then preprocess the dataset to add the
<|endoftext|> token at the beginning and at the end of each sample, and fine-tune the model again on the preprocessed dataset?

Should the processed training dataset look like the following, without starting with a space, as you suggested?

<|endoftext|>A triangle is called isosceles when it has two sides of equal length. What is an isosceles triangle?<|endoftext|>Two angles are called supplementary angles when they sum up to 180 degrees. What do we call two angles which sum up to 180 degrees?<|endoftext|>
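
To make sure I am describing the preprocessing correctly, here is a minimal sketch of what I have in mind (the sample strings are just the translated examples from above, and the way I join them with the special token is my assumption, not something I have verified):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nikokons/gpt2-greek")
eos = tokenizer.eos_token  # "<|endoftext|>" for GPT-2-style tokenizers

# Translated examples; the real samples are in Greek.
samples = [
    "A triangle is called isosceles when it has two sides of equal length. "
    "What is an isosceles triangle?",
    "Two angles are called supplementary angles when they sum up to 180 degrees. "
    "What do we call two angles which sum up to 180 degrees?",
]

# Wrap every sample with the special token and no leading space, so consecutive
# samples share a single <|endoftext|> delimiter.
training_text = eos + eos.join(samples) + eos
print(training_text)
```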

Also, should I use right padding (padding_side="right") when tokenizing the samples, or is there no need for that, since from what I have read GPT-2 can handle sequences of various lengths?
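
For reference, this is the tokenizer setup my padding question refers to (only a sketch; I do not know whether reusing <|endoftext|> as the pad token or padding on the right is actually the correct choice here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nikokons/gpt2-greek")

# GPT-2 tokenizers have no pad token by default; reusing the eos token is one workaround.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # the setting my question is about

batch = tokenizer(
    [
        "What is an isosceles triangle?",
        "What do we call two angles which sum up to 180 degrees?",
    ],
    padding=True,  # pad every sample to the longest one in the batch
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```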

It would be very helpful if I could find a complete example (including how to process the dataset) for fine-tuning GPT-2 on QnA or chat. Basically, I don’t know whether the task I described earlier falls under conversational chat or QnA.
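
To make clearer what kind of complete example I am hoping for, here is a rough sketch of what I imagine it might look like (the file name, the hyperparameters, and the choice of TextDataset are my guesses from the documentation, not a setup I know to be correct for my data):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

model_name = "nikokons/gpt2-greek"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# "train.txt" is a placeholder for the <|endoftext|>-delimited training file.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# mlm=False gives plain causal language modelling, which is what GPT-2 uses.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-greek-definitions",  # placeholder
    num_train_epochs=3,                   # placeholder
    per_device_train_batch_size=4,        # placeholder
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```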