Training GPT-2 from scratch

Hello!

I’m currently working on a toy project that uses GPT-2 (the smallest variant, but with only 6 layers, trained from scratch) to predict the next token in programming-language source code. My dataset is therefore all source code, I’m using a custom tokenizer, and I have the following questions (a rough sketch of my model setup is included after them):

  1. If a sample is longer than 1024 tokens (supposing the model’s max length is 1024), are the past tokens automatically fed back to the model during training, or should I do that myself?
  2. My custom tokenizer works well (in my opinion), but I want to use the Hugging Face API to take advantage of the “fast” tokenizers. How do I go about subclassing the Tokenizer class so that my tokenizer is compatible with Hugging Face’s tokenizer API?
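
For context, this is roughly how I’m building the model (just a sketch; the vocabulary size is a placeholder for whatever my custom tokenizer produces):

```python
# Rough sketch of the setup: GPT-2-small width, but only 6 transformer layers,
# initialized from scratch rather than from pretrained weights.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,   # placeholder: should match the custom tokenizer's vocab size
    n_positions=1024,    # maximum sequence length
    n_embd=768,          # same hidden size as GPT-2 small
    n_layer=6,           # 6 layers instead of the usual 12
    n_head=12,
)
model = GPT2LMHeadModel(config)  # randomly initialized, trained from scratch
```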

Thank you very much!!!

Hi @miguelvictor,

  2. You can train your tokenizer using the `tokenizers` library. These are fast Rust tokenizers with a Python API.
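
For example, here is a minimal sketch (file paths and vocabulary size are placeholders, and `GPT2TokenizerFast` is just one convenient way to load a byte-level BPE back into the transformers API):

```python
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2TokenizerFast

# Train a byte-level BPE tokenizer directly on your source-code files.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/sources.txt"],        # placeholder: your source-code files
    vocab_size=32_000,                 # placeholder
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Save vocab.json and merges.txt, then reload through the fast-tokenizer API
# so it works with the rest of the Hugging Face ecosystem (Trainer, collators, ...).
os.makedirs("my-code-tokenizer", exist_ok=True)
tokenizer.save_model("my-code-tokenizer")
hf_tokenizer = GPT2TokenizerFast.from_pretrained("my-code-tokenizer")
```

That way you usually don’t need to subclass anything yourself.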

Hello. Thank you for replying.

Do you have any ideas about my question #1? I tried looking at the source code and it seems that no “past” is given to the model during training for longer sequences (or maybe I’m wrong).
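
In the meantime, just to check my understanding of the alternative: is the expected approach to chunk the tokenized corpus into fixed 1024-token blocks myself before training? Something like this rough sketch (where `all_token_ids` is a placeholder for the concatenated token ids of my corpus):

```python
# Rough sketch: split a flat list of token ids into fixed-size blocks so every
# training sample fits inside the model's 1024-token context window.
def chunk_into_blocks(all_token_ids, block_size=1024):
    n_blocks = len(all_token_ids) // block_size  # drop the trailing remainder
    return [
        all_token_ids[i * block_size : (i + 1) * block_size]
        for i in range(n_blocks)
    ]

# Each block would then be one training example; GPT2LMHeadModel shifts the
# labels internally when `labels` are passed, so no manual "past" handling.
blocks = chunk_into_blocks(list(range(5000)), block_size=1024)  # toy example
```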