Training GPT-2 from scratch


I’m currently working on a toy project that uses GPT-2 (the smallest variant, but with only 6 layers, trained from scratch) to predict the next token in the context of programming languages. My dataset consists entirely of source code, I’m using a custom tokenizer, and I have the following questions:

  1. If a sample is longer than 1024 tokens (assuming the model’s max length is 1024), are the past tokens automatically fed back to the model during training, or should I handle that myself?
  2. My custom tokenizer works well (in my opinion), but I want to use the Hugging Face API to take advantage of the “fast” tokenizers. How do I go about subclassing the Tokenizer class so that my tokenizer is compatible with Hugging Face’s tokenizer API?

Thank you very much!!!

Hi @miguelvictor,

  1. You can train your tokenizer using the `tokenizers` library. These are fast Rust tokenizers with a Python API.
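A minimal sketch of what that could look like, assuming a byte-level BPE tokenizer (the scheme GPT-2 itself uses) trained on an in-memory toy corpus; the corpus snippets, vocabulary size, and special tokens here are placeholders you would replace with your own:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus: in practice, iterate over your source-code files.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "for i in range(10):\n    print(i)\n",
]

# Byte-level BPE, similar in spirit to GPT-2's tokenizer.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=1000,  # assumed value; tune for your corpus
    special_tokens=["<unk>", "<|endoftext|>"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("def add(a, b):")
print(encoding.tokens)
```

The trained `Tokenizer` can then be saved with `tokenizer.save(...)` and loaded into `transformers` via `PreTrainedTokenizerFast(tokenizer_file=...)`, which avoids subclassing anything by hand.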

Hello. Thank you for replying.

Do you have any ideas about my question #1? I tried looking at the source code, and it seems no “past” is passed to the model during training for longer sequences (or maybe I’m wrong).
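For what it’s worth, the common practice for training (as opposed to generation, where `past_key_values` is used) is not to feed a past cache at all: you concatenate the tokenized samples and slice them into fixed-length blocks of the model’s context size. A minimal sketch, assuming a block size of 1024 and plain lists of token ids:

```python
def chunk_into_blocks(token_ids, block_size=1024):
    """Slice a long token-id sequence into full fixed-length blocks.

    The trailing remainder shorter than block_size is dropped,
    which is the usual convention for language-model pretraining.
    """
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

ids = list(range(2500))  # stand-in for one long tokenized sample
blocks = chunk_into_blocks(ids)
print(len(blocks), len(blocks[0]))  # → 2 1024
```

Each block is then an independent training example; the model never sees a cached “past” across blocks during training.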