I’m currently working on a toy project that trains a GPT-2 model from scratch (the smallest variant, but with only 6 layers) to predict next tokens in the context of programming languages. My dataset consists entirely of source code, and I am using a custom tokenizer. I have the following questions:
- If a sample is longer than 1024 tokens (supposing the model’s max length is 1024), are the earlier tokens automatically fed back to the model during training, or should I handle that myself?
- My custom tokenizer works well (in my opinion), but I want to use the Hugging Face API to take advantage of the “fast” tokenizers. How do I go about subclassing the Tokenizer class so that my tokenizer is compatible with Hugging Face’s tokenizer API?
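To make the first question concrete, here is a minimal sketch of what "doing it myself" might look like: splitting one tokenized sample into blocks that fit the context window. The names `split_into_blocks`, `MAX_LEN`, and `stride` are hypothetical and not part of any library, and the optional overlap between blocks is only one possible way to carry earlier tokens forward.

```python
from typing import List

MAX_LEN = 1024  # assumed model context length, as in the question


def split_into_blocks(
    token_ids: List[int], block_size: int = MAX_LEN, stride: int = 0
) -> List[List[int]]:
    """Split one tokenized sample into blocks of at most `block_size` tokens.

    `stride` is the number of tokens from the end of one block that are
    repeated at the start of the next, so later blocks keep some context.
    This is a hypothetical helper, not a Hugging Face function.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0 <= stride < block_size:
        raise ValueError("stride must satisfy 0 <= stride < block_size")

    step = block_size - stride  # how far the window advances each time
    blocks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        blocks.append(token_ids[start:start + block_size])
        # Stop once the current block has reached the end of the sample.
        if start + block_size >= len(token_ids):
            break
    return blocks


# Usage: a fake sample of 2500 token ids, split with and without overlap.
sample = list(range(2500))
plain_blocks = split_into_blocks(sample)
overlapping_blocks = split_into_blocks(sample, stride=256)
print([len(b) for b in plain_blocks])
print([len(b) for b in overlapping_blocks])
```

Each block would then be treated as a separate training example, so the question is whether something like this is needed or whether the library already handles it.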
Thank you very much!!!