Simple example of Transformer from scratch?

Is there a full example of how to train an extremely small/simple transformer model (e.g. GPTNeo with only a hundred parameters) entirely from scratch? I’m trying to do this just for learning purposes, but I keep getting CUDA errors. It can’t be that I’m running out of memory, since the model and datasets are tiny; it’s more likely some kind of indexing error caused by how I’m setting one or more parameters, and I’d like to start from a working example to figure it out. Any suggestions?
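Since the exact error message isn't shown, the following is only a guess at the cause: with tiny models, CUDA errors of the indexing kind often come from an embedding lookup that is out of range, i.e. a token ID that is not smaller than the model's vocabulary size, or a sequence longer than the model's maximum number of positions. A minimal sketch of a sanity check on the input IDs before training is below; the helper name `find_index_problems` and the sample values are made up for illustration, and it is meant for `input_ids` only (labels may legitimately contain an ignore value such as -100).

```python
def find_index_problems(batch, vocab_size, max_positions):
    """Return readable descriptions of out-of-range entries in a batch.

    `batch` is a list of token-ID lists (hypothetical input format).
    """
    problems = []
    for row, ids in enumerate(batch):
        # Position embeddings only exist for the first `max_positions` slots.
        if len(ids) > max_positions:
            problems.append(
                f"row {row}: length {len(ids)} > max_positions {max_positions}"
            )
        # Token embeddings only exist for IDs in [0, vocab_size).
        for pos, tok in enumerate(ids):
            if not 0 <= tok < vocab_size:
                problems.append(
                    f"row {row}, pos {pos}: token id {tok} outside [0, {vocab_size})"
                )
    return problems


# Illustrative batch: the second row is too long and contains ID 16.
batch = [[1, 5, 9], [2, 16, 3, 4, 7]]
for problem in find_index_problems(batch, vocab_size=16, max_positions=4):
    print(problem)
```

Running the same inputs on CPU instead of GPU is another way to get a more readable error for this kind of problem.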

The simplest example is the nanoGPT project from Karpathy: karpathy/nanoGPT on GitHub ("The simplest, fastest repository for training/finetuning medium-sized GPTs").

See also the Hugging Face training scripts for causal language modeling, which are a bit more extensive and feature-complete: examples/pytorch/language-modeling in the huggingface/transformers repository on GitHub.

Ah, sorry, I didn’t explain myself clearly: since we’re on the Hugging Face forums, I took it for granted that I meant the simplest example using the transformers library. I’m getting errors when launching training with the Trainer class, and I’m not sure I’m organizing everything correctly (datasets, tokenizer, etc.). Your second link should be good for this; I’m interested in causal language modeling, so it looks like the right fit.