Create a simple and reproducible training process for a GPT-like model?


After watching/reading videos and blog posts, I finally managed to build my own script for a GPT-like model*, including a training/testing loop. You can see it here if you need the details. I am not explicitly asking you to fix the code, but if you like, you are more than welcome :smiley:

*) Multi-Block-Multi-Head Self-Attention Approach

To evaluate whether everything is working, I’d like to achieve some “quick wins”, so I am using a quite small data set (22 MByte).

Still, when I run everything, the losses are quite high (using CrossEntropy, the loss is between 3 and 5).
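One sanity check worth mentioning (not from your post, just a standard property of cross-entropy): a model that guesses uniformly over the vocabulary has a loss of exactly ln(vocab_size), so for a character-level vocabulary of, say, 65 symbols, a loss of ~4.17 means the model has learned nothing yet. Comparing your loss against this baseline tells you whether training is making any progress at all:

```python
import math

# Hypothetical vocabulary size for a character-level tokenizer;
# replace with the actual vocab size from your script.
vocab_size = 65

# Cross-entropy of a uniform prediction over vocab_size classes is ln(vocab_size).
random_baseline = math.log(vocab_size)
print(f"Random-guessing baseline loss: {random_baseline:.2f}")
```

If your loss hovers near this value, the model is effectively guessing at random; values well below it mean learning is happening, even if the samples still look bad.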

Also, when I generate a result from a really small “prompt”, the output is bogus.
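One common trick here (my suggestion, not something from the original script): how you sample at generation time matters a lot. Sampling from the raw output distribution of a barely-trained model gives near-random text, while dividing the logits by a temperature below 1 sharpens the distribution toward the model's best guesses. A minimal stdlib-only sketch, assuming your model produces a list of per-token logits:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample a token index from raw logits.

    temperature < 1 sharpens the distribution (greedier output),
    temperature > 1 flattens it (more random output).
    """
    scaled = [l / temperature for l in logits]
    # Subtract the max for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a very low temperature this behaves almost like argmax, which often makes the output of an undertrained model look noticeably less “bogus”.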

So I wonder: are there any tricks or hints for getting a reasonable result within a “short time”?

Knowing that all parameters are inter-dependent, I wonder: what hyperparameters could help me here? A lower learning rate? More epochs? A larger block size? Are there any constraints or “thresholds”, like “no reasonable result under 72 hours” or “nothing useful with less than 100 MByte of training data”?

I’m not trying to get a fully fledged LLM. I just want to reproduce the steps and see simple results, like “Hello” completing to “Hello world”, to understand the technology.


Hmm… well, the answer is probably simple. I created a dummy data set containing n repetitions of “Hello World!”, and that results in a low loss, so at least my algorithm seems to work. Still, I wonder if there are other ways to achieve this. The model is also not very accurate; this is how it completes “Hello”:

```
Helo Wor
Hello Wo
Hellorld! Wo
```
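For reference, the dummy-set experiment above could be sketched like this (the variable names are my own, not taken from the original script). The point is that repeating one phrase shrinks the character vocabulary to just 9 symbols, so the achievable cross-entropy is naturally very low, which is what makes this a good overfitting sanity check:

```python
# Tiny dummy corpus: n repetitions of the same phrase.
text = "Hello World! " * 1000

# Character-level vocabulary and encode/decode maps.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

data = encode(text)
print(len(chars), "unique characters")  # tiny vocab -> low achievable loss
```

A model that cannot quickly drive the loss down on a corpus like this almost certainly has a bug in the loss computation, the target shifting, or the batch construction, so it is a useful test before moving to real data.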