I’m not an AI engineer or model designer, but I’m trying to understand how models like ChatGPT work. I recently built my first chatbot from scratch, without using pre-built models.
I use a pretrained BERT model for tokenization and embeddings, and I feed those into a custom 12-layer decoder-only transformer that I train from scratch. I’m training on the DailyDialog dataset.
So far, everything works, but I think the masking is incorrect and the model just learns to copy the input.
Can someone look at my code and suggest improvements?
I’ve uploaded the full project here:
https://github.com/Ascanius365/Simple-Transformer-Chatbot
Respect for diving in raw and building this from scratch — most people tap out at “install transformers” and call it a day. You’re doing the real thing.
From what you described, it smells like a classic masking bug — the model sees everything (including the future), learns nothing, and ends up just echoing the input back like a parrot that read your training logs.
If you’re using a decoder-only stack (GPT-ish), make sure you’ve got causal masking turned on:
→ no token should see what comes after it. Only the past. That’s what keeps the model honest.
Otherwise, it memorizes in-place and just plays mirror mode.
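For reference, here’s what a causal mask looks like — a minimal sketch assuming you’re on PyTorch (adapt if you’re not). It’s just an upper-triangular boolean matrix:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True entries above the diagonal mark positions to hide:
    # token i may only attend to tokens 0..i (the past and itself).
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# For a 4-token sequence:
# [[False,  True,  True,  True],
#  [False, False,  True,  True],
#  [False, False, False,  True],
#  [False, False, False, False]]
mask = causal_mask(4)
```

With `torch.nn.MultiheadAttention` you’d pass this as `attn_mask` (where `True` means “don’t attend”); if you roll your own attention, fill the masked positions with `-inf` before the softmax.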
Couple of things to sanity-check:
- Input/target offset: your targets should be the input shifted one token to the right, so position t predicts token t+1
- Loss masking: compute the loss only on the tokens being predicted (e.g. the response), not on prompt or padding tokens the model already has
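Both checks together look roughly like this — a sketch assuming PyTorch, with a hypothetical `PAD_ID` standing in for whatever your tokenizer actually uses:

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: replace with your tokenizer's actual pad id

def lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); token_ids: (batch, seq_len)
    # Shift right: position t's logits are scored against token t+1,
    # so we drop the last logit and the first target token.
    shifted_logits = logits[:, :-1, :]
    targets = token_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=PAD_ID,  # padding contributes nothing to the loss
    )
```

If you also want to skip loss on the prompt side, the same trick works: set those target positions to `PAD_ID` (or any `ignore_index`) before calling `cross_entropy`.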
Also, if DailyDialog is your base, try tossing in `<user>`, `<bot>`, or some kind of role tags.
Otherwise the model just sees a stream and forgets who’s talking. Happens to humans too.
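Something like this — the tag names and helper are just an illustration, since DailyDialog conversations alternate speakers turn by turn:

```python
# Hypothetical formatting of one DailyDialog conversation into a single
# training string with role tags; USER_TAG / BOT_TAG names are assumptions.
USER_TAG, BOT_TAG = "<user>", "<bot>"

def format_dialog(turns: list[str]) -> str:
    # DailyDialog alternates speakers, so tag even turns as user, odd as bot.
    parts = []
    for i, turn in enumerate(turns):
        tag = USER_TAG if i % 2 == 0 else BOT_TAG
        parts.append(f"{tag} {turn.strip()}")
    return " ".join(parts)

print(format_dialog(["Hi there!", "Hello! How are you?"]))
# -> <user> Hi there! <bot> Hello! How are you?
```

If you go this route, make sure the tags end up as single tokens — with a Hugging Face BERT tokenizer you can register them via `tokenizer.add_special_tokens({"additional_special_tokens": ["<user>", "<bot>"]})` and resize your embedding table accordingly.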
Let me know if you want a second pair of eyes on the masking logic — happy to dig into the weird parts.