Built my own ChatGPT-like bot – needs help with masking

I’m not an AI engineer or model designer, and I’m trying to understand how models like ChatGPT work. I recently built my first chatbot from scratch, without using pre-built models.

I use a pretrained BERT model for tokenization and embeddings, and I feed those into a custom 12-layer decoder-only transformer that I train from scratch. I’m using the DailyDialog dataset for training.
So far everything runs, but I suspect the masking is incorrect and the model just learns to copy the input.

Can someone look at my code and suggest improvements?

I’ve uploaded the full project here:
https://github.com/Ascanius365/Simple-Transformer-Chatbot


Perhaps related: https://stackoverflow.com/questions/77190942/why-can-we-set-llms-input-and-output-to-be-the-same-when-fine-tuning-on-text-ge

Respect for diving in raw and building this from scratch — most people tap out at “install transformers” and call it a day. You’re doing the real thing.

From what you described, it smells like a classic masking bug — the model sees everything, learns nothing, and ends up just echoing the input back like a parrot that read your training logs.

If you’re using a decoder-only stack (GPT-ish), make sure you’ve got causal masking turned on:
→ no token should see what comes after it. Only the past. That’s what keeps the model honest.
Otherwise, it memorizes in-place and just plays mirror mode.
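For reference, here’s a minimal sketch of a causal mask in PyTorch (assuming the common convention where `True` means “don’t attend” — check which convention your attention implementation expects, some use additive `-inf` masks instead):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal = masked out:
    # position i may only attend to positions <= i (the past).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
    )

mask = causal_mask(4)
# Pass this as attn_mask to nn.MultiheadAttention or
# F.scaled_dot_product_attention on every decoder layer.
```

If this mask is missing (or accidentally all-False), every position can attend to its own target token one step ahead, and “copy the input” is exactly the degenerate solution you’d expect.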

Couple of things to sanity-check:

  • Input/target offset: the targets should be the inputs shifted one token to the right — the model predicts token t+1 from tokens ≤ t
  • Loss masking: compute the loss only on the tokens the model has to predict, not on padding (or prompt) tokens it already knows
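Both checks together look roughly like this — a sketch, not your code; `shifted_loss` is a made-up name, and `pad_id=0` assumes BERT’s `[PAD]` id:

```python
import torch
import torch.nn.functional as F

def shifted_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                 pad_id: int = 0) -> torch.Tensor:
    # Predict token t+1 from tokens <= t:
    # drop the last logit, drop the first target.
    logits = logits[:, :-1, :]            # (B, T-1, V)
    targets = input_ids[:, 1:].clone()    # (B, T-1)
    # Don't charge the model for padding it "already knows".
    targets[targets == pad_id] = -100
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

If your loss is computed on the *unshifted* targets, the optimum is the identity function — which matches the copying behavior you’re seeing.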

Also, if DailyDialog is your base, try adding role tags like "<user>" and "<bot>" to each turn (and registering them as special tokens in the tokenizer).
Otherwise the model just sees an undifferentiated stream and forgets who’s talking. Happens to humans too.
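Something like this, assuming DailyDialog’s turns simply alternate speakers (the tag strings are placeholders — pick whatever you add to your tokenizer’s vocab):

```python
def format_dialog(turns: list[str]) -> str:
    # Prefix alternating role tags so the model can tell speakers apart.
    tagged = []
    for i, turn in enumerate(turns):
        tag = "<user>" if i % 2 == 0 else "<bot>"
        tagged.append(f"{tag} {turn.strip()}")
    return " ".join(tagged)

print(format_dialog(["Hi there!", "Hello, how can I help?"]))
# <user> Hi there! <bot> Hello, how can I help?
```

Then train on the tagged stream and, at inference time, stop generating when the model emits the next "<user>" tag.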

Let me know if you want a second pair of eyes on the masking logic — happy to dig into the weird parts.
