Built my own ChatGPT-like bot – needs help with masking

I’m not an AI engineer or model designer, and I’m trying to understand how models like ChatGPT work. I recently built my first chatbot from scratch, without using pre-built models.

I use a pretrained BERT model for tokenization and embeddings, and I feed those into a custom 12-layer decoder-only transformer that I train from scratch. I’m using the DailyDialog dataset for training.
So far everything runs, but I suspect the masking is incorrect and the model just learns to copy the input.

Can someone look at my code and suggest improvements?

I’ve uploaded the full project here:
https://github.com/Ascanius365/Simple-Transformer-Chatbot


Perhaps related: https://stackoverflow.com/questions/77190942/why-can-we-set-llms-input-and-output-to-be-the-same-when-fine-tuning-on-text-ge

Respect for diving in raw and building this from scratch — most people tap out at “install transformers” and call it a day. You’re doing the real thing.

From what you described, it smells like a classic masking bug — the model sees everything, learns nothing, and ends up just echoing the input back like a parrot that read your training logs.

If you’re using a decoder-only stack (GPT-ish), make sure you’ve got causal masking turned on:
→ no token should see what comes after it. Only the past. That’s what keeps the model honest.
Otherwise, it memorizes in-place and just plays mirror mode.
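For reference, here’s a minimal sketch of a causal mask in PyTorch (assuming the common convention where `True` means “don’t attend” — check which convention your attention implementation expects, some use additive `-inf` masks instead):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal = masked out:
    # position i may only attend to positions <= i (the past).
    return torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
    )

mask = causal_mask(4)
# Pass this as attn_mask to nn.MultiheadAttention or
# F.scaled_dot_product_attention on every decoder layer.
```

If this mask is missing (or accidentally all-False), every position can attend to its own target token one step ahead, and “copy the input” is exactly the degenerate solution you’d expect.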

Couple of things to sanity-check:

  • Input/target offset: the targets should be the inputs shifted one token to the right — the model predicts token t+1 from tokens ≤ t
  • Loss masking: compute the loss only on the tokens the model has to predict, not on padding (or prompt) tokens it already knows
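Both checks together look roughly like this — a sketch, not your code; `shifted_loss` is a made-up name, and `pad_id=0` assumes BERT’s `[PAD]` id:

```python
import torch
import torch.nn.functional as F

def shifted_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                 pad_id: int = 0) -> torch.Tensor:
    # Predict token t+1 from tokens <= t:
    # drop the last logit, drop the first target.
    logits = logits[:, :-1, :]            # (B, T-1, V)
    targets = input_ids[:, 1:].clone()    # (B, T-1)
    # Don't charge the model for padding it "already knows".
    targets[targets == pad_id] = -100
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

If your loss is computed on the *unshifted* targets, the optimum is the identity function — which matches the copying behavior you’re seeing.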

Also, if DailyDialog is your base, try adding role tags like "<user>" and "<bot>" to each turn (and registering them as special tokens in the tokenizer).
Otherwise the model just sees an undifferentiated stream and forgets who’s talking. Happens to humans too.
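Something like this, assuming DailyDialog’s turns simply alternate speakers (the tag strings are placeholders — pick whatever you add to your tokenizer’s vocab):

```python
def format_dialog(turns: list[str]) -> str:
    # Prefix alternating role tags so the model can tell speakers apart.
    tagged = []
    for i, turn in enumerate(turns):
        tag = "<user>" if i % 2 == 0 else "<bot>"
        tagged.append(f"{tag} {turn.strip()}")
    return " ".join(tagged)

print(format_dialog(["Hi there!", "Hello, how can I help?"]))
# <user> Hi there! <bot> Hello, how can I help?
```

Then train on the tagged stream and, at inference time, stop generating when the model emits the next "<user>" tag.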

Let me know if you want a second pair of eyes on the masking logic — happy to dig into the weird parts.
