Tutorial: Implementing Transformer from Scratch - A Step-by-Step Guide

Hi everyone! Ever wondered how transformers work under the hood? I recently took on the challenge of implementing the Transformer architecture from scratch, and I’ve just published a tutorial to share my journey!

While working on the implementation, I realized that clear documentation would make this more valuable for others learning about transformers. With a little help from Claude to organize and refine my explanations, I’m excited to share the result with you. The code, insights, and learning process are all mine—Claude just made them more accessible!

This tutorial dives into key components like the encoder-decoder stack, attention mechanisms, and even the challenges of testing. You can check it out here.
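
If you'd like a quick taste before clicking through, here is a minimal sketch of the scaled dot-product attention that sits at the heart of the architecture (PyTorch; the shapes and names are just my conventions for this post, not necessarily what the tutorial uses):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # Scale the dot products so the softmax stays in a well-behaved range
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Masked positions (padding / future tokens) get -inf before the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)       # attention distribution over the keys
    return torch.matmul(weights, v), weights  # weighted sum of the value vectors
```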

I’d love your feedback—whether it’s suggestions, questions, or ideas for next steps. For example, I’m considering creating another tutorial focusing on training modules, and I’d love to hear what you’d find most useful. This is a ‘first edition,’ and I’m excited to evolve it with your input!


It really seems promising.
I'll dig into it as soon as I can.
Thanks a lot for the effort.
Mind if I translate it into French?


Hi Racame,

Thank you so much for your kind words! I’m thrilled you find the tutorial promising and appreciate the effort that went into it. :blush:
You’re more than welcome to translate it into French – thank you for helping make the content accessible to a broader audience.
While I don’t speak French myself, I’d love to know how the translation process goes and what feedback you get from the French-speaking community. If there’s anything I can do to support the effort, please let me know!

Best regards,
Jen


Hey @bird-of-paradise,
Thanks for the guide. I'm looking at how to build and train an encoder-decoder model (based on ModernBERT) with the Hugging Face Trainer: Support modernBERT for encoder-decoder models · Issue #35385 · huggingface/transformers · GitHub
Do you have any advice for it?


Hi Bachstelze,
Thanks for your interest! From what I can see in the GitHub issue, the challenge isn’t with the encoder-decoder architecture itself (which is what my tutorial covers), but rather with ModernBERT’s specific implementation in the Hugging Face library. As Niels Rogge pointed out, ModernBERT currently doesn’t support cross-attention, which is needed for encoder-decoder models.

If you’re looking to use ModernBERT specifically, you’d need to either:

  1. Wait for cross-attention support to be added to ModernBERT in the transformers library, or
  2. Consider using another BERT variant that already supports cross-attention (see the sketch below)
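
For the second option, here's a minimal sketch of what that could look like with the `EncoderDecoderModel` class and plain BERT checkpoints (the checkpoint names are only examples, not a recommendation for your task):

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Plain BERT already supports being used as a decoder with cross-attention,
# so two checkpoints can be tied together into an encoder-decoder model.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# These need to be set explicitly before seq2seq training with the Trainer
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```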

If you’re interested in understanding how cross-attention works in encoder-decoder models, my tutorial might help explain the mechanics, even though it doesn’t specifically address the ModernBERT implementation.
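
The core idea is simply that the decoder supplies the queries while the encoder output supplies the keys and values. A rough sketch using PyTorch's built-in multi-head attention (the module and argument names here are my own illustration, not taken from ModernBERT or the tutorial):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Decoder states attend over the encoder output (the 'memory')."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states, encoder_output, memory_padding_mask=None):
        # query = decoder hidden states; key/value = encoder output
        out, weights = self.attn(
            query=decoder_states,
            key=encoder_output,
            value=encoder_output,
            key_padding_mask=memory_padding_mask,
        )
        return out, weights
```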
