Tutorial: Implementing Transformer from Scratch - A Step-by-Step Guide

Hi everyone! Ever wondered how transformers work under the hood? I recently took on the challenge of implementing the Transformer architecture from scratch, and I’ve just published a tutorial to share my journey!

While working on the implementation, I realized that clear documentation would make this more valuable for others learning about transformers. With a little help from Claude to organize and refine my explanations, I’m excited to share the result with you. The code, insights, and learning process are all mine—Claude just made them more accessible!

This tutorial dives into key components like the encoder-decoder stack, attention mechanisms, and even the challenges of testing. You can check it out here.
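
If you'd like a quick taste before clicking through, here is a minimal sketch of the scaled dot-product attention that sits at the heart of the architecture (PyTorch; the shapes and names are just my conventions for this post, not necessarily what the tutorial uses):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # Scale the dot products so the softmax stays in a well-behaved range
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Masked positions (padding / future tokens) get -inf before the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)       # attention distribution over the keys
    return torch.matmul(weights, v), weights  # weighted sum of the value vectors
```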

I’d love your feedback—whether it’s suggestions, questions, or ideas for next steps. For example, I’m considering creating another tutorial focusing on training modules, and I’d love to hear what you’d find most useful. This is a ‘first edition,’ and I’m excited to evolve it with your input!


It really seems promising.
I'll dig into it as soon as I can.
Thanks a lot for the effort.
Mind if I translate it into French?


Hi Racame,

Thank you so much for your kind words! I’m thrilled you find the tutorial promising and appreciate the effort that went into it. :blush:
You’re more than welcome to translate it into French – thank you for helping make the content accessible to a broader audience.
While I don’t speak French myself, I’d love to know how the translation process goes and what feedback you get from the French-speaking community. If there’s anything I can do to support the effort, please let me know!

Best regards,
Jen


Hey @bird-of-paradise,
Thanks for the guide. I'm looking at how to build and train an encoder-decoder model (based on ModernBERT) with the Hugging Face Trainer: Support modernBERT for encoder-decoder models · Issue #35385 · huggingface/transformers · GitHub
Do you have any advice for it?


Hi Bachstelze,
Thanks for your interest! From what I can see in the GitHub issue, the challenge isn’t with the encoder-decoder architecture itself (which is what my tutorial covers), but rather with ModernBERT’s specific implementation in the Hugging Face library. As Niels Rogge pointed out, ModernBERT currently doesn’t support cross-attention, which is needed for encoder-decoder models.

If you’re looking to use ModernBERT specifically, you’d need to either:

  1. Wait for cross-attention support to be added to ModernBERT in the transformers library, or
  2. Consider using another BERT variant that already supports cross-attention (see the sketch below)
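
For the second option, here's a minimal sketch of what that could look like with the `EncoderDecoderModel` class and plain BERT checkpoints (the checkpoint names are only examples, not a recommendation for your task):

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Plain BERT already supports being used as a decoder with cross-attention,
# so two checkpoints can be tied together into an encoder-decoder model.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# These need to be set explicitly before seq2seq training with the Trainer
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```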

If you’re interested in understanding how cross-attention works in encoder-decoder models, my tutorial might help explain the mechanics, even though it doesn’t specifically address the ModernBERT implementation.
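
The core idea is simply that the decoder supplies the queries while the encoder output supplies the keys and values. A rough sketch using PyTorch's built-in multi-head attention (the module and argument names here are my own illustration, not taken from ModernBERT or the tutorial):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Decoder states attend over the encoder output (the 'memory')."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states, encoder_output, memory_padding_mask=None):
        # query = decoder hidden states; key/value = encoder output
        out, weights = self.attn(
            query=decoder_states,
            key=encoder_output,
            value=encoder_output,
            key_padding_mask=memory_padding_mask,
        )
        return out, weights
```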
