Tutorial: Implementing Transformer from Scratch - A Step-by-Step Guide

Hi Bachstelze,
Thanks for your interest! From what I can see in the GitHub issue, the challenge isn’t with the encoder-decoder architecture itself (which is what my tutorial covers), but with ModernBERT’s current implementation in the Hugging Face transformers library. As Niels Rogge pointed out, ModernBERT doesn’t support cross-attention yet, which encoder-decoder models require.

If you’re looking to use ModernBERT specifically, you’d need to either:

  1. Wait for cross-attention support to be added to ModernBERT in the transformers library, or
  2. Consider using another BERT variant that already supports cross-attention (see the sketch just after this list)
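
For option 2, here’s a minimal, untested sketch of what that could look like with the library’s `EncoderDecoderModel`, which pairs a plain BERT checkpoint as both encoder and decoder and adds the cross-attention layers for you. The `bert-base-uncased` checkpoint is just a placeholder, not a recommendation:

```python
# Sketch: a BERT2BERT encoder-decoder via Hugging Face transformers.
# The decoder copy gets cross-attention layers added automatically.
from transformers import BertTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Generation needs these set explicitly for BERT-based decoders.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("A short input sentence.", return_tensors="pt")
generated = model.generate(
    inputs.input_ids, attention_mask=inputs.attention_mask, max_length=20
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

(Before fine-tuning, the decoder’s cross-attention weights are randomly initialized, so the generated text won’t be meaningful out of the box.)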

If you’re interested in understanding how cross-attention works in encoder-decoder models, my tutorial might help explain the mechanics, even though it doesn’t specifically address ModernBERT implementation.
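
In case it helps in the meantime, here’s a stripped-down PyTorch sketch of the core idea (not copied from the tutorial): in cross-attention, the queries come from the decoder states while the keys and values come from the encoder output, so each target position can attend over the source sequence.

```python
# Minimal single-head cross-attention, for illustration only.
import torch
import torch.nn.functional as F
from torch import nn


class CrossAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # queries from the decoder
        self.k_proj = nn.Linear(d_model, d_model)  # keys from the encoder
        self.v_proj = nn.Linear(d_model, d_model)  # values from the encoder
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (batch, tgt_len, d_model)
        # encoder_states: (batch, src_len, d_model)
        q = self.q_proj(decoder_states)
        k = self.k_proj(encoder_states)
        v = self.v_proj(encoder_states)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)  # (batch, tgt_len, src_len)
        return self.out_proj(weights @ v)


# Quick shape check with random tensors.
attn = CrossAttention(d_model=64)
out = attn(torch.randn(2, 5, 64), torch.randn(2, 9, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```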
