How to build and evaluate a vanilla transformer?

EncoderDecoderModel is supported via the Hugging Face API, though it isn’t possible to evaluate one as an AutoModel: #28721
How is it possible to build and evaluate a vanilla transformer with an encoder, cross-attention, and a decoder in Hugging Face?
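For reference, here is a minimal sketch of one way to build such a model with `transformers.EncoderDecoderModel`, constructing a small randomly initialized encoder and decoder from configs (no pretrained weights, no downloads). All the config sizes below are illustrative assumptions, not recommended values:

```python
import torch
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Small, randomly initialized encoder and decoder configs.
# The sizes are arbitrary placeholders chosen to keep the example fast.
enc_cfg = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                     num_attention_heads=4, intermediate_size=128)
dec_cfg = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                     num_attention_heads=4, intermediate_size=128,
                     is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config=config)

# Needed so the model can shift labels into decoder inputs and so
# generation knows where to start.
model.config.decoder_start_token_id = 0
model.config.pad_token_id = 0

# Dummy batch: 2 source sequences of length 8, 2 target sequences of length 6.
input_ids = torch.randint(0, 1000, (2, 8))
labels = torch.randint(0, 1000, (2, 6))

out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # torch.Size([2, 6, 1000])
print(out.loss is not None)  # True
```

The decoder config must set `is_decoder=True` and `add_cross_attention=True`, which is what gives the model the cross-attention blocks between encoder and decoder that the question asks about.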

Model description

“Attention Is All You Need” is a landmark 2017 research paper authored by eight scientists at Google. It expanded the 2014 attention mechanism proposed by Bahdanau et al. into a new deep learning architecture, the transformer, consisting of an encoder, cross-attention, and a decoder.