Training speed vs Megatron

I'm trying to pretrain a language model, so I ran some speed benchmarks comparing transformers and Megatron. I found that if I pre-tokenize the data and use flash-attention, transformers' training speed is on par with Megatron on a single node with 8 GPUs.
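For reference, this is roughly the setup I mean (the checkpoint name and data file are just placeholders; enabling FlashAttention 2 this way assumes a recent transformers version, the flash-attn package, and a supported GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Placeholder checkpoint: any causal LM works here.
model_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Enable FlashAttention 2 at load time (needs flash-attn + a supported GPU).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
)

# Pre-tokenize once up front and let datasets cache the result,
# so tokenization is not repeated on every pass over the data.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=["text"],
)
```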
Is this expected? As far as I know, transformers is supposed to be slower than Megatron, especially for pretraining from scratch.