Differences between transformers GPT2 and Megatron-LM?

This follows on from this issue. @valhalla, sorry for tagging you here again.

My questions are:

  1. Both are GPT2, but are there any differences between them, including architecture, ops, etc.?
  2. Megatron-LM uses FusedLayerNorm, but I don't see such an op inside transformers GPT2. Are the two equivalent in terms of final predictions?
  3. What are the strengths of Megatron-LM, and what are its weaknesses compared with transformers?
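
To make question 2 concrete, this is the kind of comparison I have in mind. It is only a minimal sketch in plain Python: a straightforward two-pass layer norm next to a single-pass variant that accumulates the mean and variance in one loop. The single-pass version is just my stand-in for how a fused kernel might organize the computation; I have not confirmed that it matches the actual FusedLayerNorm kernel or the transformers implementation. All function names, the hidden size of 768, and `eps=1e-5` are assumptions for illustration.

```python
import math
import random


def layer_norm_reference(x, weight, bias, eps=1e-5):
    """Two-pass layer norm: compute the mean, then the biased variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


def layer_norm_single_pass(x, weight, bias, eps=1e-5):
    """Single-pass layer norm using Welford's running mean/variance update.

    Hypothetical stand-in for a fused kernel, not the real implementation.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for v in x:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    var = m2 / count  # biased variance, same convention as the reference
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


random.seed(0)
hidden_size = 768  # assumed hidden size, for illustration only
x = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]
weight = [1.0] * hidden_size
bias = [0.0] * hidden_size

out_ref = layer_norm_reference(x, weight, bias)
out_one = layer_norm_single_pass(x, weight, bias)
diff = max_abs_diff(out_ref, out_one)
print("max abs difference:", diff)
```

In this toy version the two outputs agree up to floating-point rounding. What I would like to know is whether the same holds between the real FusedLayerNorm and the layer norm used in transformers GPT2, especially in fp16.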

Thanks in advance to anyone who can give me a hand.