https://github.com/huggingface/transformers/issues/16016
This comes from the issue above. @valhalla, sorry for @-ing you here again.
My questions are:
- They are both GPT-2, but are there any differences, e.g. in architecture, ops, etc.?
- Megatron-LM uses FusedLayerNorm, but I don't see such an op inside the transformers GPT-2 implementation. Are the two equivalent in terms of final predictions?
- What are the strengths and weaknesses of Megatron-LM compared with transformers?
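For context on the FusedLayerNorm question: as far as I understand, "fused" only refers to computing the whole operation in a single GPU kernel; the math itself is the standard LayerNorm formula that `torch.nn.LayerNorm` also implements, so outputs should agree up to floating-point tolerance. A minimal pure-Python sketch of that shared formula (my own illustration, not code from either repo):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last dimension: normalize to zero mean
    # and unit variance, then apply the learned scale (gamma) and shift (beta).
    # Both torch.nn.LayerNorm and apex-style FusedLayerNorm compute this
    # same formula; the fused version just does it in one kernel launch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With identity gamma/beta the output is just the normalized input, which is why I would expect the two implementations to differ only by tiny numerical error, not in final predictions.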
Thanks in advance to anyone who can give me a hand.