Fused Kernel Operations

Megatron-LM incorporates several fused PyTorch operations that, in my preliminary testing, give far better single-GPU (non-parallel) performance than Hugging Face's implementations.
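To illustrate the kind of fusion I mean, here is a minimal sketch of an elementwise bias-add + GeLU fused via TorchScript. This is only an approximation of the approach for illustration, not Megatron-LM's actual kernel:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # TorchScript can fuse the bias add and the tanh-approximate GeLU
    # (the variant GPT-2 uses) into a single elementwise CUDA kernel,
    # avoiding an extra memory round-trip between the two ops.
    y = x + bias
    return y * 0.5 * (1.0 + torch.tanh(0.7978845608028654 * (y + 0.044715 * y * y * y)))


def unfused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Eager-mode reference: the bias add and GeLU run as separate kernels.
    return torch.nn.functional.gelu(x + bias, approximate="tanh")
```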

For example, even when training Hugging Face's GPT-2 with fused Adam, I see a 2x difference in throughput compared to Megatron-LM's gpt_model on a single 16 GB V100 GPU.
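For reference, a rough sketch of that comparison setup (the hyperparameters here are placeholders, not the exact values from my runs):

```python
import torch
from transformers import GPT2LMHeadModel
from apex.optimizers import FusedAdam  # requires NVIDIA apex

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()

# FusedAdam performs the Adam update with fused multi-tensor CUDA
# kernels instead of one small kernel launch per parameter tensor.
optimizer = FusedAdam(model.parameters(), lr=1e-4, weight_decay=0.01)

# Even with the fused optimizer, the remaining unfused ops in the
# forward/backward pass (softmax, bias + GeLU, layer norm, etc.)
# leave a large throughput gap versus Megatron-LM's gpt_model.
```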

Though I have not tested this thoroughly, the speedups appear to be even larger on 40 GB A100s with bigger models.

Does Hugging Face have support for these fused operations? If not, would this be a reasonable feature to add?