Fused Kernel Operations

Megatron-LM incorporates several fused PyTorch operations that, in my preliminary testing, give far better single-GPU (non-parallel) performance than Hugging Face's implementations.
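To illustrate the kind of fusion I mean, here is a minimal sketch of an elementwise bias-add + GeLU fused via TorchScript. This is only an approximation of the approach for illustration, not Megatron-LM's actual kernel:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # TorchScript can fuse the bias add and the tanh-approximate GeLU
    # (the variant GPT-2 uses) into a single elementwise CUDA kernel,
    # avoiding an extra memory round-trip between the two ops.
    y = x + bias
    return y * 0.5 * (1.0 + torch.tanh(0.7978845608028654 * (y + 0.044715 * y * y * y)))


def unfused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Eager-mode reference: the bias add and GeLU run as separate kernels.
    return torch.nn.functional.gelu(x + bias, approximate="tanh")
```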

For example, even when training Hugging Face's GPT-2 with fused Adam, I see a 2x difference in throughput compared to Megatron-LM's gpt_model on a single 16 GB V100 GPU.
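For reference, a rough sketch of that comparison setup (the hyperparameters here are placeholders, not the exact values from my runs):

```python
import torch
from transformers import GPT2LMHeadModel
from apex.optimizers import FusedAdam  # requires NVIDIA apex

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()

# FusedAdam performs the Adam update with fused multi-tensor CUDA
# kernels instead of one small kernel launch per parameter tensor.
optimizer = FusedAdam(model.parameters(), lr=1e-4, weight_decay=0.01)

# Even with the fused optimizer, the remaining unfused ops in the
# forward/backward pass (softmax, bias + GeLU, layer norm, etc.)
# leave a large throughput gap versus Megatron-LM's gpt_model.
```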

Though I have not tested this thoroughly, the speedups appear to be even larger on 40 GB A100s with bigger models.

Does Hugging Face have support for these fused operations? If not, would this be a reasonable feature to add?