Memory Alignment Fragmentation Fix for Transformers Inference (YOLO/BERT) — Kernel-Level + Runtime Observations

Sids99 · July 7, 2025, 10:58pm

Hello Hugging Face team & community,

I wanted to share a deep-dive kernel-level performance optimization that emerged while working with Hugging Face Transformers (YOLOv5 and BERT) models deployed in a server environment using ONNX Runtime + OpenVINO backends.

Observed Issue (During Inference via HF Console):

While debugging inconsistent latency during batched inference, I discovered:

A memory warning related to THP alignment from the kernel
Memory fragmentation patterns that negatively affected inference throughput
Runtime execution errors in the Hugging Face console tied to allocation failures or misaligned buffers, especially under batch sizes 8–32

Fix: Kernel Patch on THP Alignment Heuristic

The root issue came from a change in Linux (commit efa7df3e3bb5) that aligned all anonymous memory allocations ≥2MB to PMD boundaries, unintentionally causing fragmentation. This impacted dynamically-sized tensor buffers common in Hugging Face Transformers.

I submitted a patch (now being discussed on LKML) that adjusts this logic:

Only force-aligns allocations if their length is a multiple of 2MB
Prevents gaps between tensor buffers that prevented hugepage coalescence
Improves memory locality, cache efficiency, and TLB behavior

Results (Observed While Testing via Hugging Face Console):

Fixed previously observed console-level memory fragmentation and buffer errors
Inference latency reduced by 40–60% in test runs on Intel Xeon (Cooper Lake) with BERT and YOLOv5
Throughput improvements (3x–32x depending on batch size and input length)
Console no longer showed tracebacks from misaligned memory-mapped regions

Relevant Stack:

Hugging Face Transformers (BERT-base, YOLOv5)
ONNX Runtime
OpenVINO backend (v2024.x)
Linux Kernel 6.6 → 6.6.8 with patched alignment logic

If any contributors from HF Runtime or Inference API teams are interested in reproducing or validating this impact further (especially in dynamic input or sharded models), I’d be happy to collaborate.

Also curious if there’s a preferred way to surface Hugging Face-specific environment errors observed during model loading or execution (e.g., console-based logs not shown in API execution).

Thanks again for the amazing tools — this issue came up only because of Hugging Face’s flexible model loading and high throughput inference support!

Best Regards,
Siddhartha Sharma
Intel Software Innovator/ISV

Topic		Replies	Views
Advice to speed and performance 🤗Transformers	4	7218	December 7, 2020
Huggingface using only half of the cores for inference Intermediate	0	518	September 6, 2023
Open Source survey results [Jan 2022] Community Calls	1	2256	March 10, 2022
Supporting ONNX optimized models 🤗Transformers	16	3041	September 1, 2021
Hugging Face Reads - 01/2021 - Sparsity and Pruning Research	14	7487	June 3, 2025

Memory Alignment Fragmentation Fix for Transformers Inference (YOLO/BERT) — Kernel-Level + Runtime Observations

Related topics