Memory Alignment Fragmentation Fix for Transformers Inference (YOLO/BERT) — Kernel-Level + Runtime Observations

Hello Hugging Face team & community,

I wanted to share a deep-dive kernel-level performance optimization that emerged while working with Hugging Face Transformers (YOLOv5 and BERT) models deployed in a server environment using ONNX Runtime + OpenVINO backends.

:chart_decreasing: Observed Issue (During Inference via HF Console):

While debugging inconsistent latency during batched inference, I discovered:

  • A memory warning related to THP alignment from the kernel
  • Memory fragmentation patterns that negatively affected inference throughput
  • Runtime execution errors in the Hugging Face console tied to allocation failures or misaligned buffers, especially under batch sizes 8–32

:hammer_and_wrench: Fix: Kernel Patch on THP Alignment Heuristic

The root issue came from a change in Linux (commit efa7df3e3bb5) that aligned all anonymous memory allocations ≥2MB to PMD boundaries, unintentionally causing fragmentation. This impacted dynamically-sized tensor buffers common in Hugging Face Transformers.

I submitted a patch (now being discussed on LKML) that adjusts this logic:

  • Only force-aligns allocations if their length is a multiple of 2MB
  • Prevents gaps between tensor buffers that prevented hugepage coalescence
  • Improves memory locality, cache efficiency, and TLB behavior

:white_check_mark: Results (Observed While Testing via Hugging Face Console):

  • Fixed previously observed console-level memory fragmentation and buffer errors
  • Inference latency reduced by 40–60% in test runs on Intel Xeon (Cooper Lake) with BERT and YOLOv5
  • Throughput improvements (3x–32x depending on batch size and input length)
  • Console no longer showed tracebacks from misaligned memory-mapped regions

:repeat_button: Relevant Stack:

  • Hugging Face Transformers (BERT-base, YOLOv5)
  • ONNX Runtime
  • OpenVINO backend (v2024.x)
  • Linux Kernel 6.6 → 6.6.8 with patched alignment logic

If any contributors from HF Runtime or Inference API teams are interested in reproducing or validating this impact further (especially in dynamic input or sharded models), I’d be happy to collaborate.

Also curious if there’s a preferred way to surface Hugging Face-specific environment errors observed during model loading or execution (e.g., console-based logs not shown in API execution).

Thanks again for the amazing tools — this issue came up only because of Hugging Face’s flexible model loading and high throughput inference support!

Best Regards,
Siddhartha Sharma
Intel Software Innovator/ISV

1 Like