Hi all,
On Apple Silicon M4 (macOS 15.5, 24GB RAM, torch 2.7.1 + transformers 4.41.2), I've encountered a reproducible segmentation fault when running CPU inference with the model dccuchile/bert-base-spanish-wwm-cased.
The model loads fine via from_pretrained(), but the actual forward pass (model(input_ids, attention_mask)) triggers a crash. After tracing the issue with LLDB, I've confirmed the fault originates in libomp.dylib during thread suspension inside libtorch_cpu.dylib while executing LayerNorm.
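For anyone who wants to try this quickly, here is a minimal sketch of the failing path (the scripts in the attached ZIP are the full versions; this is a simplified approximation):

```python
# Minimal CPU-only sketch of the crash path (not the exact script from the ZIP).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # loading alone does NOT crash
model.eval()

inputs = tokenizer("Hola, esto es una prueba de inferencia.", return_tensors="pt")

with torch.no_grad():
    # The forward pass is where the segfault occurs (traced to LayerNorm -> libomp).
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```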
Reproducible on:

- MacBook Pro with Apple M4, 24GB RAM, 1TB SSD
- Python 3.11.13 (Homebrew)
- torch 2.7.1 and transformers 4.41.2
- CPU inference only (no GPU, no MPS)
- dccuchile/bert-base-spanish-wwm-cased, but the issue may generalize

Not reproducible:

- On Intel Macs
- When only loading the model (no forward pass)
Attached ZIP package (via iCloud):

- Scripts: IFA_app.py, repro_beto_loader.py
- Full terminal logs (with and without crash)
- pip freeze, system info
- LLDB symbolic backtrace (lldb_backtrace_IFA_app.txt)
- README (EN + ES)

Download here: (iCloud Drive)
Would love to know if others on M4 can replicate this, or if there's known instability around OpenMP / LayerNorm on Apple Silicon CPUs.
Appreciate any insights!
Thanks,
Juan Alberto Ignacio Videla
Buenos Aires, Argentina
Update: This issue has now also been reported on GitHub for broader visibility and tracking:
GitHub Issue #39020: Segfault on Apple M4 using AutoModelForSequenceClassification with BETO model on CPU
It includes the full trace, the LLDB backtrace, and Apple Feedback ID FB18354497.
Happy to collaborate with anyone experiencing similar behavior or investigating libomp / LayerNorm interactions on Apple Silicon.
I can't open iCloud…
In any case, I don't think the cause is Transformers, as it doesn't usually cause a segfault by itself. More likely it is PyTorch (especially version 2.3, which should not be used in practice…) or the underlying environment. For example, with Apple MPS there can be compatibility issues with IPython, Jupyter, and the like.
https://stackoverflow.com/questions/71338821/segmentation-fault-python-after-import-torch-on-mac-m1
https://stackoverflow.com/questions/77812375/pytorch-error-on-mps-apple-silicon-metal
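As a quick sanity check on your side, something like this (all standard torch and platform calls) will show exactly which interpreter, torch build, and backends are in play:

```python
import platform
import torch

# Print the interpreter/architecture and the torch build details relevant here.
print("Python:", platform.python_version(), platform.machine())
print("torch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
print("Intra-op threads:", torch.get_num_threads())
print(torch.__config__.parallel_info())  # shows which OpenMP runtime torch was built against
```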
Hi John,
Many thanks for your reply!
You're right: Transformers itself is probably not the root cause.
It's unusual that this happens in CPU mode rather than MPS… It's likely a problem with a library in PyTorch that sits close to the hardware. If your PyTorch build is not fairly recent, you may run into problems with libomp.
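One cheap test for the libomp theory is to take OpenMP threading out of the picture before importing torch; if the crash disappears with a single thread, that points at the OpenMP runtime rather than the model. A minimal sketch:

```python
import os
# Must be set before torch is imported so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # also cap ATen's intra-op thread pool

# ...then load the model and run the same forward pass as before...
```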
Quoted PyTorch issue (opened Aug 1, 2024, closed Aug 2, 2024; labels: module: binaries, module: crash, triaged, module: macos, module: openmp):
### 🐛 Describe the bug
I encountered a very curious bug: the following code segfaults on my Apple macOS M3 Max CPU:
```python
import sklearn
import torch
import numpy as np
torch.tensor(np.zeros((33000,)))
```
`zsh: segmentation fault python3`
**It does NOT segfault if the array size is 32000 and it doesn't segfault if we don't import sklearn before torch.**
Versions:
```
torch 2.3.0.post100
numpy 1.26.4
sklearn 1.5.0
CPU: Apple M3 Max
```
### Versions
PyTorch version: 2.3.0.post100
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.3.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M3 Max
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0.post100
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.3.0 gpu_mps_py312hf36b297_100
cc @seemethere @malfet @osalpekar @atalman @albanD
Quoted PyTorch issue (opened Sep 17, 2024; labels: module: crash, triaged, module: macos, module: openmp):
### 🐛 I'm using libtorch in a C++ project; it worked for a while, then I got this after running the binary.
The bizarre thing is that it worked for a while, even on GPU; I trained and everything worked. Now it fails on a simple `torch::relu(tensor)` ...
Any idea? Is it M1-related? Would an older version work?
Here are the logs after `./cnn_bin`:
```
LibTorch version: 2.4.1
Tensor created successfully
AddressSanitizer:DEADLYSIGNAL
=================================================================
AddressSanitizer:DEADLYSIGNAL
==6330==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x0001213f0cf0 bp 0x00016be8aa50 sp 0x00016be8a9a0 T2)
AddressSanitizer:DEADLYSIGNAL
==6330==The signal is caused by a READ memory access.
==6330==Hint: address points to the zero page.
#0 0x1213f0cf0 in void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*)+0x30 (libomp.dylib:arm64+0x54cf0)
#1 0x12174151c in kmp_flag_64<false, true>::wait(kmp_info*, int, void*)+0x754 (libomp.dylib:arm64+0x4551c)
#2 0x12173c55c in __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*)+0xb4 (libomp.dylib:arm64+0x4055c)
#3 0x1217400e4 in __kmp_fork_barrier(int, int)+0x270 (libomp.dylib:arm64+0x440e4)
#4 0x12171ce10 in __kmp_launch_thread+0x150 (libomp.dylib:arm64+0x20e10)
#5 0x12175b008 in __kmp_launch_worker(void*)+0x114 (libomp.dylib:arm64+0x5f008)
#6 0x180976030 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x7030)
#7 0x180970e38 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1e38)
==6330==Register values:
x[0] = 0x0000000000000002 x[1] = 0x000000016be8ab30 x[2] = 0x0000000000000000 x[3] = 0x0000000fffffc088
x[4] = 0x0000000000000001 x[5] = 0x0000000000000000 x[6] = 0x000000016be8aca0 x[7] = 0x0000000000000000
x[8] = 0x0000000000000000 x[9] = 0x000000007fffffff x[10] = 0x00000000000003e8 x[11] = 0xce5899d053670034
x[12] = 0x00000000016e3600 x[13] = 0x000000000007e8e8 x[14] = 0x0000000000000000 x[15] = 0x0000000000000000
x[16] = 0x00000001213f0cc0 x[17] = 0x00000001e02c7480 x[18] = 0x0000000000000000 x[19] = 0x000000014a4f49c0
x[20] = 0x000000016be8ab30 x[21] = 0x000000016be8ab30 x[22] = 0x0000000121792c80 x[23] = 0x00000001217885a8
x[24] = 0x0000000000000002 x[25] = 0x0000000000000000 x[26] = 0x0000000121788548 x[27] = 0x000000014a4f4f08
x[28] = 0x000000012178b1e0 fp = 0x000000016be8aa50 lr = 0x0000000121741520 sp = 0x000000016be8a9a0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (libomp.dylib:arm64+0x54cf0) in void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*)+0x30
Thread T2 created by T0 here:
#0 0x10864c1b0 in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4c1b0)
#1 0x12175ab50 in __kmp_create_worker+0xcc (libomp.dylib:arm64+0x5eb50)
#2 0x12171cbb0 in __kmp_allocate_thread+0x420 (libomp.dylib:arm64+0x20bb0)
#3 0x121717640 in __kmp_allocate_team+0x90c (libomp.dylib:arm64+0x1b640)
#4 0x121719438 in __kmp_fork_call+0x16f8 (libomp.dylib:arm64+0x1d438)
#5 0x12170c084 in __kmpc_fork_call+0xc0 (libomp.dylib:arm64+0x10084)
#6 0x111140170 in at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, long long)+0x1ac (libtorch_cpu.dylib:arm64+0xb4170)
#7 0x1133a8644 in at::native::(anonymous namespace)::clamp_min_scalar_kernel_impl(at::TensorIteratorBase&, c10::Scalar)+0x378 (libtorch_cpu.dylib:arm64+0x231c644)
#8 0x1117da30c in void at::native::DispatchStub<void (*)(at::TensorIteratorBase&, c10::Scalar), at::native::clamp_min_scalar_stub_DECLARE_DISPATCH_type>::operator()<at::native::structured_clamp_min_out&, c10::Scalar const&>(c10::DeviceType, at::native::structured_clamp_min_out&, c10::Scalar const&)+0x74 (libtorch_cpu.dylib:arm64+0x74e30c)
#9 0x1123dbeac in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CPU_clamp_min(at::Tensor const&, c10::Scalar const&)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Scalar const&>>, at::Tensor (at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&)+0x78 (libtorch_cpu.dylib:arm64+0x134feac)
#10 0x112109be0 in at::_ops::clamp_min::call(at::Tensor const&, c10::Scalar const&)+0x114 (libtorch_cpu.dylib:arm64+0x107dbe0)
#11 0x1113f6014 in at::native::relu(at::Tensor const&)+0x4c (libtorch_cpu.dylib:arm64+0x36a014)
#12 0x11456d214 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::relu(c10::DispatchKeySet, at::Tensor const&)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&)+0x290 (libtorch_cpu.dylib:arm64+0x34e1214)
#13 0x1122742bc in at::_ops::relu::call(at::Tensor const&)+0x10c (libtorch_cpu.dylib:arm64+0x11e82bc)
#14 0x104f92e98 in at::relu(at::Tensor const&) relu.h:27
#15 0x104f92270 in main main.cpp:16
#16 0x1805f50dc (<unknown module>)
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
==6330==ABORTING
```
### My simple program:
```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    std::cout << "LibTorch version: " << TORCH_VERSION_MAJOR << "."
              << TORCH_VERSION_MINOR << "."
              << TORCH_VERSION_PATCH << std::endl;
    try {
        // Create a tensor
        auto tensor = torch::randn({1, 3, 720, 720});
        std::cout << "Tensor created successfully" << std::endl;
        // Perform a simple operation
        auto result = torch::relu(tensor);
        std::cout << "Operation successful, result tensor size: " << result.sizes() << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Exception: " << e.what() << std::endl;
        return -1;
    }
    return 0;
}
```
### CMakeLists.txt
```
cmake_minimum_required(VERSION 3.10)
project(cnn_bin)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_BUILD_TYPE Debug)
# Path to the LibTorch folder (inside your project)
set(TORCH_PATH "${CMAKE_SOURCE_DIR}/libtorch")
# Find OpenCV
find_package(OpenCV REQUIRED)
# Find LibTorch
find_package(Torch REQUIRED PATHS ${TORCH_PATH}/share/cmake/Torch)
# Add include directories
include_directories(${OpenCV_INCLUDE_DIRS})
# Link OpenCV libraries
link_directories(${OpenCV_LIBRARY_DIRS})
# Enable AddressSanitizer
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address")
    set(CMAKE_LINKER_FLAGS "${CMAKE_LINKER_FLAGS} -fsanitize=address")
endif()
# Add executable
add_executable(cnn_bin main.cpp)
# Link LibTorch and OpenCV libraries
target_link_libraries(cnn_bin "${TORCH_LIBRARIES}" ${OpenCV_LIBS})
set_property(TARGET cnn_bin PROPERTY CXX_STANDARD 17)
```
### Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: macOS 14.3 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.25.2
Libc version: N/A
Python version: 3.9.6 (default, Dec 7 2023, 05:42:47) [Clang 15.0.0 (clang-1500.1.0.2.5)] (64-bit runtime)
Python platform: macOS-14.3-arm64-arm-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
cc @malfet @albanD
Hi John, Nikita, and team,
Thank you for your responses and for following up. I confirm that I'm working in an environment running Apple Silicon M4, and I was using PyTorch 2.3.0 along with torchvision 0.18.1, both installed from PyPI under the "stable" label.
Following your suggestions, I'll upgrade to PyTorch 2.7.1 and check compatibility with the matching torchvision release (0.22.x for PyTorch 2.7). I'll also reinstall libomp to make sure there are no low-level conflicts related to CPU parallelism.
I'll run tests over the next few days and share the results. Given that the M4 chip is relatively new, I understand these reports might be helpful for future compatibility validation.
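The kind of check I plan to rerun after the upgrade looks roughly like this (a sketch only; the real tests will use IFA_app.py and repro_beto_loader.py from the ZIP):

```python
# Run the same CPU forward pass single-threaded and multi-threaded, to see whether
# any remaining crash is tied to OpenMP parallelism.
import torch
from transformers import AutoTokenizer, AutoModel

name = "dccuchile/bert-base-spanish-wwm-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()
batch = tok("Prueba de inferencia en CPU.", return_tensors="pt")

for n_threads in (1, torch.get_num_threads()):
    torch.set_num_threads(n_threads)
    with torch.no_grad():
        out = model(**batch)
    print(f"threads={n_threads}: OK, shape={tuple(out.last_hidden_state.shape)}")
```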
Thanks again for the support; I'll stay in touch with any updates.
Best regards, Juan Alberto Ignacio Videla
Buenos Aires - Argentina