Hi all,
On Apple Silicon M4 (macOS 15.5, 24GB RAM, torch 2.7.1 + transformers 4.41.2), I've encountered a reproducible segmentation fault when running CPU inference with the model dccuchile/bert-base-spanish-wwm-cased.
The model loads fine via from_pretrained(), but the actual forward pass (model(input_ids, attention_mask)) triggers a crash. After tracing the issue with LLDB, I've confirmed the fault originates in libomp.dylib during thread suspension inside libtorch_cpu.dylib while executing LayerNorm.
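For anyone who wants to try this quickly, here is a minimal sketch of the failing path (the scripts in the attached ZIP are the full versions; this is a simplified approximation):

```python
# Minimal CPU-only sketch of the crash path (not the exact script from the ZIP).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # loading alone does NOT crash
model.eval()

inputs = tokenizer("Hola, esto es una prueba de inferencia.", return_tensors="pt")

with torch.no_grad():
    # The forward pass is where the segfault occurs (traced to LayerNorm -> libomp).
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```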
Reproducible on:

- MacBook Pro with Apple M4, 24GB RAM, 1TB SSD
- Python 3.11.13 (Homebrew)
- torch 2.7.1 and transformers 4.41.2
- CPU inference only (no GPU, no MPS)
- dccuchile/bert-base-spanish-wwm-cased, but the issue may generalize

Not reproducible:

- On Intel Macs
- When only loading the model (no forward pass)
Attached ZIP package (via iCloud):

- Scripts: IFA_app.py, repro_beto_loader.py
- Full terminal logs (with and without crash)
- pip freeze, system info
- LLDB symbolic backtrace (lldb_backtrace_IFA_app.txt)
- README (EN + ES)

Download here: (iCloud Drive)
Would love to know if others on M4 can replicate this, or if there's known instability around OpenMP / LayerNorm on Apple Silicon CPUs.
Appreciate any insights!
Thanks,
Juan Alberto Ignacio Videla
Buenos Aires, Argentina
Update: This issue has now also been reported on GitHub for broader visibility and tracking:
GitHub Issue #39020: Segfault on Apple M4 using AutoModelForSequenceClassification with BETO model on CPU
It includes the full trace, the LLDB backtrace, and Apple Feedback ID FB18354497.
Happy to collaborate with anyone experiencing similar behavior or investigating libomp / LayerNorm interactions on Apple Silicon.
I can't open iCloud…
In any case, I don't think the cause is Transformers, as it doesn't usually cause a segfault by itself. More likely it is PyTorch (especially version 2.3, which should not be used in practice…) or the underlying environment. For example, with Apple MPS there can be compatibility issues with IPython, Jupyter, and the like.
https://stackoverflow.com/questions/71338821/segmentation-fault-python-after-import-torch-on-mac-m1
https://stackoverflow.com/questions/77812375/pytorch-error-on-mps-apple-silicon-metal
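As a quick sanity check on your side, something like this (all standard torch and platform calls) will show exactly which interpreter, torch build, and backends are in play:

```python
import platform
import torch

# Print the interpreter/architecture and the torch build details relevant here.
print("Python:", platform.python_version(), platform.machine())
print("torch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
print("Intra-op threads:", torch.get_num_threads())
print(torch.__config__.parallel_info())  # shows which OpenMP runtime torch was built against
```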
Hi John,
Many thanks for your reply!
You're right: Transformers itself is probably not the root cause.
It's unusual that this happens in CPU mode rather than MPS… It's likely a problem with a library in PyTorch that sits close to the hardware. If your PyTorch build is not fairly recent, you may run into problems with libomp.
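One cheap test for the libomp theory is to take OpenMP threading out of the picture before importing torch; if the crash disappears with a single thread, that points at the OpenMP runtime rather than the model. A minimal sketch:

```python
import os
# Must be set before torch is imported so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)  # also cap ATen's intra-op thread pool

# ...then load the model and run the same forward pass as before...
```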
Quoted PyTorch issue (opened Aug 1, 2024, closed Aug 2, 2024; labels: module: binaries, module: crash, triaged, module: macos, module: openmp):
### 🐛 Describe the bug
I encountered a very curious bug: the following code segfaults on my Apple macOS M3 Max CPU:
```python
import sklearn
import torch
import numpy as np
torch.tensor(np.zeros((33000,)))
```
`zsh: segmentation fault python3`
**It does NOT segfault if the array size is 32000 and it doesn't segfault if we don't import sklearn before torch.**
Versions:
```
torch 2.3.0.post100
numpy 1.26.4
sklearn 1.5.0
CPU: Apple M3 Max
```
### Versions
PyTorch version: 2.3.0.post100
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.3.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M3 Max
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0.post100
[conda] numpy 1.26.4 py312h7f4fdc5_0
[conda] numpy-base 1.26.4 py312he047099_0
[conda] pytorch 2.3.0 gpu_mps_py312hf36b297_100
cc @seemethere @malfet @osalpekar @atalman @albanD
Quoted PyTorch issue (opened Sep 17, 2024; labels: module: crash, triaged, module: macos, module: openmp):
### 🐛 I'm using libtorch in a C++ project; it worked for a while, then I got this after running the binary.
The bizarre thing is that it worked for a while, even on GPU; I trained and everything worked. Now it fails on a simple `torch::relu(tensor)` ...
Any idea? Is it M1-related? Would an older version work?
Here are the logs after `./cnn_bin`:
```
LibTorch version: 2.4.1
Tensor created successfully
AddressSanitizer:DEADLYSIGNAL
=================================================================
AddressSanitizer:DEADLYSIGNAL
==6330==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x0001213f0cf0 bp 0x00016be8aa50 sp 0x00016be8a9a0 T2)
AddressSanitizer:DEADLYSIGNAL
==6330==The signal is caused by a READ memory access.
==6330==Hint: address points to the zero page.
#0 0x1213f0cf0 in void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*)+0x30 (libomp.dylib:arm64+0x54cf0)
#1 0x12174151c in kmp_flag_64<false, true>::wait(kmp_info*, int, void*)+0x754 (libomp.dylib:arm64+0x4551c)
#2 0x12173c55c in __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*)+0xb4 (libomp.dylib:arm64+0x4055c)
#3 0x1217400e4 in __kmp_fork_barrier(int, int)+0x270 (libomp.dylib:arm64+0x440e4)
#4 0x12171ce10 in __kmp_launch_thread+0x150 (libomp.dylib:arm64+0x20e10)
#5 0x12175b008 in __kmp_launch_worker(void*)+0x114 (libomp.dylib:arm64+0x5f008)
#6 0x180976030 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x7030)
#7 0x180970e38 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1e38)
==6330==Register values:
x[0] = 0x0000000000000002 x[1] = 0x000000016be8ab30 x[2] = 0x0000000000000000 x[3] = 0x0000000fffffc088
x[4] = 0x0000000000000001 x[5] = 0x0000000000000000 x[6] = 0x000000016be8aca0 x[7] = 0x0000000000000000
x[8] = 0x0000000000000000 x[9] = 0x000000007fffffff x[10] = 0x00000000000003e8 x[11] = 0xce5899d053670034
x[12] = 0x00000000016e3600 x[13] = 0x000000000007e8e8 x[14] = 0x0000000000000000 x[15] = 0x0000000000000000
x[16] = 0x00000001213f0cc0 x[17] = 0x00000001e02c7480 x[18] = 0x0000000000000000 x[19] = 0x000000014a4f49c0
x[20] = 0x000000016be8ab30 x[21] = 0x000000016be8ab30 x[22] = 0x0000000121792c80 x[23] = 0x00000001217885a8
x[24] = 0x0000000000000002 x[25] = 0x0000000000000000 x[26] = 0x0000000121788548 x[27] = 0x000000014a4f4f08
x[28] = 0x000000012178b1e0 fp = 0x000000016be8aa50 lr = 0x0000000121741520 sp = 0x000000016be8a9a0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (libomp.dylib:arm64+0x54cf0) in void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*)+0x30
Thread T2 created by T0 here:
#0 0x10864c1b0 in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4c1b0)
#1 0x12175ab50 in __kmp_create_worker+0xcc (libomp.dylib:arm64+0x5eb50)
#2 0x12171cbb0 in __kmp_allocate_thread+0x420 (libomp.dylib:arm64+0x20bb0)
#3 0x121717640 in __kmp_allocate_team+0x90c (libomp.dylib:arm64+0x1b640)
#4 0x121719438 in __kmp_fork_call+0x16f8 (libomp.dylib:arm64+0x1d438)
#5 0x12170c084 in __kmpc_fork_call+0xc0 (libomp.dylib:arm64+0x10084)
#6 0x111140170 in at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, long long)+0x1ac (libtorch_cpu.dylib:arm64+0xb4170)
#7 0x1133a8644 in at::native::(anonymous namespace)::clamp_min_scalar_kernel_impl(at::TensorIteratorBase&, c10::Scalar)+0x378 (libtorch_cpu.dylib:arm64+0x231c644)
#8 0x1117da30c in void at::native::DispatchStub<void (*)(at::TensorIteratorBase&, c10::Scalar), at::native::clamp_min_scalar_stub_DECLARE_DISPATCH_type>::operator()<at::native::structured_clamp_min_out&, c10::Scalar const&>(c10::DeviceType, at::native::structured_clamp_min_out&, c10::Scalar const&)+0x74 (libtorch_cpu.dylib:arm64+0x74e30c)
#9 0x1123dbeac in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CPU_clamp_min(at::Tensor const&, c10::Scalar const&)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Scalar const&>>, at::Tensor (at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&)+0x78 (libtorch_cpu.dylib:arm64+0x134feac)
#10 0x112109be0 in at::_ops::clamp_min::call(at::Tensor const&, c10::Scalar const&)+0x114 (libtorch_cpu.dylib:arm64+0x107dbe0)
#11 0x1113f6014 in at::native::relu(at::Tensor const&)+0x4c (libtorch_cpu.dylib:arm64+0x36a014)
#12 0x11456d214 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::relu(c10::DispatchKeySet, at::Tensor const&)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&)+0x290 (libtorch_cpu.dylib:arm64+0x34e1214)
#13 0x1122742bc in at::_ops::relu::call(at::Tensor const&)+0x10c (libtorch_cpu.dylib:arm64+0x11e82bc)
#14 0x104f92e98 in at::relu(at::Tensor const&) relu.h:27
#15 0x104f92270 in main main.cpp:16
#16 0x1805f50dc (<unknown module>)
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer:DEADLYSIGNAL
==6330==ABORTING
```
### My simple program:
```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    std::cout << "LibTorch version: " << TORCH_VERSION_MAJOR << "."
              << TORCH_VERSION_MINOR << "."
              << TORCH_VERSION_PATCH << std::endl;
    try {
        // Create a tensor
        auto tensor = torch::randn({1, 3, 720, 720});
        std::cout << "Tensor created successfully" << std::endl;
        // Perform a simple operation
        auto result = torch::relu(tensor);
        std::cout << "Operation successful, result tensor size: " << result.sizes() << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Exception: " << e.what() << std::endl;
        return -1;
    }
    return 0;
}
```
### CMakeLists.txt
```
cmake_minimum_required(VERSION 3.10)
project(cnn_bin)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_BUILD_TYPE Debug)
# Path to the LibTorch folder (inside your project)
set(TORCH_PATH "${CMAKE_SOURCE_DIR}/libtorch")
# Find OpenCV
find_package(OpenCV REQUIRED)
# Find LibTorch
find_package(Torch REQUIRED PATHS ${TORCH_PATH}/share/cmake/Torch)
# Add include directories
include_directories(${OpenCV_INCLUDE_DIRS})
# Link OpenCV libraries
link_directories(${OpenCV_LIBRARY_DIRS})
# Enable AddressSanitizer
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address")
    set(CMAKE_LINKER_FLAGS "${CMAKE_LINKER_FLAGS} -fsanitize=address")
endif()
# Add executable
add_executable(cnn_bin main.cpp)
# Link LibTorch and OpenCV libraries
target_link_libraries(cnn_bin "${TORCH_LIBRARIES}" ${OpenCV_LIBS})
set_property(TARGET cnn_bin PROPERTY CXX_STANDARD 17)
```
### Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: macOS 14.3 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.25.2
Libc version: N/A
Python version: 3.9.6 (default, Dec 7 2023, 05:42:47) [Clang 15.0.0 (clang-1500.1.0.2.5)] (64-bit runtime)
Python platform: macOS-14.3-arm64-arm-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
cc @malfet @albanD
Hi John, Nikita, and team,
Thank you for your responses and for following up. I confirm that I'm working in an environment running Apple Silicon M4, and I was using PyTorch 2.3.0 along with torchvision 0.18.1, both installed from PyPI under the "stable" label.
Following your suggestions, I'll upgrade to PyTorch 2.7.1 and check compatibility with the matching torchvision release (0.22.x for PyTorch 2.7). I'll also reinstall libomp to make sure there are no low-level conflicts related to CPU parallelism.
I'll run tests over the next few days and share the results. Given that the M4 chip is relatively new, I understand these reports might be helpful for future compatibility validation.
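The kind of check I plan to rerun after the upgrade looks roughly like this (a sketch only; the real tests will use IFA_app.py and repro_beto_loader.py from the ZIP):

```python
# Run the same CPU forward pass single-threaded and multi-threaded, to see whether
# any remaining crash is tied to OpenMP parallelism.
import torch
from transformers import AutoTokenizer, AutoModel

name = "dccuchile/bert-base-spanish-wwm-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()
batch = tok("Prueba de inferencia en CPU.", return_tensors="pt")

for n_threads in (1, torch.get_num_threads()):
    torch.set_num_threads(n_threads)
    with torch.no_grad():
        out = model(**batch)
    print(f"threads={n_threads}: OK, shape={tuple(out.last_hidden_state.shape)}")
```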
Thanks again for the support; I'll stay in touch with any updates.
Best regards, Juan Alberto Ignacio Videla
Buenos Aires - Argentina