Hello, I am trying to use accelerate with fastai to achieve distributed training. The SLURM system that I have access to has 4 P100 GPUs.
Tue Oct 4 13:20:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 32C P0 26W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:05:00.0 Off | 0 |
| N/A 32C P0 27W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:06:00.0 Off | 0 |
| N/A 32C P0 28W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 25W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Here is the config
{
"compute_environment": "LOCAL_MACHINE",
"deepspeed_config": {},
"distributed_type": "MULTI_GPU",
"downcast_bf16": false,
"fsdp_config": {},
"machine_rank": 0,
"main_process_ip": null,
"main_process_port": null,
"main_training_function": "main",
"mixed_precision": "no",
"num_machines": 1,
"num_processes": 4,
"use_cpu": false
}
Here is the code
learn = vision_learner(dls,
                       resnet50,
                       metrics=[
                           partial(accuracy_multi, thresh=0.5),
                           f1score_multi_avg, f1score_multi, f2score_multi
                       ],
                       cbs=[WandbCallback()])
with learn.distrib_ctx():
    learn.fine_tune(settings.EPOCHS, ideal_lr[0], freeze_epochs=4)
Here is the error I get:
terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::CUDAOutOfMemoryErrorc10::CUDAOutOfMemoryErrorc10::CUDAOutOfMemoryError'
'
'
what(): what(): CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 6.12 GiB already allocated; 102.81 MiB free; 6.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2abd25a52905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2abd259f45bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2abd259f52c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2abd259f57d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2abd0c73552f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317750c (0x2abd0c88450c in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x317757b (0x2abd0c88457b in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1e54ca5 (0x2abd00fa8ca5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
what(): CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 1.05 GiB already allocated; 196.81 MiB free; 1.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2ad79cc5c905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2ad79cbfe5bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2ad79cbff2c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2ad79cbff7d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2ad78393f52f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317703a (0x2ad783a8e03a in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x31bb12a (0x2ad783ad212a in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::meta::structured_max_pool2d_with_indices::meta(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool) + 0x89e (0x2ad77784e0ee in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 2.00 GiB already allocated; 344.81 MiB free; 2.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2ba307a90905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2ba307a325bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2ba307a332c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2ba307a337d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2ba2ee77352f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317750c (0x2ba2ee8c250c in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x317757b (0x2ba2ee8c257b in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1e54ca5 (0x2ba2e2fe6ca5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 3.15 GiB already allocated; 100.81 MiB free; 3.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2b01394d1905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2b01394735bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2b01394742c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2b01394747d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2b01201b452f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x3082d6e (0x2b012020ed6e in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x2b012020fa16 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x314f4e8 (0x2b01202db4e8 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x314f590 (0x2b01202db590 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: at::_ops::cudnn_convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x16f (0x2b01148ba99f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
Am I getting this error because of an improper config? I have seen this error before; I believe it is caused by GPU memory fragmentation. In the past, the usual solution was to restart the kernel and train the model with a smaller batch size, but I’m not sure how I would do that in SLURM. TIA!
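As a quick sanity check on the log, here is a minimal sketch that tallies the four "CUDA out of memory" messages above. The message strings are abbreviated copies from the log; the `summarize` helper and the regular expression are illustrative assumptions, not part of fastai, accelerate, or PyTorch. The four messages were emitted at slightly different moments, so the summed figure is only a rough indication.

```python
import re

# Abbreviated copies of the four OOM messages from the log above.
messages = [
    "Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 6.12 GiB already allocated; 102.81 MiB free; 6.19 GiB reserved in total by PyTorch)",
    "Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 1.05 GiB already allocated; 196.81 MiB free; 1.07 GiB reserved in total by PyTorch)",
    "Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 2.00 GiB already allocated; 344.81 MiB free; 2.03 GiB reserved in total by PyTorch)",
    "Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 3.15 GiB already allocated; 100.81 MiB free; 3.17 GiB reserved in total by PyTorch)",
]

# Hypothetical pattern matching the message layout seen in this particular log.
PATTERN = re.compile(
    r"\(GPU (\d+); ([\d.]+) GiB total capacity; "
    r"([\d.]+) GiB already allocated; ([\d.]+) MiB free; "
    r"([\d.]+) GiB reserved"
)


def summarize(msgs):
    """Collect the GPU indices named in the messages and sum the reserved memory."""
    gpus = set()
    reserved_gib = 0.0
    capacity_gib = None
    for msg in msgs:
        match = PATTERN.search(msg)
        if match is None:
            continue  # skip anything that does not look like an OOM message
        gpus.add(int(match.group(1)))
        capacity_gib = float(match.group(2))
        reserved_gib += float(match.group(5))
    return {"gpus": gpus, "reserved_gib": reserved_gib, "capacity_gib": capacity_gib}


summary = summarize(messages)
print("GPUs named in the errors:", sorted(summary["gpus"]))
print("Sum of reserved memory (GiB):", round(summary["reserved_gib"], 2))
print("Capacity of one GPU (GiB):", summary["capacity_gib"])
```

Every message names GPU 0, and each process on its own reserved well under the 15.90 GiB capacity. That pattern would be consistent with several processes sharing one device rather than each using its own GPU, although the log alone does not prove it.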