Use `accelerate` in SLURM environment

Hello, I am trying to use accelerate with fastai for distributed training. The SLURM system that I have access to has 4 P100 GPUs.

Tue Oct  4 13:20:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   32C    P0    27W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0    28W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the config

{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": null,
  "main_process_port": null,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 4,
  "use_cpu": false
}
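As a quick sanity check on a config like the one above, here is a minimal sketch that compares the number of processes with the number of GPUs. It assumes one process per GPU and that `num_processes` counts processes across all machines; `check_config` and `gpus_per_node` are hypothetical names used only for illustration, and only a subset of the config keys is repeated here.

```python
import json

# A subset of the config values from the post above.
CONFIG_TEXT = """
{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 4,
  "use_cpu": false
}
"""


def check_config(config, gpus_per_node):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if config.get("distributed_type") != "MULTI_GPU":
        problems.append("distributed_type is not MULTI_GPU")
    if config.get("use_cpu"):
        problems.append("use_cpu is true")
    # Assumption: one training process per GPU on every machine.
    expected = gpus_per_node * config.get("num_machines", 1)
    if config.get("num_processes") != expected:
        problems.append(
            f"num_processes is {config.get('num_processes')}, expected {expected}"
        )
    return problems


config = json.loads(CONFIG_TEXT)
print(check_config(config, gpus_per_node=4))  # prints []
```

With 4 GPUs on a single node, this check finds nothing wrong with the values shown, which suggests looking beyond the config for the cause.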

Here is the code

learn = vision_learner(dls,
                       resnet50,
                       metrics=[
                           partial(accuracy_multi, thresh=0.5),
                           f1score_multi_avg, f1score_multi, f2score_multi
                       ],
                       cbs=[WandbCallback()])
with learn.distrib_ctx():
    learn.fine_tune(settings.EPOCHS, ideal_lr[0], freeze_epochs=4)

Here is the error I get

terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::CUDAOutOfMemoryErrorc10::CUDAOutOfMemoryErrorc10::CUDAOutOfMemoryError'
'
'
  what():    what():  CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 6.12 GiB already allocated; 102.81 MiB free; 6.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2abd25a52905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2abd259f45bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2abd259f52c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2abd259f57d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2abd0c73552f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317750c (0x2abd0c88450c in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x317757b (0x2abd0c88457b in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1e54ca5 (0x2abd00fa8ca5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
  what():  CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 1.05 GiB already allocated; 196.81 MiB free; 1.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2ad79cc5c905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2ad79cbfe5bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2ad79cbff2c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2ad79cbff7d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2ad78393f52f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317703a (0x2ad783a8e03a in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x31bb12a (0x2ad783ad212a in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::meta::structured_max_pool2d_with_indices::meta(at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool) + 0x89e (0x2ad77784e0ee in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 2.00 GiB already allocated; 344.81 MiB free; 2.03 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2ba307a90905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2ba307a325bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2ba307a332c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2ba307a337d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2ba2ee77352f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x317750c (0x2ba2ee8c250c in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x317757b (0x2ba2ee8c257b in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1e54ca5 (0x2ba2e2fe6ca5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)


terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 3.15 GiB already allocated; 100.81 MiB free; 3.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.7/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2b01394d1905 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2b01394735bf in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2b01394742c5 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2b01394747d2 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xef (0x2b01201b452f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x3082d6e (0x2b012020ed6e in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x2b012020fa16 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x314f4e8 (0x2b01202db4e8 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x314f590 (0x2b01202db590 in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: at::_ops::cudnn_convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x16f (0x2b01148ba99f in /project/6062137/vannary/CCTV/fastai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

Am I getting this error because of an improper config? I have seen this error before; it's because of memory fragmentation on the GPU. In the past, the usual solution was to restart the kernel and train the model with a smaller batch size. I’m not sure how I would do that in SLURM. TIA!
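One way to read the four out-of-memory messages above is to pull the numbers out of each and compare them. The sketch below does that with a regular expression; `parse_oom` is a hypothetical helper, and the message strings are shortened copies of the ones in the log. Note that the figures are snapshots taken at slightly different moments, so the sum is only indicative.

```python
import re

# Matches the numeric fields of a PyTorch CUDA out-of-memory message.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) MiB "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) MiB free"
)


def parse_oom(message):
    """Extract device index and memory figures, or return None if no match."""
    m = OOM_PATTERN.search(message)
    if m is None:
        return None
    return {
        "gpu": int(m["gpu"]),
        "request_mib": float(m["request"]),
        "total_gib": float(m["total"]),
        "allocated_gib": float(m["allocated"]),
        "free_mib": float(m["free"]),
    }


# Shortened copies of the four messages from the log above.
MESSAGES = [
    "Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 6.12 GiB already allocated; 102.81 MiB free;",
    "Tried to allocate 196.00 MiB (GPU 0; 15.90 GiB total capacity; 1.05 GiB already allocated; 196.81 MiB free;",
    "Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 2.00 GiB already allocated; 344.81 MiB free;",
    "Tried to allocate 392.00 MiB (GPU 0; 15.90 GiB total capacity; 3.15 GiB already allocated; 100.81 MiB free;",
]

reports = [parse_oom(msg) for msg in MESSAGES]
gpus = sorted({r["gpu"] for r in reports})
allocated = round(sum(r["allocated_gib"] for r in reports), 2)
print("devices named in the errors:", gpus)
print("sum of per-process allocations (GiB):", allocated)
```

All four messages name GPU 0, and each process had allocated only part of the 15.90 GiB while very little was free. That pattern would be consistent with several processes sharing one device rather than with fragmentation alone, although the log by itself does not prove it.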

cc @sengv1

cc @muellerzr

Can you show how you’re building your dataloaders and what batch size you’re using? That would help us get started!

Here is the major chunk of the code.

dir_path = settings.BASE_DIR
path = Path(dir_path)

settings.EPOCHS = 50
settings.BATCHSIZE = 64

def get_x(r):
    return path/'train'/r['fname']

def get_y(r): return r['labels'].split(' ')

set_seed(42, reproducible=True)
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   get_x=get_x,
                   get_y=get_y,
                   item_tfms=Resize(224, method='squish'))
dls = dblock.dataloaders(df, bs=settings.BATCHSIZE)

print("length of train and valid data:")
print(len(dls.train),len(dls.valid))
print("---------------------------------")

f1score_multi_avg = F1ScoreMulti()
f1score_multi_avg.name = "F1 Average"
f1score_multi = FBetaMulti(1, average=None)
f1score_multi.name = "F1 Multi"
f2score_multi = FBetaMulti(2, average=None)
f2score_multi.name = "F2 Multi"

learn = vision_learner(dls, resnet50, metrics=[partial(accuracy_multi, thresh=0.5),
                                                f1score_multi_avg, f1score_multi, f2score_multi],
                        cbs=[WandbCallback()])

ideal_lr = learn.lr_find(show_plot=False)

with learn.distrib_ctx():
    learn.fine_tune(settings.EPOCHS, ideal_lr[0], freeze_epochs=4)
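In the script above, `lr_find` is called outside the `distrib_ctx` block, so it may be worth checking which device each launched process targets at that point. Below is a minimal diagnostic sketch, assuming the launcher sets the usual PyTorch distributed variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). It also shows one way to change the batch size per SLURM job without editing the script; `TRAIN_BATCHSIZE` is a hypothetical variable name, not something accelerate or fastai reads.

```python
import os


def describe_process(env=os.environ):
    """Collect the launcher and scheduler variables that identify this process."""
    keys = (
        "RANK",
        "LOCAL_RANK",
        "WORLD_SIZE",
        "CUDA_VISIBLE_DEVICES",
        "SLURM_JOB_ID",
        "SLURM_PROCID",
    )
    return {key: env.get(key, "<unset>") for key in keys}


def batch_size_from_env(default=64, env=os.environ):
    """Read the batch size from TRAIN_BATCHSIZE (hypothetical name), else use the default."""
    return int(env.get("TRAIN_BATCHSIZE", default))


# Call these near the top of the training script, before any GPU work starts.
print(describe_process())
print("batch size:", batch_size_from_env())
```

In a job script, exporting `TRAIN_BATCHSIZE=32` before the launch command would then halve the batch size from 64, and the value returned by `batch_size_from_env()` could be passed as `bs=` when building the dataloaders.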

Hey @muellerzr, any update on this?

Hey deven, sorry for the delay. Out of curiosity, can you try removing the wandb callback from your code? Otherwise, what can you tell me about your dataset, size-wise, so I can best try to recreate this on my end without your dataset?

@muellerzr The actual dataset has over 1 million images for training and around 130k for validation. You can use a smaller dls instead.

I’ll also remove the wandb callback and let you know.
