Llama 2 with accelerate issues

My machine has 4 A100 GPUs and I am trying to train llama2-7b-hf using LoRA. Here are some machine details:

nvcc --version (CUDA version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

python3 -c "import torch; print(torch.__version__)" (PyTorch version):
2.0.1+cu117
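
For context, here is a minimal sketch of what train_model does up to the failing call. The LoRA config, optimizer, dataloader, dtype, and dataset below are placeholders and not my exact code; the accelerator.prepare call at the end is the one that fails.

# Launched roughly as: accelerate launch --multi_gpu --num_processes 4 run_models.py ...
import torch
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

accelerator = Accelerator(gradient_accumulation_steps=4)  # placeholder value

# Base model wrapped with LoRA adapters (placeholder hyperparameters).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
model = get_peft_model(
    model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, lora_dropout=0.05)
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)   # placeholder
dataloader = DataLoader(train_dataset, batch_size=1)         # placeholder dataset
lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)  # placeholder

# Works on a single GPU, but fails on multi-GPU with "Invalid scalar type":
model, optimizer, dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, dataloader, lr_scheduler
)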

When I run my trainer using accelerate (it runs fine on a single GPU), every rank fails with the same error:
Traceback (most recent call last):
  File "/home/paperspace/DigitalSynapse/models/run_models.py", line 81, in <module>
    run_model()
  File "/home/paperspace/DigitalSynapse/models/run_models.py", line 77, in run_model
    model.train_model(gradient_accum_steps=args.batch_size,
  File "/home/paperspace/DigitalSynapse/models/reviews_model.py", line 97, in train_model
    model, optimizer, dataloader, lr_scheduler = accelerator.prepare(
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1370, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 676, in __init__
    _sync_module_states(
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 142, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 160, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3824) of binary: /usr/bin/python3.9
Traceback (most recent call last):
  File "/home/paperspace/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/paperspace/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/paperspace/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Please help. If this is not the correct place to ask questions, it would be awesome if someone could point me in the right direction.


Did you find a solution for this error?

Not yet… do you have any ideas?

I think I have finally figured this out. My guess is that the error happens because the layer_norm layers get sharded, which makes it hard to compute the norms. I fixed it with a few lines of code:

import torch
from accelerate import Accelerator, dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

# These are the layer class names in the model code. Had to dig these up.
no_split_module_classes = [
    "LlamaDecoderLayer", "LlamaAttention", "Linear", "LlamaMLP", "LlamaRMSNorm"
]

# Balance the model across the available GPUs without splitting the listed modules.
max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=no_split_module_classes,
    dtype=torch.bfloat16,
)
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=no_split_module_classes,
    dtype=torch.bfloat16,
)
model = dispatch_model(model, device_map=device_map)

# Also, when calling accelerator.prepare, do not pass the model to it, like so:
optimizer, dataloader, lr_scheduler = accelerator.prepare(
    optimizer, dataloader, lr_scheduler
)
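
The no_split_module_classes list keeps each of those modules (in particular LlamaRMSNorm) on a single GPU, and since dispatch_model has already placed the weights, the model stays out of accelerator.prepare. For completeness, here is a rough sketch of how the training loop can look after this change; the loss handling is a placeholder, adapt it to your own loop:

model.train()
for batch in dataloader:
    outputs = model(**batch)  # hooks added by dispatch_model move activations between GPUs
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()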