Accelerate throws CUDA OOM

Hello everyone, I have spent quite a while trying to figure out the best way to distribute Llama 3.1-70B over 4 GPUs on a single machine. The code I came up with is the following:

import gc

import torch
from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM

# (num_gpus, available_gpus, gpu_ids, gpu_memory, cpu_memory, unsplittable_layers,
#  model_name and hf_token are all set earlier in the script)

if num_gpus > available_gpus:
    raise ValueError(f"Requested {num_gpus} GPUs, but only {available_gpus} are available.")

# Determine whether to use GPU or CPU
use_gpu = num_gpus > 0 and torch.cuda.is_available()

# If the model doesn't fit in CPU RAM either, it will be offloaded to disk
max_memory = {"cpu": cpu_memory}
gc.collect()  # Free up memory before loading the model

if use_gpu:
    # Keep the CPU budget and add a per-GPU budget for each GPU to use
    max_memory.update({i: gpu_memory for i in range(num_gpus)})
    print(f"Using GPUs: {gpu_ids}")
    torch.cuda.empty_cache()  # Free up CUDA cache in case it's needed

# Download the model repository from the Hugging Face Hub
model_location = snapshot_download(model_name, token=hf_token)

# Initialize the model with empty (meta) weights to avoid running out of memory
config = AutoConfig.from_pretrained(model_location)
with init_empty_weights():
    model_weights = AutoModelForCausalLM.from_config(config)

if use_gpu:
    # Infer the device map based on available memory for GPU
    device_map = infer_auto_device_map(model_weights, max_memory=max_memory,
                                       no_split_module_classes=unsplittable_layers)

    model_weights = load_checkpoint_and_dispatch(model_weights, model_location, device_map=device_map,
                                                 offload_folder="offload")
else:
    # Load model directly on CPU
    model_weights = AutoModelForCausalLM.from_pretrained(model_location)

    print("Model loaded successfully.")

I launch the code with:

accelerate launch script.py
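If I understand correctly, with num_processes: 4 (see the config below) this spawns four copies of script.py, one per GPU, and each copy runs the loading code above. Here is a small diagnostic sketch of how I would confirm that; it is not part of my script, and I am assuming the launcher sets LOCAL_RANK for each process:

import os
import torch

# Each process spawned by `accelerate launch` should get its own LOCAL_RANK,
# so this line is expected to print once per process (4 times in my case).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(f"local_rank={local_rank}, visible GPUs: {torch.cuda.device_count()}")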

This is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: 'cpu'
  offload_param_device: 'cpu'
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
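This file sits in the default location, so accelerate launch picks it up automatically; as far as I know, the effective settings can be double-checked with accelerate env, or a specific file can be forced explicitly:

accelerate launch --config_file /path/to/config.yaml script.py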

This script throws a CUDA OOM error, even though the 4 GPUs should have enough free memory for the model (the kind of per-GPU check I rely on is sketched below).
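Just a diagnostic sketch, not part of script.py:

import torch

# Free vs. total memory for each visible GPU, in GiB.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")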

The error is strange because it happens regardless of the model size (both with the 8B and the 70B). In particular:

  1. The 8B model works fine when I run the script directly, but as soon as I launch it with accelerate an OOM occurs.
  2. The 70B model doesn't work either way, with or without accelerate launch.

My questions are:

  1. Is it correct to use load_checkpoint_and_dispatch together with accelerate launch, or is that redundant? (The single-process alternative I have in mind is sketched right after these questions.)
  2. How can the system run out of memory if I am using DeepSpeed ZeRO stage 3 with CPU offload? It should be slower, but at least it should work, shouldn't it?
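Regarding question 1, the more compact single-process alternative I have in mind is roughly the following (a sketch under my assumptions, reusing model_location and max_memory from the script above; the bfloat16 choice is mine and not something I have validated):

import torch
from transformers import AutoModelForCausalLM

# Let transformers/accelerate infer the device map and dispatch at load time,
# instead of calling infer_auto_device_map + load_checkpoint_and_dispatch by hand.
model = AutoModelForCausalLM.from_pretrained(
    model_location,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
    offload_folder="offload",
)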

Thanks for the help :blush:
