Hello everyone, I have been spending quite a while trying to figure out the best way to distribute a Llama3.1-70B model over 4 GPUs within a single machine. The code I was able to write is the following (`num_gpus`, `available_gpus`, `cpu_memory`, `gpu_memory`, `gpu_ids`, `unsplittable_layers`, `model_name`, and `hf_token` are set earlier in the script):
```python
import gc

import torch
from accelerate import infer_auto_device_map, init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM

if num_gpus > available_gpus:
    raise ValueError(f"Requested {num_gpus} GPUs, but only {available_gpus} are available.")

# Determine whether to use GPU or CPU
use_gpu = num_gpus > 0 and torch.cuda.is_available()

# If the model doesn't fit in CPU RAM either, it will be offloaded to the hard drive
max_memory = {"cpu": cpu_memory}

gc.collect()  # Free up memory before loading the model

if use_gpu:
    max_memory = {i: gpu_memory for i in range(num_gpus)}
    print(f"Using GPUs: {gpu_ids}")
    torch.cuda.empty_cache()  # Free up the CUDA cache in case it's needed

# Download the model repository from the Hugging Face Hub
model_location = snapshot_download(model_name, token=hf_token)

# Initialize the model with empty weights to avoid going out of memory
config = AutoConfig.from_pretrained(model_location)
with init_empty_weights():
    model_weights = AutoModelForCausalLM.from_config(config)

if use_gpu:
    # Infer the device map based on the available memory per GPU
    device_map = infer_auto_device_map(model_weights, max_memory=max_memory,
                                       no_split_module_classes=unsplittable_layers)
    model_weights = load_checkpoint_and_dispatch(model_weights, model_location,
                                                 device_map=device_map,
                                                 offload_folder="offload")
else:
    # Load the model directly on CPU
    model_weights = AutoModelForCausalLM.from_pretrained(model_location)

print("Model loaded successfully.")
```
I launch the code with:
```
accelerate launch script.py
```
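As far as I understand, `accelerate launch` spawns one process per `num_processes`, each running the whole script. A tiny sketch I used to convince myself of that (using accelerate's `PartialState`; the file name `check_ranks.py` is just a hypothetical example):

```python
# check_ranks.py -- run with: accelerate launch check_ranks.py
from accelerate import PartialState

state = PartialState()  # reads the env vars set by `accelerate launch`
# With num_processes: 4 this prints four lines, one per spawned process
print(f"process {state.process_index} of {state.num_processes} on device {state.device}")
```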
This is my accelerate config:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: 'cpu'
  offload_param_device: 'cpu'
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
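If it helps, my understanding is that the `deepspeed_config` block above is equivalent to building the plugin in code like this (a rough sketch using accelerate's `DeepSpeedPlugin`, still run under `accelerate launch`; it is not what my script currently does):

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Programmatic equivalent of the deepspeed_config section of my YAML (sketch)
plugin = DeepSpeedPlugin(
    zero_stage=3,                    # zero_stage: 3
    gradient_accumulation_steps=1,   # gradient_accumulation_steps: 1
    offload_optimizer_device="cpu",  # offload_optimizer_device: 'cpu'
    offload_param_device="cpu",      # offload_param_device: 'cpu'
    zero3_init_flag=True,            # zero3_init_flag: true
)
accelerator = Accelerator(deepspeed_plugin=plugin)
```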
This code throws an OOM error, even though the 4 GPUs have enough space (see the quick memory check sketched after the list below).
The error is strange because it happens regardless of the model size (both with 8B and 70B). In particular:
- The 8B works fine with my script when I run it without accelerate, but as soon as I launch it with the latter an OOM occurs.
- The 70B doesn't work whether or not I launch the script with accelerate.
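For reference, this is the kind of check that backs the "enough space" claim, run right before loading (a minimal sketch; `torch.cuda.mem_get_info` is the standard PyTorch call):

```python
import torch

# Print free/total memory for each visible GPU just before loading the model
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```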
My questions are:
- Is it right to use `load_checkpoint_and_dispatch` and `accelerate launch` at the same time, or is it redundant?
- How can the system go OOM if I am using DeepSpeed stage 3 with offload to CPU? It should be slower, but at least it should work properly, shouldn't it?
Thanks for the help