How to ensure that while running with llama2-70B, we use parallelism?

I am trying to run finetuning on llama2-70B using this repo

but keep running into the error that “failed to allocate 448 MB”


Hi @gildesh, this is a memory allocation error. Could you let me know the following?

  • the batch size you use
  • the world size (number of devices)
  • do you use DeepSpeed?

Please also copy paste your command here.

python3 intelx/workflows/chatbot/utils/ --world_size 2 --use_deepspeed intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/ --bf16 True --train_file merged_final_ultimate_andy.json --task completion --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 --learning_rate 0.0001 --logging_steps 1 --do_train True --num_train_epochs 2 --log_level info --output_dir ./output/peft_model --peft lora --use_fast_tokenizer false --habana True --use_habana True --use_lazy_mode True --throughput_warmup_steps 3 --deepspeed /cnvrg/optimum-habana/gaudi_config.json

@gildesh Here the error is different: it seems you only have one available device. What’s the output of hl-smi?

Yes, sorry for the confusion. We were trying a lot of different things! :smiley:
We would have to rerun to confirm, but assuming the output shows just 1 device (HPU), how do we solve this?

Llama-70B is a big model so it may not work at all on 1 device (is it Gaudi1 or Gaudi2?).

My recommendations for decreasing the memory footprint are the following:

  • Use gradient checkpointing with --gradient_checkpointing. This will likely slow down your run, but with the benefit of lower memory consumption.
  • Decrease the size of your batches. You can then compensate with gradient_accumulation_steps if you want to keep the same global size as before.
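To illustrate the second point, here is a minimal sketch of the arithmetic, using the values from the command above (per-device batch size 2, 4 accumulation steps, world size 2). The helper names are hypothetical and are not part of optimum-habana or DeepSpeed.

```python
def global_batch_size(per_device_batch_size, gradient_accumulation_steps, world_size):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * world_size


def accumulation_steps_for(target_global, per_device_batch_size, world_size):
    """Accumulation steps needed to reach target_global with a given per-device batch."""
    samples_per_forward = per_device_batch_size * world_size
    if target_global % samples_per_forward != 0:
        raise ValueError("target global batch size is not reachable with these settings")
    return target_global // samples_per_forward


# Values taken from the command earlier in this thread.
before = global_batch_size(per_device_batch_size=2, gradient_accumulation_steps=4, world_size=2)

# Halve the per-device batch to save memory, then compensate with accumulation.
steps_after = accumulation_steps_for(before, per_device_batch_size=1, world_size=2)

print(before, steps_after)
```

So going from --per_device_train_batch_size 2 to 1 while raising --gradient_accumulation_steps from 4 to 8 should keep the global batch size unchanged.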

Now, if you can access several devices, DeepSpeed can help you too. Could you show me your DeepSpeed configuration /cnvrg/optimum-habana/gaudi_config.json please?

{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": false,
    "reduce_scatter": false,
    "contiguous_gradients": true
  }
}

So you’re using ZeRO-1. You could use ZeRO-2 to save more memory:

And maybe even ZeRO-3 for even larger gains:
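As a rough sketch of what such configs could look like, the snippet below takes the ZeRO-1 config shown earlier in this thread and only changes the "stage" value. This is an assumption for illustration, not necessarily the exact configs recommended here: the other "zero_optimization" options are carried over unchanged and may need tuning, and ZeRO-3 has additional stage-specific options that are not shown. The helper name with_zero_stage is hypothetical.

```python
import copy
import json

# The ZeRO-1 config posted earlier in this thread.
ZERO1_CONFIG = {
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": False,
        "reduce_scatter": False,
        "contiguous_gradients": True,
    },
}


def with_zero_stage(config, stage):
    """Return a copy of config with a different ZeRO stage (hypothetical helper)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2 or 3")
    new_config = copy.deepcopy(config)  # leave the original untouched
    new_config["zero_optimization"]["stage"] = stage
    return new_config


zero2_config = with_zero_stage(ZERO1_CONFIG, 2)
zero3_config = with_zero_stage(ZERO1_CONFIG, 3)

# Print JSON that could be saved to a file and passed with --deepspeed.
print(json.dumps(zero3_config, indent=2))
```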

But you will save memory with DeepSpeed only if you use several devices (as gradients, model parameters and optimizer states are spread across devices).
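To make this concrete, here is a back-of-the-envelope estimate based on the usual ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for the 16-bit weights, 2 for the gradients and 12 for the optimizer states. It assumes full fine-tuning and ignores activations and buffers, so it is only an upper-level sketch; with LoRA, as in the command above, the gradient and optimizer terms would be far smaller. The function name is hypothetical.

```python
def zero_model_state_bytes(num_params, num_devices, stage):
    """Approximate per-device bytes for model states under a given ZeRO stage.

    Assumes mixed-precision Adam and full fine-tuning; activations are ignored.
    """
    params = 2 * num_params      # 16-bit weights
    grads = 2 * num_params       # 16-bit gradients
    optimizer = 12 * num_params  # fp32 weights, momentum and variance
    if stage >= 1:
        optimizer /= num_devices  # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_devices      # ZeRO-2 also partitions gradients
    if stage >= 3:
        params /= num_devices     # ZeRO-3 also partitions parameters
    return params + grads + optimizer


NUM_PARAMS = 70e9  # Llama 2 70B

for devices in (1, 8, 24):
    for stage in (1, 2, 3):
        gigabytes = zero_model_state_bytes(NUM_PARAMS, devices, stage) / 1e9
        print(f"devices={devices:2d} ZeRO-{stage}: {gigabytes:8.1f} GB per device")
```

With a single device every stage gives the same number, which is why DeepSpeed alone does not help there. The savings only appear as the device count grows.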

Hi, I was wondering what comes after the ZeRO-3 stage, and how do you recommend using it?
We have access to multiple devices and they are all running out of memory. I would love to have some guidance on this.

Hi @DaniAtalla! To use DeepSpeed ZeRO-3, first install DeepSpeed with

pip install git+

or maybe @1.10.0 as I see you’re using SynapseAI v1.10.0

And then run your script with

deepspeed --num_gpus 8 --no_local_rank args --deepspeed deepspeed_config.json

using for example this DeepSpeed config:

Then, if you want to train Llama 2 70B, you’ll probably need more than one Gaudi2 node.

Okay, so I have access to 3 nodes.
I will try it and let you know.
Thanks for your help!

@DaniAtalla Okay, so it could work then, difficult to say without trying it out. You may also want to check out this repo with guidelines for multi-node training: