I am trying to run fine-tuning on Llama 2 70B using this repo,
but I keep running into the error "failed to allocate 448 MB".
Hi @gildesh, this is a memory allocation error. Could you let me know some details about your setup?
Please also copy and paste your command here.
python3 intelx/workflows/chatbot/utils/gaudi_spawn.py --world_size 2 --use_deepspeed \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --bf16 True --train_file merged_final_ultimate_andy.json --task completion \
  --per_device_train_batch_size 2 --per_device_eval_batch_size 2 \
  --gradient_accumulation_steps 4 --evaluation_strategy no \
  --save_strategy steps --save_steps 2000 --save_total_limit 1 \
  --learning_rate 0.0001 --logging_steps 1 --do_train True \
  --num_train_epochs 2 --log_level info --output_dir ./output/peft_model \
  --peft lora --use_fast_tokenizer false --habana True --use_habana True \
  --use_lazy_mode True --throughput_warmup_steps 3 \
  --deepspeed /cnvrg/optimum-habana/gaudi_config.json
@gildesh Here the error is different: it seems you only have one available device. What's the output of hl-smi?
Yes, sorry for the confusion. We were trying a lot of different things!
I would have to rerun, but assuming the output is just 1 HPU, how can I solve this?
Llama-70B is a big model so it may not work at all on 1 device (is it Gaudi1 or Gaudi2?).
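As a rough back-of-envelope check: 70B parameters in bf16 take 70e9 × 2 bytes ≈ 140 GB for the weights alone, which already exceeds the 96 GB of HBM on a single Gaudi2 (and Gaudi1 only has 32 GB), before even counting gradients, optimizer states, and activations.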
My recommendations for decreasing the memory footprint are the following: enable --gradient_checkpointing. This will likely slow down your run, but with the benefit of a smaller memory consumption. You can also decrease your per-device batch size and increase gradient_accumulation_steps if you want to keep the same global batch size as before (see the sketch below).
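For example, your earlier command could become something like the following (a sketch: --gradient_checkpointing is a standard Trainer flag, and halving the per-device batch size while doubling the accumulation steps keeps the global batch size at 2 devices × 1 × 8 = 16, the same as your current 2 × 2 × 4):

python3 intelx/workflows/chatbot/utils/gaudi_spawn.py --world_size 2 --use_deepspeed \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --gradient_checkpointing True \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  ... (all your other flags unchanged)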
Now, if you can access several devices, DeepSpeed can help you too. Could you show me your DeepSpeed configuration /cnvrg/optimum-habana/gaudi_config.json please?
{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": false,
    "reduce_scatter": false,
    "contiguous_gradients": true
  }
}
So you're using ZeRO-1, which shards only the optimizer states across devices. You could use ZeRO-2, which also shards the gradients, to save more memory: https://github.com/huggingface/optimum-habana/blob/main/tests/configs/deepspeed_zero_2.json
And maybe even ZeRO-3, which additionally shards the model parameters, for even larger gains: https://github.com/huggingface/optimum-habana/blob/main/examples/summarization/ds_flan_t5_z3_config_bf16.json
But note that you will save memory with DeepSpeed only if you use several devices, since the optimizer states, gradients, and model parameters are spread across them. A ZeRO-3 config sketch follows below.
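For reference, a ZeRO-3 version of your config could look something like this (a sketch modeled on the linked example; double-check the fields against it before using):

{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_scatter": false,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}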
Hi, I was wondering what comes after ZeRO stage 3, and how do you recommend using it?
We have access to multiple devices and they are all running out of memory. I would love some guidance on this.
Hi @DaniAtalla! To use DeepSpeed ZeRO-3, first install DeepSpeed with
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0
or maybe @1.10.0, as I see you're using SynapseAI v1.10.0.
And then run your script with
deepspeed --num_gpus 8 --no_local_rank my_script.py args --deepspeed deepspeed_config.json
using for example this DeepSpeed config: https://github.com/huggingface/optimum-habana/blob/main/examples/summarization/ds_flan_t5_z3_config_bf16.json
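Applied to the fine-tuning script you were running earlier in this thread, that would look something like this (a sketch; keep the rest of your training flags as before, and the config filename here is just the linked example saved locally):

deepspeed --num_gpus 8 --no_local_rank \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --bf16 True --train_file merged_final_ultimate_andy.json --task completion \
  --peft lora --habana True --use_habana True --use_lazy_mode True \
  --output_dir ./output/peft_model \
  ... (your other flags) ... \
  --deepspeed ds_flan_t5_z3_config_bf16.json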
Then, if you want to train Llama 2 70B, you’ll probably need more than one Gaudi2 node.
Okay, so I have access to 3 nodes.
I will try it and inform you.
Thanks for your help!
@DaniAtalla Okay, so it could work then, difficult to say without trying it out. You may also want to check out this repo with guidelines for multi-node training: https://github.com/huggingface/optimum-habana/tree/main/examples/multi-node-training
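For the 3 nodes you mentioned, a minimal sketch following that guide (hostnames are placeholders, and it assumes 8 HPUs per node plus password-less SSH between the nodes as described there): create a hostfile like

node-1 slots=8
node-2 slots=8
node-3 slots=8

and launch with

deepspeed --hostfile hostfile --no_local_rank \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  <your training flags> --deepspeed deepspeed_config.json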