How to ensure that parallelism is used when running Llama2-70B?

I am trying to run fine-tuning on Llama2-70B using this repo,

but I keep running into the error “failed to allocate 448 MB”.



Hi @gildesh, this is a memory allocation error. Could you let me know the following?

  • the batch size you use
  • the world size (number of devices)
  • do you use DeepSpeed?

Please also copy paste your command here.

python3 intelx/workflows/chatbot/utils/gaudi_spawn.py --world_size 2 --use_deepspeed \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --bf16 True --train_file merged_final_ultimate_andy.json --task completion \
  --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 \
  --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 \
  --learning_rate 0.0001 --logging_steps 1 --do_train True --num_train_epochs 2 --log_level info \
  --output_dir ./output/peft_model --peft lora --use_fast_tokenizer false \
  --habana True --use_habana True --use_lazy_mode True --throughput_warmup_steps 3 \
  --deepspeed /cnvrg/optimum-habana/gaudi_config.json

@gildesh Here the error is different: it seems you only have one available device. What’s the output of hl-smi?
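For reference, running hl-smi with no arguments prints a table of all visible HPUs; if I remember correctly it also has an nvidia-smi-style query mode along these lines:

hl-smi
hl-smi -Q index,name,memory.used,memory.total -f csv   # exact field names may vary across SynapseAI versions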

Yes, sorry for the confusion. We were trying a lot of different things! :smiley:
I would have to rerun, but assuming the output is just 1 HPU, how can I solve this?

Llama-70B is a big model so it may not work at all on 1 device (is it Gaudi1 or Gaudi2?).

My recommendations to decrease the memory footprint are the following:

  • Use gradient checkpointing with --gradient_checkpointing. This will likely slow down your run, but with the benefit of a smaller memory consumption.
  • Decrease the size of your batches. You can then compensate with gradient_accumulation_steps if you want to keep the same global batch size as before (see the adjusted command sketched below).
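For example, starting from the command you posted, a possible adjustment would be the following (just a sketch: per-device batch size halved from 2 to 1, gradient_accumulation_steps doubled from 4 to 8 to keep the same global batch size, plus gradient checkpointing):

python3 intelx/workflows/chatbot/utils/gaudi_spawn.py --world_size 2 --use_deepspeed \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --bf16 True --train_file merged_final_ultimate_andy.json --task completion \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 --gradient_checkpointing \
  --evaluation_strategy no --save_strategy steps --save_steps 2000 --save_total_limit 1 \
  --learning_rate 0.0001 --logging_steps 1 --do_train True --num_train_epochs 2 --log_level info \
  --output_dir ./output/peft_model --peft lora --use_fast_tokenizer false \
  --habana True --use_habana True --use_lazy_mode True --throughput_warmup_steps 3 \
  --deepspeed /cnvrg/optimum-habana/gaudi_config.json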

Now, if you can access several devices, DeepSpeed can help you too. Could you show me your DeepSpeed configuration /cnvrg/optimum-habana/gaudi_config.json please?

{
  "steps_per_print": 64,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": false,
    "reduce_scatter": false,
    "contiguous_gradients": true
  }
}

So you’re using ZeRO-1. You could use ZeRO-2 to save more memory: https://github.com/huggingface/optimum-habana/blob/main/tests/configs/deepspeed_zero_2.json
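Concretely, relative to the config you pasted, moving to ZeRO-2 mostly amounts to bumping the stage in the zero_optimization section (the linked file is the reference; this is just the relevant snippet):

{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false,
    "reduce_scatter": false,
    "contiguous_gradients": true
  }
}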

And maybe even ZeRO-3 for even larger gains: https://github.com/huggingface/optimum-habana/blob/main/examples/summarization/ds_flan_t5_z3_config_bf16.json
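A ZeRO-3 zero_optimization block typically looks more like the sketch below; the exact values used in the linked optimum-habana config are the ones to follow:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}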

But you will save memory with DeepSpeed only if you use several devices (as gradients, model parameters and optimizer states are spread across devices).

Hi, I was wondering what comes after the ZeRO-3 stage, and how do you recommend I use it?
We have access to multiple devices and they are all running out of memory. I would love some guidance on this.

Hi @DaniAtalla! To use DeepSpeed ZeRO-3, first install DeepSpeed with

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0

or maybe @1.10.0 as I see you’re using SynapseAI v1.10.0
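You can quickly check which build ended up installed with something like:

pip show deepspeed
python3 -c "import deepspeed; print(deepspeed.__version__)"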

And then run your script with

deepspeed --num_gpus 8 --no_local_rank my_script.py args --deepspeed deepspeed_config.json

using for example this DeepSpeed config: https://github.com/huggingface/optimum-habana/blob/main/examples/summarization/ds_flan_t5_z3_config_bf16.json
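Putting it together with the fine-tuning script discussed above (I am reusing gildesh's script path, data file and a subset of the training arguments purely as placeholders, and assuming the ZeRO-3 json has been downloaded locally), the launch would look roughly like:

deepspeed --num_gpus 8 --no_local_rank \
  intelx/workflows/chatbot/fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
  --bf16 True --train_file merged_final_ultimate_andy.json --task completion \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing \
  --do_train True --num_train_epochs 2 --output_dir ./output/peft_model \
  --habana True --use_habana True --use_lazy_mode True \
  --deepspeed ds_flan_t5_z3_config_bf16.json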

Then, if you want to train Llama 2 70B, you’ll probably need more than one Gaudi2 node.
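As a very rough back-of-the-envelope check (assuming full fine-tuning with an Adam-style optimizer in bf16, i.e. about 16 bytes per parameter for weights + gradients + optimizer states, and 96 GB of HBM per Gaudi2 card; LoRA changes this picture a lot since only the adapters get gradients and optimizer states):

70e9 params x 16 bytes  ≈ 1.12 TB of training state
1 node  = 8  x 96 GB    ≈ 0.77 TB  -> too small even with full ZeRO-3 sharding
3 nodes = 24 x 96 GB    ≈ 2.30 TB  -> enough for the sharded model state; activations still
                                      benefit from gradient checkpointing and small micro-batches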

Okay, so I have access to 3 nodes.
I will try it and inform you.
Thanks for your help!

@DaniAtalla Okay, so it could work then, difficult to say without trying it out. You may also want to check out this repo with guidelines for multi-node training: https://github.com/huggingface/optimum-habana/tree/main/examples/multi-node-training
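As a minimal sketch of the multi-node case (host names below are placeholders, and the linked examples cover the SSH/network setup you also need between the nodes): the DeepSpeed launcher reads a hostfile listing each node and how many devices it exposes, e.g.

# hostfile
node1 slots=8
node2 slots=8
node3 slots=8

and you would then replace --num_gpus 8 with --hostfile /path/to/hostfile in the deepspeed command above.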