Goal
Continue pretraining of the meta-llama/Llama-2-7b-hf transformer on custom text data.
Software Approach
- datasets 2.13 to load the data
- Trainer from transformers 4.32.0.dev0 for training
- deepspeed 0.10.0 for multi-GPU training
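For context, this is roughly how I wire these pieces together. The snippet is only a minimal sketch, not my actual script (the real code follows the CLM example mentioned under Setup, and the auth-token handling for the gated checkpoint is omitted):

# Minimal sketch of the stack: datasets for loading, Trainer for the loop,
# DeepSpeed via the Transformers integration. Paths mirror the values used in
# the SLURM script further down; this is not the exact training script.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "./input/health_information_systems_epub.md"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"]), batched=True,
                    remove_columns=["text"])

args = TrainingArguments(
    output_dir="./trained/7B",
    per_device_train_batch_size=1,
    num_train_epochs=3,
    deepspeed="./ds_configs/stage2_offload.json",  # hands the ZeRO config to the HF integration
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # labels = input_ids
)
trainer.train()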
Hardware Details
- 1 Machine
- either 4x Nvidia V100 (32GB)
- or 8x Nvidia RTX 2080 Ti (11GB)
Problem
- Code exits in ZeRO Stage 2 with an OOM at 32GB per GPU
- Code exits in ZeRO Stage 3 with an OOM at 32GB per GPU
- Code exits in ZeRO Stage 2 + CPU offload with an OOM if CPU memory is below 178GB
- Code exits in ZeRO Stage 3 + CPU offload with an OOM if CPU memory is below 178GB
- Code eventually exits in Stage 2/3 + CPU offload because the loss scale is reduced to 1 (CPU RAM set to 256GB)
- Code exits in any stage with an OOM if train_batch_size > 1
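My rough understanding of why the memory numbers are so large, using the usual 2/2/12 bytes-per-parameter breakdown for mixed-precision Adam (fp16 weights, fp16 gradients, fp32 master weights plus the two Adam moments). I would appreciate confirmation that this back-of-the-envelope estimate is the right way to think about it:

# Rough model-state memory estimate for LLaMA-2 7B under ZeRO with Adam.
# Activations, fragmentation and communication buffers come on top of this.
params = 7e9

fp16_weights = 2 * params             # ~14 GB
fp16_grads   = 2 * params             # ~14 GB
optim_states = 12 * params            # ~84 GB (fp32 copy + Adam momentum/variance)

total = fp16_weights + fp16_grads + optim_states
print(f"total model states: {total / 2**30:.0f} GiB")   # ~104 GiB

# ZeRO-2 without offload: each of N GPUs still holds the full fp16 weights,
# plus 1/N of the gradients and optimizer states.
for n in (3, 4):
    per_gpu = fp16_weights + (fp16_grads + optim_states) / n
    print(f"ZeRO-2, {n} GPUs: ~{per_gpu / 2**30:.0f} GiB per GPU before activations")

# With CPU offload the ~84 GB of optimizer states (plus pinned buffers and the
# host-side copies used during the update, and each rank materializing the model
# in RAM at startup) plausibly account for the >170 GB the SLURM job seems to need.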
Questions
Why is every GPU fully used up when using Stage 2/3 without CPU offload, ending in an OOM?
Why do I need > 172GB of memory allocated to my SLURM process to use Stage 2/3 + CPU offloading?
Why does the loss scale rapidly decrease from 65536 to 1 and eventually stop the run, and how can I prevent this?
Why am I forced to use a micro train batch size of 1 in any configuration to avoid an OOM?
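To sanity-check the memory questions above I intend to run DeepSpeed's memory estimators on a CPU-only node before launching. A sketch of that check (the import paths are the ones documented for the Transformers/DeepSpeed integration; loading the gated checkpoint still requires the usual auth token):

# Sketch: estimate ZeRO-2 / ZeRO-3 model-state memory needs for the 7B model
# without starting a training run. Runs on CPU, but needs enough host RAM to
# instantiate the model once.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# One node with 3 GPUs, matching the SLURM script below.
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=3, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=3, num_nodes=1)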
Setup
The training code exactly follows the CLM training example from the Transformers GitHub repository.
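Since block_size is set to 4096 below, the relevant part of that example is the grouping step that packs the tokenized corpus into full-length blocks. A condensed sketch of that logic (paraphrased from the example, not copied verbatim):

# Condensed version of the group_texts step from the CLM example: the tokenized
# corpus is concatenated and cut into blocks of block_size tokens, and labels
# are a copy of input_ids. With block_size=4096 every training sample is a full
# 4096-token sequence, which makes activation memory per micro-batch large.
block_size = 4096

def group_texts(examples):
    # Concatenate all sequences in the batch, then split into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Applied to the tokenized dataset with datasets.Dataset.map(..., batched=True).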
SLURM script used to start the job (some information is redacted):
#!/bin/bash
#SBATCH --job-name=llama2-7b-3gpu # name
#SBATCH --nodes=1 # nodes
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=8
#SBATCH --partition=clara
#SBATCH --mem=256G
#SBATCH --gres=gpu:v100:3 # number of gpus
#SBATCH --output=logs/%x-%j.out # output file name
#SBATCH --mail-type=ALL
module load Python
module load PyTorch
source .env/bin/activate
srun pip install git+https://github.com/huggingface/transformers
srun pip install typing-extensions
srun pip install datasets
srun pip install torch
srun pip install accelerate
srun pip install pytest
srun pip install scikit-learn
srun pip install evaluate
srun pip install sentencepiece
srun pip install deepspeed
# Torch Distributed Run Variables
export GPUS_PER_NODE=3
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9907
export NCCL_DEBUG=INFO
export NCCL_IGNORE_DISABLED_P2P=1
# Model Arguments
MODEL_NAME=meta-llama/Llama-2-7b-hf
CACHE_DIR=./cache
USE_FAST_TOKENIZER=false
MODEL_REVISION=main
USE_AUTH_TOKEN=true
HUGGING_TOKEN=<REDACTED>
TORCH_DTYPE=auto
LOW_CPU_MEM_USAGE=false
# DataTraining Arguments
TRAIN_FILE=./input/health_information_systems_epub.md
MAX_TRAIN_SAMPLES=None
OVERWRITE_CACHE=false
VALIDATION_SPLIT_PERCENTAGE=5
PREPROCESSING_NUM_WORKERS=1
KEEP_LINEBREAKS=true
# Training Arguments
OUTPUT_DIR=./trained/7B
OVERWRITE_OUTPUT_DIR=true
DO_TRAIN=true
DO_EVAL=false
PER_DEVICE_TRAIN_BATCH_SIZE=1
PER_DEVICE_EVAL_BATCH_SIZE=1
BLOCK_SIZE=4096
EVALUATION_STRATEGY=steps
EVAL_STEPS=100
LEARNING_RATE=3e-4
WEIGHT_DECAY=0.1
ADAM_BETA1=0.9
ADAM_BETA2=0.95
ADAM_EPSILON=1e-5
MAX_GRAD_NORM=1.0
NUM_TRAIN_EPOCHS=3
LR_SCHEDULER_TYPE=cosine
WARMUP_STEPS=0
LOG_LEVEL=passive
SAVE_STRATEGY=steps
SAVE_STEPS=500
SAVE_TOTAL_LIMIT=1
NO_CUDA=false
SEED=42
FP16=false
BF16=false
HALF_PRECISION_BACKEND=auto
DDP_BACKEND=nccl
DEEPSPEED=./ds_configs/stage2_offload.json
OPTIM=adamw_torch
echo "srun --jobid $SLURM_JOBID bash -c \"NCCL_DEBUG=INFO deepspeed "
echo "--num_gpus=$GPUS_PER_NODE "
echo "03_train_llama2.py "
echo "--model_name $MODEL_NAME "
echo "--cache_dir $CACHE_DIR "
echo "--use_fast_tokenizer $USE_FAST_TOKENIZER "
echo "--model_revision $MODEL_REVISION "
echo "--use_auth_token $USE_AUTH_TOKEN "
echo "--hugging_token $HUGGING_TOKEN "
echo "--torch_dtype $TORCH_DTYPE "
echo "--low_cpu_mem_usage $LOW_CPU_MEM_USAGE "
echo "--train_file $TRAIN_FILE "
echo "--max_train_samples $MAX_TRAIN_SAMPLES "
echo "--overwrite_cache $OVERWRITE_CACHE "
echo "--validation_split_percentage $VALIDATION_SPLIT_PERCENTAGE "
echo "--preprocessing_num_workers $PREPROCESSING_NUM_WORKERS "
echo "--keep_linebreaks $KEEP_LINEBREAKS "
echo "--output_dir $OUTPUT_DIR "
echo "--overwrite_output_dir $OVERWRITE_OUTPUT_DIR "
echo "--do_train $DO_TRAIN "
echo "--do_eval $DO_EVAL "
echo "--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE "
echo "--per_device_eval_batch_size $PER_DEVICE_EVAL_BATCH_SIZE "
echo "--block_size $BLOCK_SIZE "
echo "--evaluation_strategy $EVALUATION_STRATEGY "
echo "--eval_steps $EVAL_STEPS "
echo "--learning_rate $LEARNING_RATE "
echo "--weight_decay $WEIGHT_DECAY "
echo "--adam_beta1 $ADAM_BETA1 "
echo "--adam_beta2 $ADAM_BETA2 "
echo "--adam_epsilon $ADAM_EPSILON "
echo "--max_grad_norm $MAX_GRAD_NORM "
echo "--num_train_epochs $NUM_TRAIN_EPOCHS "
echo "--lr_scheduler_type $LR_SCHEDULER_TYPE "
echo "--warmup_steps $WARMUP_STEPS "
echo "--log_level $LOG_LEVEL "
echo "--save_strategy $SAVE_STRATEGY "
echo "--save_steps $SAVE_STEPS "
echo "--save_total_limit $SAVE_TOTAL_LIMIT "
echo "--no_cuda $NO_CUDA "
echo "--seed $SEED "
echo "--fp16 $FP16 "
echo "--bf16 $BF16 "
echo "--half_precision_backend $HALF_PRECISION_BACKEND "
echo "--local_rank $SLURM_PROCID "
echo "--ddp_backend $DDP_BACKEND "
echo "--deepspeed $DEEPSPEED "
echo "--optim $OPTIM\""
srun --jobid $SLURM_JOBID bash -c "NCCL_DEBUG=INFO deepspeed \
--num_gpus=$GPUS_PER_NODE \
03_train_llama2.py \
--model_name $MODEL_NAME \
--cache_dir $CACHE_DIR \
--use_fast_tokenizer $USE_FAST_TOKENIZER \
--model_revision $MODEL_REVISION \
--use_auth_token $USE_AUTH_TOKEN \
--hugging_token $HUGGING_TOKEN \
--torch_dtype $TORCH_DTYPE \
--low_cpu_mem_usage $LOW_CPU_MEM_USAGE \
--train_file $TRAIN_FILE \
--max_train_samples $MAX_TRAIN_SAMPLES \
--overwrite_cache $OVERWRITE_CACHE \
--validation_split_percentage $VALIDATION_SPLIT_PERCENTAGE \
--preprocessing_num_workers $PREPROCESSING_NUM_WORKERS \
--keep_linebreaks $KEEP_LINEBREAKS \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir $OVERWRITE_OUTPUT_DIR \
--do_train $DO_TRAIN \
--do_eval $DO_EVAL \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_EVAL_BATCH_SIZE \
--block_size $BLOCK_SIZE \
--evaluation_strategy $EVALUATION_STRATEGY \
--eval_steps $EVAL_STEPS \
--learning_rate $LEARNING_RATE \
--weight_decay $WEIGHT_DECAY \
--adam_beta1 $ADAM_BETA1 \
--adam_beta2 $ADAM_BETA2 \
--adam_epsilon $ADAM_EPSILON \
--max_grad_norm $MAX_GRAD_NORM \
--num_train_epochs $NUM_TRAIN_EPOCHS \
--lr_scheduler_type $LR_SCHEDULER_TYPE \
--warmup_steps $WARMUP_STEPS \
--log_level $LOG_LEVEL \
--save_strategy $SAVE_STRATEGY \
--save_steps $SAVE_STEPS \
--save_total_limit $SAVE_TOTAL_LIMIT \
--no_cuda $NO_CUDA \
--seed $SEED \
--fp16 $FP16 \
--bf16 $BF16 \
--half_precision_backend $HALF_PRECISION_BACKEND \
--local_rank $SLURM_PROCID \
--ddp_backend $DDP_BACKEND \
--optim $OPTIM \
--deepspeed $DEEPSPEED "
DeepSpeed Configuration:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"allgather_bucket_size": 2e8,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"steps_per_print": 500,
"wall_clock_breakdown": false,
"train_micro_batch_size_per_gpu": "auto"
}
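My understanding of the loss-scale behaviour, based on the fp16 block above: loss_scale = 0 means dynamic scaling starting at 2**initial_scale_power (65536), halving on every overflowing step once the hysteresis budget is used up, down to min_loss_scale = 1. The toy sketch below is a simplified model of that behaviour, not DeepSpeed's real loss_scaler.py; what I still don't understand is why every step overflows in the first place:

# Toy model of dynamic loss scaling as configured above. If every step produces
# inf/nan gradients, the scale drops from 65536 to 1 in roughly 17 steps and the
# next overflow aborts the run, which matches the log excerpt further down.
scale = 2 ** 16          # initial_scale_power = 16
hysteresis = 2
min_loss_scale = 1

def step(overflowed: bool):
    global scale, hysteresis
    if overflowed:
        if hysteresis > 1:
            hysteresis -= 1                       # first overflow only burns hysteresis
        elif scale > min_loss_scale:
            scale = max(scale / 2, min_loss_scale)
        else:
            raise RuntimeError("Current loss scale already at minimum")
    # (the real scaler also raises the scale again after loss_scale_window clean steps)

for i in range(20):
    step(overflowed=True)                         # pretend every step overflows
    print(i, scale)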
Logs
DeepSpeed ds_report output (some output redacted):
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1
deepspeed install path ........... ['/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7
Error Logs for Loss Scale:
[2023-07-22 15:59:32,657] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-07-22 15:59:37,102] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2023-07-22 15:59:41,552] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-07-22 15:59:45,994] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
[2023-07-22 15:59:50,438] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
[2023-07-22 16:00:44,332] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2023-07-22 16:01:00,722] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
0%| | 0/102 [00:00<?, ?it/s]
1%| | 1/102 [00:05<09:58, 5.93s/it]
2%|▏ | 2/102 [00:10<08:25, 5.06s/it]
...
47%|█████ | 48/102 [07:31<04:15, 4.74s/it]Traceback (most recent call last):
File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 444, in <module>
main()
File "/home/<REDACTED>LLaMA_Training/03_train_llama2.py", line 410, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
return inner_training_loop(
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1805, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2659, in training_step
self.accelerator.backward(loss)
File "/home/<REDACTED>LLaMA_Training/.env/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 176, in backward
self.engine.step()
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2087, in step
self._take_model_step(lr_kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1994, in _take_model_step
self.optimizer.step()
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1662, in step
self._update_scale(self.overflow)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1908, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Error Logs for Stage 2/3 without CPU Offload OOM:
0%| | 0/51 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 444, in <module>
main()
File "/home/<REDACTED>/LLaMA_Training/03_train_llama2.py", line 410, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
return inner_training_loop(
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 1805, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2648, in training_step
loss = self.compute_loss(model, inputs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/trainer.py", line 2673, in compute_loss
outputs = model(**inputs)
File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1769, in forward
loss = self.module(*inputs, **kwargs)
File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
outputs = self.model(
File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 698, in forward
layer_outputs = decoder_layer(
File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 322, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/home/<REDACTED>/LLaMA_Training/.env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 186, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 31.75 GiB total capacity; 30.62 GiB already allocated; 15.69 MiB free; 30.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any ideas, hints, approaches, or Colab notebooks you can provide are much appreciated!
I am working on this in the context of my master's thesis, and whether this project succeeds more or less decides whether I get my degree.