I have moved my project from Google Colab to an Azure VM with two V100s (NC12_v3), but despite trying many different settings, I always get the error below during evaluation:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.46 GiB. GPU 0 has a total capacty of 15.87 GiB of which 3.54 GiB is free. Of the allocated memory 7.90 GiB is allocated by PyTorch, and 4.06 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The program is able to train both in Colab and in the new Windows environment, but in both environments it runs out of memory during evaluation; I was hoping that having two GPUs would fix this. Note that the program does not fail immediately: it gets through a small portion of the validation split, as shown here, before throwing the error:
...
6%|▌ | 49/861 [15:49<4:21:17, 19.31s/it]{'loss': 2.3042, 'learning_rate': 0.00019998151785951448, 'epoch': 0.06}
6%|▌ | 50/861 [16:08<4:21:03, 19.31s/it]{'loss': 2.2999, 'learning_rate': 0.00019997338607843075, 'epoch': 0.06}
0%| | 0/766 [00:00<?, ?it/s]
0%| | 2/766 [00:00<05:04, 2.51it/s]
0%| | 3/766 [00:01<07:15, 1.75it/s]
1%| | 4/766 [00:02<08:23, 1.51it/s]
1%| | 5/766 [00:03<09:03, 1.40it/s]
1%| | 6/766 [00:04<09:54, 1.28it/s]
1%| | 7/766 [00:04<10:05, 1.25it/s]
1%| | 8/766 [00:05<10:52, 1.16it/s]
1%| | 9/766 [00:06<10:58, 1.15it/s]
1%|▏ | 10/766 [00:07<11:46, 1.07it/s]
...
Here is what I have tried:
- I tried setting `PYTORCH_CUDA_ALLOC_CONF` (max_split_size_mb) to 128, 64, and 32, but it actually increases memory usage on `cuda:0`.
- The tutorial I initially followed used `bnb_4bit_compute_dtype=torch.bfloat16`, but I read that the GPUs I am using only support `torch.float16`; both give the same issue.
- I tried lowering the `LoraConfig` `r` and `lora_alpha` values, but this also does not change anything.
- As you can see, my batch sizes are already at 1, and I have tried increasing `gradient_accumulation_steps` to 8, but no change.
- I have `eval_steps` set to 50 for testing purposes, since training runs fine right up until evaluation starts, and the low value lets me hit the error quickly.
- I have tried an `optim` value of `paged_adamw_8bit`, of `adafactor`, and also omitting the option to use the default value.
- The training inputs are CSV files, where the `incomplete` column contains incomplete JSON strings and the `complete` column contains the completed versions of those JSON strings. I had capped the completed JSON strings at ~2,000 tokens to account for falcon-7b's sequence length, and I tried lowering the values in the training data to ~1,500 tokens (measured roughly as in the sketch after this list), but I am still getting the error. I could lower it further, but I would really rather not.
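For reference, this is roughly how I measure the token lengths of the `complete` column; the file name is just a placeholder and this snippet is not part of the training script:
import pandas as pd
from transformers import AutoTokenizer
# same tokenizer as in the training script, used here only to count tokens
length_check_tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b')
df = pd.read_csv('one_of_the_training_files.csv')  # placeholder file name
token_lengths = [len(length_check_tokenizer(text)['input_ids']) for text in df['complete']]
print(max(token_lengths), sum(token_lengths) / len(token_lengths))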
Here is the code:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
# pip install transformers==4.34.0
# pip install peft
# pip install accelerate
# pip install datasets
# pip install loralib
# pip install einops
# pip install sacrebleu
# pip install rouge
# pip install scipy
# my own modules; between them they provide hugging_face_cache_dir, trainer_output_dir, and get_datetime_str
from help import *
from props import *
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# I tried setting PYTORCH_CUDA_ALLOC_CONF to 128, 64, and 32, but it actually increases memory usage on cuda:0
# os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
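# For reference, these are the exact variants I tried (one at a time), all commented out now:
# os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:64'
# os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'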
import torch
import transformers
from datasets import load_dataset
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training
)
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
EvalPrediction,
Trainer,
TrainingArguments
)
import sacrebleu
from rouge import Rouge
import numpy as np
torch.cuda.empty_cache()
base_model = 'tiiuae/falcon-7b'
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16 # I tried bfloat16 as well, but NC12_v3 supports float16 better, right?
)
model = AutoModelForCausalLM.from_pretrained(
base_model,
device_map="auto",
trust_remote_code=True,
quantization_config=bnb_config,
cache_dir=hugging_face_cache_dir
)
tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=hugging_face_cache_dir)
tokenizer.pad_token = tokenizer.eos_token
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=16, # I have tried lowering this down to 4
lora_alpha=32, # I have tried lowering this down to 8
target_modules=["query_key_value"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
train_files = # we get the list of CSV files...
validation_files = # we get the list of CSV files...
dataset = load_dataset('csv',
data_files={"train": train_files, "validation": validation_files},
cache_dir=hugging_face_cache_dir)
def tokenize_function(examples):
tokenized_inputs = tokenizer(examples['incomplete'], padding='max_length', truncation=True)
tokenized_labels = tokenizer(examples['complete'], padding='max_length', truncation=True)
return {
'input_ids': tokenized_inputs['input_ids'],
'attention_mask': tokenized_inputs['attention_mask'],
'labels': tokenized_labels['input_ids']
}
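# Note on tokenize_function above: with padding='max_length' and no explicit max_length,
# the tokenizer pads (and truncation=True truncates) to tokenizer.model_max_length.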
tokenized_dataset = dataset.map(tokenize_function, batched=True)
session_trainer_output_dir = os.path.join(trainer_output_dir, get_datetime_str())
training_args = TrainingArguments(
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
evaluation_strategy="steps",
eval_steps=50, # I am running out of memory at eval stage, so have lowered to 50 for testing purposes
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=2e-4,
fp16=True,
save_total_limit=3,
logging_steps=1,
output_dir=session_trainer_output_dir,
optim='paged_adamw_8bit', # I have also tried adafactor
lr_scheduler_type="cosine",
warmup_ratio=0.05,
remove_unused_columns=True
)
rouge_evaluator = Rouge()
def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred
    # predictions are the raw logits, so take the argmax over the vocab dimension
    # to get token ids before decoding
    predictions = np.argmax(predictions, axis=-1)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # replace the -100 ignore-index positions in the labels before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels]).score
    scores = rouge_evaluator.get_scores(decoded_preds, decoded_labels, avg=True)
    return {"bleu": bleu, "rouge-l": scores["rouge-l"]["f"]}
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset['train'],
eval_dataset=tokenized_dataset['validation'],
compute_metrics=compute_metrics,
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()
Here is the GPU/driver readout in case it is helpful (initially I had installed CUDA 12.3 and then downgraded to 12.1; it still says CUDA Version 12.3 here, though when I check the version from the Python training code it detects 12.1, so I doubt this is the issue):
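For reference, the version check from the Python side that I mention is roughly just this:
import torch
print(torch.version.cuda)  # reports 12.1 in my environment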
Below is where the GPU usage stays during training; it only spikes up, as described above, during evaluation:
I am really hoping the answer is not to just bump up to NC24.
Thanks a lot for any advice.