CUDA out of memory during evaluation using two V100s (NC12)

I have moved my project from Google Colab to an Azure VM with two V100s (NC12_v3), but despite trying many different settings, I always get the error below during evaluation:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.46 GiB. GPU 0 has a total capacty of 15.87 GiB of which 3.54 GiB is free. Of the allocated memory 7.90 GiB is allocated by PyTorch, and 4.06 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The program was able to train in Colab and in the new Windows environment, but in both environments it runs out of memory during evaluation. I was hoping that having two GPUs would fix this issue. Note that the program does not fail immediately; it gets through a small portion of the validation split, as shown below, before throwing the error:

...
6%|▌         | 49/861 [15:49<4:21:17, 19.31s/it]{'loss': 2.3042, 'learning_rate': 0.00019998151785951448, 'epoch': 0.06}
  6%|▌         | 50/861 [16:08<4:21:03, 19.31s/it]{'loss': 2.2999, 'learning_rate': 0.00019997338607843075, 'epoch': 0.06}
  0%|          | 0/766 [00:00<?, ?it/s]
  0%|          | 2/766 [00:00<05:04,  2.51it/s]
  0%|          | 3/766 [00:01<07:15,  1.75it/s]
  1%|          | 4/766 [00:02<08:23,  1.51it/s]
  1%|          | 5/766 [00:03<09:03,  1.40it/s]
  1%|          | 6/766 [00:04<09:54,  1.28it/s]
  1%|          | 7/766 [00:04<10:05,  1.25it/s]
  1%|          | 8/766 [00:05<10:52,  1.16it/s]
  1%|          | 9/766 [00:06<10:58,  1.15it/s]
  1%|▏         | 10/766 [00:07<11:46,  1.07it/s]
...

Here is what I have tried:

  • I tried setting PYTORCH_CUDA_ALLOC_CONF (max_split_size_mb) to 128, 64, and 32, but it actually increases the memory usage on cuda:0.
  • The tutorial I initially followed used bnb_4bit_compute_dtype=torch.bfloat16, but I read that the GPUs I am using only support torch.float16; either way, both give the same issue.
  • I tried lowering the LoraConfig r and lora_alpha values, but this also does not change anything.
  • As you can see, my batch sizes are already at 1, and I have tried increasing gradient_accumulation_steps to 8, but that made no difference.
  • I have eval_steps set to 50 for testing purposes, since training runs fine right up until the evaluation step.
  • I have tried optim values of paged_adamw_8bit and adafactor, as well as omitting the option entirely to use the default.
  • The training inputs are CSV files, where the incomplete column contains incomplete JSON strings and the complete column contains the completed versions of those JSON strings. I kept the completed JSON strings at ~2,000 tokens to account for falcon-7b's sequence length, and I tried lowering them to ~1,500 tokens (measured roughly as in the sketch just after this list), but I am still getting the error. I could lower them further, but I would really rather not.
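For reference, I measured the token lengths roughly like this (a sketch; train_files is the same list of CSV files used in the code below):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b')
length_check_ds = load_dataset('csv', data_files={'train': train_files})['train']
lengths = [len(tokenizer(row['complete'])['input_ids']) for row in length_check_ds]
print('max:', max(lengths), 'mean:', sum(lengths) / len(lengths))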

Here is the code:

# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
# pip install transformers==4.34.0
# pip install peft
# pip install accelerate
# pip install datasets
# pip install loralib
# pip install einops
# pip install sacrebleu
# pip install rouge
# pip install scipy

# project-specific helpers (these provide hugging_face_cache_dir, trainer_output_dir, get_datetime_str, etc.)
from help import *
from props import *
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# I tried setting PYTORCH_CUDA_ALLOC_CONF (max_split_size_mb) to 128, 64, and 32,
# but it actually increased memory usage on cuda:0
# os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
import torch
import transformers
from datasets import load_dataset
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    EvalPrediction,
    Trainer,
    TrainingArguments
)
import sacrebleu
from rouge import Rouge
import numpy as np

torch.cuda.empty_cache()

base_model = 'tiiuae/falcon-7b'

bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_use_double_quant=True,
	bnb_4bit_quant_type="nf4",
	bnb_4bit_compute_dtype=torch.float16  # I tried bfloat16 as well, but NC12_v3 supports float16 better, right?
)
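
# Regarding the compute dtype question above: torch.cuda.get_device_capability(0) returns (7, 0)
# on a V100, and as far as I know native bfloat16 needs compute capability 8.0+ (Ampere),
# so float16 seems to be the right choice here.
# print(torch.cuda.get_device_capability(0))  # -> (7, 0)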

model = AutoModelForCausalLM.from_pretrained(
	base_model,
	device_map="auto",
	trust_remote_code=True,
	quantization_config=bnb_config,
	cache_dir=hugging_face_cache_dir
)

tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=hugging_face_cache_dir)

tokenizer.pad_token = tokenizer.eos_token

model.gradient_checkpointing_enable()

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
	r=16,  # I have tried lowering this down to 4
	lora_alpha=32,  # I have tried lowering this down to 8
	target_modules=["query_key_value"],
	lora_dropout=0.05,
	bias="none",
	task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
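# model.print_trainable_parameters()  # optional sanity check: the LoRA adapter is only a tiny fraction of the 7B weights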

train_files = [...]  # we get the list of training CSV files...

validation_files = [...]  # we get the list of validation CSV files...

dataset = load_dataset('csv',
					   data_files={"train": train_files, "validation": validation_files},
					   cache_dir=hugging_face_cache_dir)
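
# Note: padding='max_length' with no explicit max_length pads (and truncates) every example
# to tokenizer.model_max_length, regardless of how long the raw JSON strings actually are.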

def tokenize_function(examples):
	tokenized_inputs = tokenizer(examples['incomplete'], padding='max_length', truncation=True)
	tokenized_labels = tokenizer(examples['complete'], padding='max_length', truncation=True)
	return {
		'input_ids': tokenized_inputs['input_ids'],
		'attention_mask': tokenized_inputs['attention_mask'],
		'labels': tokenized_labels['input_ids']
	}

tokenized_dataset = dataset.map(tokenize_function, batched=True)

session_trainer_output_dir = os.path.join(trainer_output_dir, get_datetime_str())
training_args = TrainingArguments(
	per_device_train_batch_size=1,
	per_device_eval_batch_size=1,
	evaluation_strategy="steps",
	eval_steps=50,  # I am running out of memory at eval stage, so have lowered to 50 for testing purposes
	gradient_accumulation_steps=4,
	num_train_epochs=1,
	learning_rate=2e-4,
	fp16=True,
	save_total_limit=3,
	logging_steps=1,
	output_dir=session_trainer_output_dir,
	optim='paged_adamw_8bit',  # I have also tried adafactor
	lr_scheduler_type="cosine",
	warmup_ratio=0.05,
	remove_unused_columns=True
)

rouge_evaluator = Rouge()

def compute_metrics(eval_pred: EvalPrediction):
	predictions, labels = eval_pred
	decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
	labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
	decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
	bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels]).score
	scores = rouge_evaluator.get_scores(decoded_preds, decoded_labels, avg=True)
	return {"bleu": bleu, "rouge-l": scores["rouge-l"]["f"]}

trainer = Trainer(
	model=model,
	args=training_args,
	train_dataset=tokenized_dataset['train'],
	eval_dataset=tokenized_dataset['validation'],
	compute_metrics=compute_metrics,
	data_collator=transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

model.config.use_cache = False
trainer.train()

Here is some additional info in case it is helpful: I initially downloaded CUDA 12.3 but then downgraded to 12.1. It still says CUDA Version 12.3 here, but when I check the version from the Python training code it detects 12.1, so I doubt this is the issue.
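The Python-side check is essentially just:

import torch
print(torch.version.cuda)  # prints 12.1 here, matching the cu121 wheel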

As shown below, this is where the GPU usage stays during training; it only spikes up, as described above, during evaluation:
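Roughly the same per-GPU numbers can also be read from PyTorch directly (a small sketch, separate from the training script):

import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved")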

I am really hoping the answer is not to just bump up to NC24.
Thanks a lot for any advice.

A few realizations and follow-up questions:

  1. Even though I tried reducing the token lengths of my inputs from ~2,000 to ~1,500, I realize that having padding='max_length' should make this have no effect, right?
  2. The evaluation phase is only running on cuda:0, as the screenshots indicate. Is there a way to force evaluation to be allocated across both GPUs? I am going to try removing the line os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", as I have read that it may be overriding the default settings in a negative way (see the snippet just below for how I am checking where the layers are placed).
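My understanding is that hf_device_map is populated when the model is loaded with device_map="auto", so I am checking the placement like this:

# run after AutoModelForCausalLM.from_pretrained(..., device_map="auto")
print(model.hf_device_map)                # maps each module to a device
print(set(model.hf_device_map.values()))  # should include both 0 and 1 if the model is split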

Changing padding from max_length to True did lower the baseline memory utilization during training, which gives more room for the evaluation stage, but I am still running out of memory during evaluation.
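This is what that change looks like inside tokenize_function (everything else is unchanged):

def tokenize_function(examples):
    # padding=True pads only to the longest example in each map batch,
    # instead of padding everything out to tokenizer.model_max_length
    tokenized_inputs = tokenizer(examples['incomplete'], padding=True, truncation=True)
    tokenized_labels = tokenizer(examples['complete'], padding=True, truncation=True)
    return {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': tokenized_labels['input_ids']
    }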

But it is only using one GPU at the evaluation stage! Is there a way to get the evaluation to be allocated across both GPUs?