By "Strategy" I mean DDP, Tensor Parallel, Model Parallel, Pipeline Parallel, and so on; and more importantly, how to use that strategy with the HF Trainer to increase max_len.
I'm trying to train Phi-2, whose memory footprint is about 1.7 GB. I loaded the model with a 4-bit config and used paged_adamw_8bit with gradient checkpointing. I have 8 A10 GPUs with 24 GB each, but when I try to train the model it fails to reach even a length of 512. I'm using the HuggingFace Trainer. What can be done?
With a single GPU I can run the code below on a batch of 2 at 2048 length, with a peak GPU usage of 19624 MiB, but with multiple GPUs it breaks at 512 length and a batch size of 1. When I try to load the model with device_map="auto", the Trainer throws an error saying it can't train when the model is in 8-bit on a different device. Without device_map="auto", nvidia-smi shows GPU 0 using 22524 MiB while the others sit at around 4384 MiB. I think the model is not being loaded properly. Could someone please help?
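A minimal sketch of the per-process placement I believe DDP wants for a quantized model: one full copy per process, pinned to that process's own GPU, launched with something like torchrun --nproc_per_node=8 train.py. This is an assumption on my part, not what my current code does; model_name and bnb_config are the same as in my code below, and PartialState comes from accelerate:

from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Pin one full (quantized) copy of the model to this process's GPU,
# instead of sharding it across all GPUs with device_map="auto".
device_map = {"": PartialState().process_index}
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map=device_map,
)

Is this the right direction, or is there a proper way to get Tensor/Pipeline Parallel working through the Trainer?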
Here is my code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False,  # needed for now, should be fixed soon
)
tokenizer.pad_token = tokenizer.eos_token

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=bnb_config,
)
model.gradient_checkpointing_enable()  # gradient checkpointing to save memory
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)  # freeze base model layers and cast layernorms in fp32

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"],  # print(model) shows the modules to target
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # LoRA
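As a sanity check on adapter size (r=256 across six target modules is a heavy adapter, and its gradients plus the paged optimizer state sit on every GPU), PEFT's built-in helper reports the trainable-parameter count:

# Report trainable vs. total parameter counts for the LoRA-wrapped model.
model.print_trainable_parameters()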
Here's the training code:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # output directory for checkpoints and predictions
    overwrite_output_dir=True,       # overwrite the contents of the output directory
    per_device_train_batch_size=1,   # batch size for training
    per_device_eval_batch_size=1,    # batch size for evaluation
    gradient_accumulation_steps=1,   # number of steps before optimizing
    gradient_checkpointing=True,     # enable gradient checkpointing
    gradient_checkpointing_kwargs={"use_reentrant": False},
    warmup_steps=10,                 # number of warmup steps
    max_steps=5000,                  # total training steps; overrides num_train_epochs when set
    num_train_epochs=3,              # number of training epochs (ignored while max_steps > 0)
    learning_rate=5e-5,              # learning rate
    weight_decay=0.01,               # weight decay
    optim="paged_adamw_8bit",        # keep the optimizer state paged and quantized
    bf16=True,                       # use mixed-precision training
    # logging and saving
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,              # limit the total number of checkpoints
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,     # load the best model at the end of training
    report_to="wandb",
    neftune_noise_alpha=5,
)
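One thing I'm unsure of: I pass neither a tokenizer nor a data_collator, so the Trainer falls back to default_data_collator, which expects every example to already have the same length. If my padding is the problem, this is the explicit causal-LM collator I would add (an assumption about my data; it pads dynamically and derives labels from input_ids):

from transformers import DataCollatorForLanguageModeling

# Dynamic padding plus labels = input_ids (mlm=False) for causal LM.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

It would then be passed to the Trainer below as data_collator=data_collator.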
trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
)

# Disable cache to prevent a warning; re-enable it for inference.
model.config.use_cache = False
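For diagnosing the imbalance alongside nvidia-smi, this is the kind of per-device check I run (a minimal sketch; torch only sees memory allocated by this process's own CUDA context):

import torch

# Print how much memory this process has allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 2**20
    print(f"cuda:{i}: {mib:.0f} MiB allocated")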