Out of memory training a 3B-param model on 8 GPUs (320 GB total) with FSDP

I’m using transformers==4.29.0, and running the train script with:

torchrun --nproc_per_node=8 train-script.py

I am unable to train a 3B parameter model on a p4d.24xlarge due to out-of-memory errors. This instance has 8 GPUs with 40 GB each, for a total of 320 GB of GPU memory. Is this normal? All 8 GPUs fill up roughly uniformly until they breach the memory limit. What can I change in my configuration? Note that I am loading the model as shown below, without any explicit GPU placement (such as device_map='auto').
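
For context, my rough back-of-the-envelope estimate, assuming AdamW keeps fp32 master weights, fp32 gradients, and two fp32 optimizer moments (roughly 16 bytes per parameter): 3B params × 16 bytes ≈ 48 GB of state in total, i.e. about 6 GB per GPU once fully sharded across 8 GPUs. So if full sharding were actually happening, parameter and optimizer state alone should fit comfortably, and the rest of the 40 GB per GPU would be activations.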

The torchrun command launches 8 Python processes. All 8 of them start by loading my pandas dataframe, and each one tokenizes the same data separately, which makes me worried that the model is also being loaded 8 times. The tokenization stage certainly uses far more memory than I would consider reasonable for a tokenizer (see the sketch after the tokenization step below for how I could restrict it to one process). Is torchrun supposed to automagically handle the loading of the model so that it is only loaded once and sharded properly?

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# df is my pandas dataframe, loaded earlier; ckpt is the model checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

dataset = Dataset.from_pandas(df)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
# ... tokenize dataset ...
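
To avoid every rank tokenizing the same data (the concern above), I'm thinking of something like the sketch below. training_args is the TrainingArguments object defined just after this; tokenize_fn, the "text" column, and max_length are placeholders for my actual preprocessing. main_process_first() lets rank 0 tokenize first and write the datasets cache, and the other ranks then reuse it:

def tokenize_fn(examples):
    # placeholder for my real preprocessing; "text" is the column in my dataframe
    return tokenizer(examples["text"], truncation=True, max_length=512)

# rank 0 tokenizes first and writes the datasets cache; the other 7 ranks
# then load the cached result instead of redoing the work
with training_args.main_process_first(desc="dataset tokenization"):
    tokenized_dataset = dataset.map(
        tokenize_fn,
        batched=True,
        remove_columns=dataset.column_names,
    )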

training_args = TrainingArguments(
    output_dir="./results", 
    bf16=True,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8, # alpaca
    evaluation_strategy="no", # alpaca
    save_steps=10_000,
    save_total_limit=1, # alpaca
    learning_rate=2e-5, # alpaca
    weight_decay=0.0, # alpaca
    warmup_ratio=0.03, # alpaca
    lr_scheduler_type='cosine', # alpaca
    logging_steps=10,
    fsdp="full_shard auto_wrap", # alpaca 
    fsdp_transformer_layer_cls_to_wrap='LlamaDecoderLayer',
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
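
If it turns out the sharding itself is working and the problem is activations, two knobs I can try directly on TrainingArguments are gradient checkpointing and FSDP CPU offload. A minimal sketch of the deltas (not a confirmed fix, just what I plan to test):

training_args = TrainingArguments(
    output_dir="./results",
    # ... same arguments as above, plus:
    gradient_checkpointing=True,          # recompute activations in backward to cut activation memory
    fsdp="full_shard auto_wrap offload",  # "offload" moves sharded params/optimizer state to CPU RAM
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)

With gradient checkpointing on a LLaMA model I believe I also need model.config.use_cache = False before training, since transformers warns that caching is incompatible with checkpointing.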

I also tried "full_shard" in this way, and the CUDA memory usage showed that every GPU simply loaded a full copy of the model, so I guess this is not the right way to get model-parameter parallelism. Unfortunately, I don't know the right way. Also, as far as I know, model parallelism is not a great idea anyway, because it slows down training. I'd recommend using int8 + PEFT to save CUDA memory instead.
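
In case it helps, here's a rough sketch of what I mean by int8 + PEFT. This assumes bitsandbytes for 8-bit loading and the peft library for LoRA; the target modules and LoRA hyperparameters are only illustrative, and with 8-bit loading you'd run plain data parallelism (one copy per GPU) rather than FSDP:

import os
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# each rank loads the 8-bit model onto its own GPU (data parallelism, not FSDP)
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    load_in_8bit=True,               # requires bitsandbytes
    device_map={"": local_rank},
)
model = prepare_model_for_int8_training(model)

# train only small LoRA adapters on top of the frozen 8-bit base model
lora_config = LoraConfig(
    r=16,                                  # illustrative values
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

If you go this route, drop the fsdp=... arguments from TrainingArguments; as far as I know an 8-bit quantized model can't be wrapped by FSDP anyway, and torchrun already gives you data parallelism across the 8 processes.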

Finally, I'm a beginner in NLP, so please forgive any mistakes in what I said.