I’m using transformers==4.29.0 and running the training script with:
torchrun --nproc_per_node=8 train-script.py
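(For reference, torchrun spawns one Python process per GPU here and sets RANK, LOCAL_RANK, and WORLD_SIZE in each; a minimal check to tell the processes apart:)

import os

# torchrun sets these per spawned process; with --nproc_per_node=8 on a
# single node, rank and local_rank both range over 0-7 and world_size is 8.
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"rank={rank} local_rank={local_rank} world_size={world_size}")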
I am unable to train a 3B-parameter model on a p4d.24xlarge due to out-of-memory errors. This instance has 8 GPUs with 40 GB each, for a total of 320 GB of GPU memory. Is this normal? All 8 GPUs filled up approximately uniformly until they breached the memory limit. What can I change in my configuration? Note that I am loading the model as shown below, without any explicit device placement (such as device_map="auto").
The torchrun command launches 8 Python processes. All 8 of them start by loading my pandas DataFrame, and each one separately tokenizes the same data. This makes me concerned that the model is also being loaded 8 separate times. The tokenization stage certainly uses far more memory than I would consider reasonable for a tokenizer. Is torchrun supposed to automagically arrange the model loading so that it happens only once and the model is sharded properly?
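To make the duplicate-tokenization point concrete, this is roughly how I imagine gating the preprocessing so that only the main process tokenizes while the other ranks wait and read the cache (a sketch assuming the standard datasets.map pattern; tokenize_fn stands in for my real function, and training_args would have to be created before this point):

with training_args.main_process_first(desc="tokenizing dataset"):
    # The main process runs the map and writes the cache; the other
    # ranks block here and then load the cached result instead of
    # re-tokenizing the same data.
    tokenized_dataset = dataset.map(
        tokenize_fn,
        batched=True,
        remove_columns=dataset.column_names,
    )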
model = AutoModelForCausalLM.from_pretrained(ckpt)
dataset = Dataset.from_pandas(df)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
# ... tokenize dataset ...
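As far as I understand, each of the 8 processes materializes its own full copy of the model before FSDP shards it, so I also wonder whether loading directly in bf16 would at least halve that initial footprint (a sketch of the alternative load; ckpt is the same checkpoint path as above):

import torch

# Sketch: load the weights in bf16 so each process holds a half-precision
# copy during initialization instead of the default fp32 copy.
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)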
training_args = TrainingArguments(
    output_dir="./results",
    bf16=True,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # alpaca
    evaluation_strategy="no",  # alpaca
    save_steps=10_000,
    save_total_limit=1,  # alpaca
    learning_rate=2e-5,  # alpaca
    weight_decay=0.0,  # alpaca
    warmup_ratio=0.03,  # alpaca
    lr_scheduler_type="cosine",  # alpaca
    logging_steps=10,
    fsdp="full_shard auto_wrap",  # alpaca
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
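Based on the TrainingArguments documentation, the knobs I'm considering next are activation checkpointing and FSDP CPU offload (a sketch showing only the changed arguments; I haven't confirmed this resolves the OOM):

training_args = TrainingArguments(
    output_dir="./results",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # recompute activations in backward to save memory
    fsdp="full_shard auto_wrap offload",  # "offload" moves params/grads to CPU
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    # ... remaining arguments as above ...
)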