Why does disabling gradient checkpointing reduce throughput with FSDP?

I have been fine-tuning a Llama-style model on 8×40GB A100 GPUs with FlashAttention and FSDP. While experimenting with combinations of training arguments, I found that turning *off* gradient checkpointing actually reduced training throughput, which surprised me: gradient checkpointing adds re-computation, so I expected disabling it to make training faster, not slower. I suspect FSDP is involved, since for smaller models turning gradient checkpointing off does improve throughput. What could cause this?
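To make the trade-off I'm asking about concrete, here is a minimal, self-contained sketch in plain PyTorch (not my actual training script, and no FSDP involved) showing that checkpointing discards activations and re-runs the forward pass during backward:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Count forward executions to show the re-computation that
# gradient checkpointing performs during the backward pass.
calls = {"fwd": 0}

w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16, requires_grad=True)

def block(inp):
    calls["fwd"] += 1
    return torch.relu(inp @ w)

# Without checkpointing: the forward runs once; its activations
# are kept in memory for the backward pass.
calls["fwd"] = 0
block(x).sum().backward()
print(calls["fwd"])  # 1

# With checkpointing: activations are not stored, so the block is
# re-run during backward, trading extra compute for less memory.
calls["fwd"] = 0
checkpoint(block, x, use_reentrant=False).sum().backward()
print(calls["fwd"])  # 2
```

Naively, that extra forward pass should only ever cost throughput, which is why the FSDP result confused me.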