Tips for training LongT5

TL;DR: Use --gradient_checkpointing when training LongT5 on long documents if you run into memory issues.

While experimenting with LongT5 on large source documents (max_source_length=16384), I was running into memory issues. For reference, I am using 8xA100 GPUs (40GB memory each). I did not manage to train the “large” version of the model with max_source_length > 4000 (even using DeepSpeed did not help much). I then found that with the trainer hyperparameter --gradient_checkpointing I was able to run LongT5-large with max_source_length=16384. DeepSpeed (stage 2 and 3) did not seem to help on top of that in terms of training speed, despite allowing slightly larger batch sizes (note that I am not a DeepSpeed expert, so I mostly used the “auto” settings to configure it).
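For anyone using the Trainer API directly rather than the example scripts, here is a minimal sketch of the setting that mattered for me. Everything except gradient_checkpointing and the checkpoint name is illustrative (output path, batch size, precision), and the dataset/tokenization setup is omitted:

```python
# Minimal sketch: enabling gradient checkpointing for LongT5-large with the HF Trainer.
# Only gradient_checkpointing and the model checkpoint are the point here; the rest
# (output_dir, batch size, bf16) are illustrative values.
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-large")

training_args = Seq2SeqTrainingArguments(
    output_dir="longt5-large-16k",      # illustrative output path
    per_device_train_batch_size=1,      # long inputs leave little memory headroom per GPU
    gradient_checkpointing=True,        # equivalent of passing --gradient_checkpointing
    bf16=True,                          # A100s support bfloat16
)

# Equivalently, gradient checkpointing can be switched on directly on the model:
model.gradient_checkpointing_enable()
```

Gradient checkpointing trades extra compute (re-running parts of the forward pass during backprop) for a much smaller activation memory footprint, which is what makes the 16k source length fit.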
