Hi all,
I’m training a Decision Transformer model on my own dataset, but I’ve run into inconsistent training times. Initially, training was quite slow (around 130 s/it), so I switched to the Accelerate library and enabled TF32 to speed things up. After these changes, the training time dropped significantly (to around 27 s/it), and the DataCollator also ran much faster (from ~1 min 39 s down to 19–20 s).
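For context, here is a stripped-down sketch of how I enabled TF32 and wired things up with Accelerate. The model and data below are dummy placeholders just to show the wiring; the actual Decision Transformer training script is in the Drive link further down.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# TF32 toggles (this is what I mean by "enabled TF32")
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Dummy model/data standing in for the real Decision Transformer setup
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Accelerate handles device placement and the backward pass
accelerator = Accelerator()
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
```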
However, after running the code a few more times, the training time reverted to the original slow speed (~130 s/it) and has stayed there ever since. To rule out resource contention, I close all other applications while training, so I’m not sure what’s causing the regression.
Here is the code and dataset for anyone interested in reproducing the issue and troubleshooting:
Code & Dataset: [accelerate_test - Google Drive]
System Specifications:
- OS: Ubuntu 24.04.1 LTS
- Kernel: 6.8.0-48-generic
- CUDA Version: 12.0
- PyTorch Version: 2.5.1+cu124
- Accelerate Version: 1.1.1
- Hardware:
  - GPU: NVIDIA GeForce RTX 4060 Laptop (8 GB VRAM)
  - CPU: Intel Core i7-14650HX
  - RAM: 32 GB
Any help or suggestions on how to solve this issue would be greatly appreciated!