Hi all,
I’m training a Decision Transformer model on my own dataset, but I’ve run into inconsistent training times. Initially, training was quite slow (around 130 s/it), so I switched to the Accelerate library and enabled TF32 to speed things up. After these changes, the training time dropped significantly (to around 27 s/it), and the DataCollator also ran much faster (from ~1 min 39 s down to 19–20 s).
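For context, here is a stripped-down sketch of how I enabled TF32 and wired things up with Accelerate. The model and data below are dummy placeholders just to show the wiring; the actual Decision Transformer training script is in the Drive link further down.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# TF32 toggles (this is what I mean by "enabled TF32")
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Dummy model/data standing in for the real Decision Transformer setup
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Accelerate handles device placement and the backward pass
accelerator = Accelerator()
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
```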
However, after running the code a few more times, the training time reverted to the original slow speed (~130 s/it) and has stayed there ever since. To rule out resource contention, I close all other applications while training, so I’m not sure what’s causing the regression.
Here is the code and dataset for anyone interested in reproducing the issue and troubleshooting:
Code & Dataset: [accelerate_test - Google Drive]
System Specifications:
- OS: Ubuntu 24.04.1 LTS
- Kernel: 6.8.0-48-generic
- CUDA Version: 12.0
- PyTorch Version: 2.5.1+cu124
- Accelerate Version: 1.1.1
- Hardware:
  - GPU: NVIDIA GeForce RTX 4060 Laptop (8 GB VRAM)
  - CPU: Intel Core i7-14650HX
  - RAM: 32 GB
Any help or suggestions on how to solve this issue would be greatly appreciated!