Trainer API TrainingArguments (where to find more info)

Hello, I am working with the Trainer API and wanted to know if you have a source where the TrainingArguments are explained in more detail. I don't have the vocabulary to understand what each of them means. While training, I am running into a CUDA OOM error and I couldn't pinpoint which arguments control GPU memory usage.

Hi,

Would recommend this guide for efficiently training on your GPU: Methods and tools for efficient training on a single GPU.

I’d recommend starting with a small batch size, using gradient accumulation/checkpointing, and switching to a memory-efficient optimizer like 8-bit Adam, etc.
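Roughly, those settings map onto TrainingArguments like this (just a sketch; the exact values are placeholders to tune for your GPU, and the 8-bit optimizer needs bitsandbytes installed):

```python
from transformers import TrainingArguments

# Sketch of memory-saving settings; the numbers are placeholders to tune.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # start small and raise it until you hit OOM
    gradient_accumulation_steps=16,   # keeps the effective batch size reasonable
    gradient_checkpointing=True,      # trades extra compute for less activation memory
    fp16=True,                        # half precision (use bf16=True on Ampere or newer)
    optim="adamw_bnb_8bit",           # 8-bit Adam via bitsandbytes
)
```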

Also, if you’re training an LLM, it might help to use LoRA instead of full fine-tuning.
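If you do go the LoRA route, the peft library only needs a few extra lines. A minimal sketch, assuming a seq2seq checkpoint and attention-projection target modules (both are just example choices to adapt):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Sketch only: the checkpoint, rank, and target_modules are example choices.
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-cnn_dailymail")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                                  # low-rank adapter size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```

The frozen base weights still sit in GPU memory, but the optimizer states (a common OOM culprit with Adam) only exist for the small adapter matrices.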

I should point out the particulars: I am using the pegasus-dailymail-cnn model, and I have a small dataset with a train/test/val split of 14,372 / 819 / 819 examples. My parameters are currently set as follows, and I am still getting OOM errors. I will check out the link you gave, but I would really welcome any quick ideas based on my current parameters.
params