Is iterative training advisable?

I have 1.5 million training inputs, which will take about 2 weeks to train.

My concern is: what if training fails after 10 or 13 days, towards the end of the run, and all of that time and money is wasted?

So I have been trying to figure out how to break the training into chunks: train the model, save it, continue training, save again, and so on, to avoid the scenario of it failing towards the end of the process. That way I could also test the model between training sessions to make sure the generated outputs are improving.

However, I have read some things indicating the model will “forget” what it learned in earlier training runs, and that it is advisable to train the model all at once.

Can somebody please confirm whether it is advisable to train all at once on an entire dataset as opposed to breaking it into chunks?

Also, I am using PEFT and have seen examples of merging adapters back into the base model. Does merging also run the risk of the weights not reflecting the entirety of the training, compared to training all at once?
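
The merging examples I have seen look roughly like this (the model name and adapter path are placeholders, not my actual setup):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the trained LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("base-model-name")    # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")   # placeholder

# Fold the adapter deltas back into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
```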

Any advice would be greatly appreciated.

Hi @capnchat, why not save checkpoints during training in case it crashes and then reload from the last checkpoint if needed?
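
For what it’s worth, this is roughly all it takes with the Trainer; the model, dataset, paths, and step counts below are placeholders:

```python
from transformers import Trainer, TrainingArguments

# `model` and `train_dataset` stand in for your own objects.
training_args = TrainingArguments(
    output_dir="dental-chatbot-checkpoints",  # placeholder path
    save_strategy="steps",   # write a checkpoint every `save_steps` optimizer steps
    save_steps=1000,
    save_total_limit=3,      # keep only the newest 3 checkpoints to limit disk usage
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()  # first run

# After a crash (or a deliberate stop), resume from the latest checkpoint instead:
# trainer.train(resume_from_checkpoint=True)
```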


Hi @dblakely thanks a lot for your reply.

Ok, so load my entire dataset, interrupt training periodically for testing, and then reload from checkpoints? That avoids the performance loss that could come from training sequentially on subsets of the data and saving in between. Ok, that makes sense, thanks a lot.

By the way, if you had to choose between using LoRA/PEFT to reduce memory utilization or simply using a stronger environment to train, which would you choose if maximizing performance and minimizing training time are the objectives? Is it correct that performance would be reduced using PEFT? The goal is to create a dental chatbot trained on conversations between dentists, so I need to convince them to pay for a stronger environment if that is advisable.

I have only been experimenting with PEFT because of environment constraints, but I assume using PEFT affects the model’s text-generation performance and increases the time it takes to train. So if we had a stronger environment, is it correct to assume that training without PEFT would take less time and that the model would potentially yield more impressive inferences?
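
For context, my PEFT experiments so far look roughly like this (the model name is a placeholder and the hyperparameters are just values I have been trying, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder

lora_config = LoraConfig(
    r=16,               # adapter rank; example value
    lora_alpha=32,      # scaling factor; example value
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model
```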

Ok, so load my entire dataset, interrupt training periodically for testing, and then reload from checkpoints? That avoids the performance loss that could come from training sequentially on subsets of the data and saving in between. Ok, that makes sense, thanks a lot.

Yes, you could do it that way. You can also try running generation with the current model during training so you don’t have to halt. This post gives an example of how you could do this.

I assume you’re saying you don’t have the hardware to train and then evaluate the model on the side without halting your training run? Worth noting that inference is a lot less resource intensive than training, so you might be able to evaluate on the side (e.g., on a weaker machine) without halting training.
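
For example, something along these lines could run on a separate, weaker machine against whatever checkpoint the training run last wrote. The paths and prompt are placeholders, and it assumes full-model checkpoints; with PEFT you would load the base model and attach the saved adapter instead:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("base-model-name")  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    "dental-chatbot-checkpoints/checkpoint-1000"  # placeholder: latest saved checkpoint
)

prompt = "Example prompt drawn from a dentist conversation."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```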

By the way, if you had to choose between using LoRA/PEFT to reduce memory utilization or simply using a stronger environment to train, which would you choose if maximizing performance and minimizing training time are the objectives? Is it correct that performance would be reduced using PEFT? The goal is to create a dental chatbot trained on conversations between dentists, so I need to convince them to pay for a stronger environment if that is advisable.

Unfortunately, I don’t have much personal experience with LoRA or PEFT. My understanding is that they degrade performance a bit compared to full fine-tuning, but usually it’s acceptable. That said, for a user-facing commercial application, a full fine-tune would be my preference. It’s also worth considering that training costs are usually a small fraction of the total cost of a model once you account for serving it over its lifetime.

Also, if you haven’t already, you might want to consider tuning the batch size, using flash attention, enabling gradient checkpointing, etc. to reduce training time and get more mileage out of your training budget. This is a good resource too if you’re not familiar.
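
As a rough sketch of the kind of settings I mean (the values are illustrative; flash attention needs a supported model, a recent GPU, and the flash-attn package installed):

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Flash attention is requested at model load time.
model = AutoModelForCausalLM.from_pretrained(
    "base-model-name",                        # placeholder
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

training_args = TrainingArguments(
    output_dir="dental-chatbot-checkpoints",  # placeholder path
    per_device_train_batch_size=8,    # tune to the largest size that fits in memory
    gradient_accumulation_steps=4,    # simulate a larger effective batch size
    gradient_checkpointing=True,      # trade extra compute for a big memory saving
    bf16=True,                        # mixed precision, if the hardware supports it
)
```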
