OK, so load my entire dataset, interrupt training periodically for testing, and then reload from checkpoints? That way I avoid the performance loss that comes from training on subsets of the data sequentially and saving in between. That makes sense, thanks a lot.
Yes, you could do it that way. You can also try running generation with the current model during training so you don’t have to halt. This post gives an example of how you could do this.
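If you're using the Hugging Face Trainer, a callback is one way to do this; here's a rough sketch (the callback name and prompt are just placeholders, not anything specific to your setup):

```python
import torch
from transformers import TrainerCallback

class SampleGenerationCallback(TrainerCallback):
    """Generate a sample from the current weights each time the Trainer runs evaluation."""

    def __init__(self, tokenizer, prompt="A patient asks about tooth sensitivity."):
        self.tokenizer = tokenizer
        self.prompt = prompt  # placeholder prompt; swap in something from your domain

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        # Runs after each evaluation pass, without stopping the training run.
        inputs = self.tokenizer(self.prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=64)
        print(f"step {state.global_step}:",
              self.tokenizer.decode(output_ids[0], skip_special_tokens=True))

# trainer.add_callback(SampleGenerationCallback(tokenizer))
```

This only fires if you've configured an evaluation strategy (e.g., evaluating every N steps), so you'd see a sample alongside each eval.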
I assume you’re saying you don’t have the hardware to train and then evaluate the model on the side without halting your training run? Worth noting that inference is a lot less resource-intensive than training, so you might be able to evaluate on the side (e.g., on a weaker machine) without halting training.
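If you go the side-evaluation route, it can be as simple as copying whichever checkpoint directory the Trainer last wrote over to the other machine and generating from it; a minimal sketch, where the checkpoint path and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "output/checkpoint-500"  # hypothetical path written by the Trainer
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.float16)
model.eval()

prompt = "What should I do about bleeding gums?"  # placeholder test prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

(This assumes the tokenizer was saved alongside the model in the checkpoint directory; if not, load it from the base model instead.)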
By the way, if you had to choose between using LoRA/PEFT to reduce memory utilization or simply using a stronger environment to train, which would you choose if maximizing performance and minimizing training time are the objectives? Is it correct that PEFT would reduce performance? The goal is to create a dental chatbot trained on conversations between dentists, so I need to convince them to pay for a stronger environment if that's advisable.
Unfortunately, I don’t have much personal experience using LoRA or PEFT. My understanding is that they degrade performance a bit compared to full fine-tuning, but it’s usually acceptable. That said, for a user-facing commercial application, a full fine-tune would be my preference. Also worth considering that training costs are usually a small fraction of the total cost of the model once you account for serving it over its lifetime.
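In case it helps with the decision: a LoRA setup with the peft library is typically only a few lines on top of your existing training code. A minimal sketch, where the model name, rank/alpha, and target modules are illustrative rather than recommendations for your case:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder name

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # depends on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few weights actually train
```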
Also, if you haven’t already, you might want to consider tuning the batch size, using flash attention, enabling gradient checkpointing, etc. to reduce training time and get more mileage out of your training budget; see the sketch below. This is a good resource too if you’re not familiar.
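A hedged sketch of what those knobs look like with the Hugging Face Trainer; the specific values are starting points to tune, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                        # placeholder name
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn and a supported GPU
)

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,    # tune upward until GPU memory is nearly full
    gradient_accumulation_steps=4,    # trade more steps for a larger effective batch
    gradient_checkpointing=True,      # saves memory at some extra compute cost
    bf16=True,                        # mixed precision if the hardware supports it
)
```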