Continuous training on Fine-tuned Model

:rocket: Feature request

How can I continue training on a Fine-tuned Model?
I have a model fine-tuned on OpenSLR data, and I want to continue training it as I gain more transcribed audio data over time. Can I use the fine-tuned model as a checkpoint and resume training from it?


I am aiming to build a model for the Nepali language. I have a way to collect data continuously, so I want to find a way to keep training the model as new data arrives.

In PyTorch, you can further train a model just by putting it in training mode (model.train()), and then train as usual. This will update the parameters of the model.
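As a rough illustration of resuming from a fine-tuned checkpoint, here is a minimal sketch. The tiny `nn.Linear` model, the random batches, and the helper name `continue_training` are hypothetical stand-ins, not the speech model from this thread; in practice the state dict would normally be written with `torch.save` and read back with `torch.load`.

```python
import copy

import torch
from torch import nn


def continue_training(model, batches, lr=0.01):
    """Run extra gradient steps on an already-trained model; return per-step losses."""
    model.train()  # training mode only toggles dropout/batch-norm behaviour
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    losses = []
    for inputs, targets in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()  # this is the step that actually updates the weights
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
finetuned = nn.Linear(4, 1)  # hypothetical stand-in for the fine-tuned model
checkpoint = copy.deepcopy(finetuned.state_dict())  # in-memory "checkpoint"

model = nn.Linear(4, 1)
model.load_state_dict(checkpoint)  # resume from the fine-tuned weights

# Hypothetical "new data": five random batches
new_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
losses = continue_training(model, new_batches)
```

Note that `model.train()` by itself changes nothing in the weights; the updates come from the optimizer steps inside the loop.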

Thank you for your reply.
I tried this as well. My old dataset and the new dataset contain different texts.
This method makes the model lean heavily towards the new text. Utterances that were transcribed correctly after the first training run now get transcripts biased towards the newer texts, so even previously correct transcriptions become wrong.

Perhaps you can try freezing some layers, and only fine-tune a specific layer.

In PyTorch, this can be done as follows:

for name, param in model.named_parameters():
    if name == ...:
        param.requires_grad = False  # single "=" assigns; "==" would only compare
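A slightly fuller sketch of the freezing idea, assuming parameters are selected by name prefix. The helper `freeze_by_prefix` and the toy `nn.Sequential` model are hypothetical; which layers are worth freezing in a real speech model is not something this thread settles.

```python
import torch
from torch import nn


def freeze_by_prefix(model, prefixes):
    """Disable gradients for parameters whose name starts with any prefix."""
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False  # assignment, not comparison
            frozen.append(name)
    return frozen


# Toy model: parameter names are "0.weight", "0.bias", "2.weight", "2.bias"
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
frozen = freeze_by_prefix(model, ["0."])

# Hand only the still-trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
```

Passing only the trainable parameters to the optimizer is optional, since frozen parameters receive no gradients anyway, but it makes the intent explicit.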

Will try this. Thank you!

Note that the assignment needs a single equals sign: param.requires_grad = False (with ==, the line only compares and changes nothing). But I don’t see how freezing would help OP. Can you clarify that?

@noskid The best approach is to simply always fine-tune on the WHOLE dataset (old + new, preferably shuffled) so that the model is not biased toward any specific subset.
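One way to sketch "old + new, shuffled" in PyTorch is with `ConcatDataset` and a shuffling `DataLoader`. The two `TensorDataset` objects here are hypothetical placeholders for the real audio/transcript datasets.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder datasets: label 0 marks "old" samples, label 1 marks "new" ones
old_data = TensorDataset(torch.zeros(6, 3), torch.zeros(6, dtype=torch.long))
new_data = TensorDataset(torch.ones(4, 3), torch.ones(4, dtype=torch.long))

# Treat both datasets as one, then let the loader shuffle across the boundary
combined = ConcatDataset([old_data, new_data])
loader = DataLoader(combined, batch_size=5, shuffle=True)

for features, labels in loader:
    # Each batch can now contain a mix of old and new samples
    pass
```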

I also thought about this as an alternative.
But since my datasets are very large, the training time would accumulate to a very large amount. So I wanted to know whether any other methods are available.
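A commonly used compromise, not proposed by anyone in this thread and offered only as a sketch, is to mix all of the new data with a random sample of the old data (often called rehearsal or replay), so each update round stays bounded in size. The helper `replay_mix`, the `replay_fraction` value, and the placeholder datasets are all assumptions; whether this preserves accuracy on the old texts would have to be measured.

```python
import torch
from torch.utils.data import ConcatDataset, Subset, TensorDataset


def replay_mix(old_dataset, new_dataset, replay_fraction=0.1, seed=0):
    """Combine all new samples with a random fraction of the old samples."""
    generator = torch.Generator().manual_seed(seed)
    n_replay = max(1, int(len(old_dataset) * replay_fraction))
    # Random sample of old indices, without replacement
    indices = torch.randperm(len(old_dataset), generator=generator)[:n_replay].tolist()
    return ConcatDataset([Subset(old_dataset, indices), new_dataset])


# Placeholder datasets standing in for the old and newly collected data
old_data = TensorDataset(torch.arange(100, dtype=torch.float32).unsqueeze(1))
new_data = TensorDataset(torch.arange(20, dtype=torch.float32).unsqueeze(1))

mixed = replay_mix(old_data, new_data, replay_fraction=0.1)
```

The resulting dataset can be passed to a shuffling `DataLoader` in the same way as a full old + new combination.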

I have been looking at this problem in terms of token usage for training, purely from a CFO standpoint: the cost of tokens for an NPC character in game development. I think there is a practical continuous-time problem here: fine-tuning is unable to insert new data into a model without a relearning process from scratch, and that changes the output because it changes the knowledge representation (biasing). It is a bit like missing one opportunity in your own life and everything changing afterwards; you might respond differently because of that one butterfly-effect moment.

Freezing layers may limit the granularity of the response: when the data changes, the frozen weights inhibit the summing over epochs, or rather the activation of tensor values (weight activation) that gives a response its depth, and that depth determines the role continuously updated fine-tuning plays in the AI's behavior as it learns continuously.

Arguably Google, Allen Brain, and OpenAI (GPT-3) are working to develop pretraining methods that excise data (potentially pre-censoring social values in large datasets, for example by removing all religious references) to make transformer technology faster, better, and cheaper. Yet I feel this stripping of data precedes a massive AI movement, driven by compute resources, that will censor datasets for efficiency in order to enable a continuous fine-tuning method. Such a method would essentially pretrain a fine-tuned model with new data, stepping layers in such a way that managing a continuous update is, in essence, pretraining on a long-term set.

Think of your continuous update as a short-term memory buffer that has to mediate a pretraining pass of the fine-tuned model, like synaptic consolidation every 24 hours. Only certain weights will change. But the long-term memory dataset, at least the initial training set, needs a minimum size to do this with: 10,000 lines at 2048 tokens.
Obviously I’m not a full programmer; I am in a theory role looking at cost exploits in a tokenization problem. But I think you need to establish a cycle over 24 hours, because the continuous or contiguous nature of human reasoning will never be fully captured until a neurosymbolic computer chip can assess backpropagation through time and structure so as to mediate temporal synchronicity with a near-quantum logic of data exchange. So buffering your input as a short-term memory buffer each day, with synaptic consolidation as the theoretical model, may be the way to think about how you treat long-term memory, i.e. the fine-tuned dataset, for updates in pretraining C-RNN chunks. Freezing layers would be like purposely implementing Alzheimer’s in your fine-tuned model.

But again, I’m just the cost-benefit planner for a techint in this domain, and it’s a hard problem without neurosynaptic cores or neurosymbolic technology to create a base of new neural states. My strong suggestion is to plan flexibility for a market that can adjust, with AR/AI-mediated classifiers using strongly spatial data with BPTS methods, since the balance of time and space will eventually raise non-locality problems with memory consolidation, in the form of the constantly adjusted, pervasive fine-tuning that cognitive guidance and navigation problems require.

You need a unilateral human neurobiology problem, such as the role of the inferior parietal cortex in storing geospatial knowledge representations, that adjusts during spatial events, right? Meaning: where are constant updates most associated with human reasoning and spatial tensors (tensor canonicalization)? The argument is that the problem you need will create the remedy you are searching for, since BPTS/BPTT problems with summation over epochs and activation functions fail to require continuous fine-tuning to reach an accept state(s). Every neuron has locality and a CAM or CAS, right?
Locality and the physical locations of neurons are states in human neurobiology: weights activate at particular locations, lest all neurons be in one place. Continuous weight adjustment will never be feasible until the underlying weights have content-addressable memory tied to activations that have a duality with externalized memory. That is, fine-tuning represents the externalized input value you seek to adjust in continuous time, and it sits unilaterally on a spatial layer for adjustment in planning and prediction problems. Sure, the data may come from a static sensor array, but if it is derived from moving systems, the fine-tuning must understand movement and motion, so the space/time computing revolutions of BPTT/BPTS and neurosymbolic computing should be considered. Invariably, quantum computing methods will undermine locality problems while neurosymbolic methods will empower non-locality problems.

Anyway, I can’t wait to see what method was implemented and which major company buys into a flawed quantum computing solution, i.e. a metaverse quantum echo chamber and deep AI recursion of user data for market steerability, where a geospatial knowledge representation guides AR users to product placement, with pervasive fetishes of infinite divisibility problems guiding consumer behavior into a never-ending feedback loop that contains the human species within a fine-tuning reality - the future is adjustable.