Adding another head to Vision encoder decoder model

I want to add another head to a VisionEncoderDecoder model, namely Donut, but when I do so the second head doesn't seem to learn anything (the loss barely decreases). Additionally, I can't use the `generate` function with two heads.
So, is there a tutorial, or does someone have a notebook or a hint that can help me make such modifications?

When you create the model it has two components: one is the image part and the other the LLM. So you can use the Donut model for this part. Then save the model as pretrained (convert to fp16 first), and you can treat it like a fresh model and train it!
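The convert-to-fp16-then-save-pretrained step might look like the following. This is a minimal sketch: tiny randomly initialized configs stand in for the full Donut checkpoint so it runs quickly, and the save path is a placeholder.

```python
import torch
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny random configs stand in for the real Donut weights here.
enc_cfg = ViTConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                    intermediate_size=64, image_size=32, patch_size=8)
dec_cfg = GPT2Config(n_embd=32, n_layer=1, n_head=2, vocab_size=100,
                     add_cross_attention=True, is_decoder=True)
cfg = VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = VisionEncoderDecoderModel(cfg)

# Convert to fp16, then save it like any pretrained model.
model = model.half()
model.save_pretrained("./ved-fp16")

# Later: reload it as if it were a fresh pretrained model and fine-tune.
reloaded = VisionEncoderDecoderModel.from_pretrained(
    "./ved-fp16", torch_dtype=torch.float16
)
```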

I also did a similar project to this, and it worked fine! The problem was that the GGUF could not be made, as it is not a compatible model, so it will need to run as weights.

Thank you for your answer!
So, you are advising me to save my model after adding the second head, reimport it, and then train it. Did I get it right?

Yes, as the memory in Colab times you out and can disconnect the runtime.

Later I made a Mistral with fewer layers (i.e. 1B, but training was taking too long). So it's probably right to add a fully trained model despite its size. LeroyDyer/Mixtral_AI_MiniTronVision was a model I used the small brain on!
LeroyDyer/Mixtral_AI_MiniTronSpeech << the speech version :slight_smile:

One is a VisionEncoderDecoder and the other a SpeechEncoderDecoder :slight_smile:

Vmodel = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "LeroyDyer/Mixtral_AI_Tiny",
)
_Encoder_ImageProcessor = Vmodel.encoder
_Decoder_ImageTokenizer = Vmodel.decoder
_VisionEncoderDecoderModel = Vmodel



# Add pad tokens
LM_MODEL.VisionEncoderDecoder = _VisionEncoderDecoderModel

# Add sub-components
LM_MODEL.Encoder_ImageProcessor = _Encoder_ImageProcessor
LM_MODEL.Decoder_ImageTokenizer = _Decoder_ImageTokenizer
LM_MODEL
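The pad-token step is usually what unblocks `generate` on a VisionEncoderDecoderModel: it refuses to generate until `decoder_start_token_id` and `pad_token_id` are set on its config. A minimal sketch with tiny random configs; the token ids 0 and 1 are placeholders that would come from your decoder tokenizer in practice:

```python
import torch
from transformers import (
    GPT2Config,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

enc_cfg = ViTConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
                    intermediate_size=64, image_size=32, patch_size=8)
dec_cfg = GPT2Config(n_embd=32, n_layer=1, n_head=2, vocab_size=100,
                     add_cross_attention=True, is_decoder=True)
model = VisionEncoderDecoderModel(
    VisionEncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
)

# Mirror the decoder tokenizer's special tokens onto the model config;
# generate() fails without decoder_start_token_id / pad_token_id.
model.config.decoder_start_token_id = 0  # e.g. tokenizer.bos_token_id
model.config.pad_token_id = 1            # e.g. tokenizer.pad_token_id
model.generation_config.decoder_start_token_id = 0  # newer transformers read this
model.generation_config.pad_token_id = 1

# One dummy 32x32 RGB image is enough to check that generation runs.
pixel_values = torch.randn(1, 3, 32, 32)
ids = model.generate(pixel_values, max_length=5)
```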

At that point, convert to fp16 and save with save_pretrained!

But really it should be one-shot trained! So you need to have a training run ready for this model before saving, if you have the GPU(s) and memory :slight_smile:
But when you instantiate the model at the beginning it takes up to 35 GB of GPU memory, hence trial and error!
So after this you still need at least 5-10 GB of memory to run a training, as the models are in memory! << Issue!
So I offloaded the model to disk (saved as pretrained, and suffered the loss of the first pass), but after SFT training it is still fine; it just takes a little longer. Hence tiny datasets to begin training random models, until they overfit to the dataset, enabling the first task to embed into the model :slight_smile:
On the next training run it can hopefully begin converging much quicker!
Models do not converge instantly; they need many samples and many epochs (best to use a small dataset of ~1k) and to have a few different ones ready! So once the first is overfit, go to the second and you will see how far away it is!
So repeat until it begins converging quicker, then you can train the model properly (also keep changing the LoRA config so you get different parameters to tune)!