Really, I'm wondering why!
The true innovation is actually the transformer itself!
It's an any-input-to-any-output model!
It's only that with the LLMs we are using text-based models!
But as you know, an image can be represented as text! And we know that a sound can be represented as an image:
So we can take the text-based representation and deal with it!
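The "image as text" trick above is really just base64. A minimal round-trip sketch (the byte payload is a stand-in, not a real image file):

```python
import base64

# Stand-in for real PNG/JPEG bytes -- any binary payload works the same way.
image_bytes = bytes(range(16))

# Encode to plain ASCII text, the kind of token stream an LLM can ingest.
as_text = base64.b64encode(image_bytes).decode("ascii")

# Decode back -- the round trip is lossless.
recovered = base64.b64decode(as_text)
assert recovered == image_bytes
print(as_text)
```

The same goes in the other direction: if the model emits valid base64, you decode it back into bytes and you have a generated image.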
I was also examining the Stable Diffusions of this world, as well as the image models, and in fact they are not doing anything that the LLM cannot do!
It's only how to frame the question and how to interpret the answer!
Previously at uni, 20 years back, we trained a network to perform maths tasks!
So we know that a neural network can do maths, but why are the models so bad at math? (It's only because they were not trained to perform maths) …
There are a lot of repos on the Hub, as you're aware! And there is even a lot of good data (lots of traps too), but there is a model and data which can control a drone!
So in fact the guy did not create a custom model, he used a llama!
The Pixtral, the Qwen, etc.: these models are just renditions of the LLaVA model!
Which is comprised of a CLIP and an LLM! (easily spun up!) This is just taking advantage of the dedicated pretrained models!
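The CLIP-plus-LLM glue is essentially one learned projection. A rough numpy sketch with assumed dimensions (768 for the vision encoder, 4096 for the LLM; the real LLaVA weights are learned, these are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: CLIP feature dim, LLM hidden dim, number of image patches.
clip_dim, llm_dim, n_patches = 768, 4096, 9

# Stand-in for per-patch features coming out of a frozen CLIP encoder.
image_features = rng.standard_normal((n_patches, clip_dim))

# The LLaVA-style trick: a small learned projector maps vision features
# into the LLM's token-embedding space.
W_proj = rng.standard_normal((clip_dim, llm_dim)) * 0.01
visual_tokens = image_features @ W_proj

# The projected "visual tokens" are simply prepended to the text embeddings.
text_tokens = rng.standard_normal((5, llm_dim))  # from the LLM's embed layer
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (14, 4096)
```

That prepend-and-run design is why it is "easily spun up": the heavy parts (CLIP, LLM) are pretrained, and only the projector (plus optional fine-tuning) is new.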
But in truth we can just train our LLMs to perform the task!
By correctly framing the question!
I have had some recent (EASY) success with training the model on base64!
It liked the images represented as base64! I was able to train for captions, descriptions, as well as for generations!
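The framing is the whole game. A hypothetical sketch of how such a caption example might be built (the `<img_b64>` tag format is my own invention, not a standard one):

```python
import base64
import json

def make_example(image_bytes: bytes, caption: str) -> dict:
    """Frame an image-captioning pair so a text-only LLM can train on it."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        # Hypothetical prompt template: the image is inlined as base64 text.
        "prompt": f"Describe this image:\n<img_b64>{b64}</img_b64>\n",
        "completion": caption,
    }

# Tiny stand-in payload instead of a real image file.
ex = make_example(b"\x00\x01\x02\x03", "a tiny test pattern")
print(json.dumps(ex)[:80])
```

For generation you'd flip the pair: caption in the prompt, base64 in the completion, and decode whatever the model emits.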
When I expanded to include spectrographic images as audio (i.e. pictures of waves), it also took to them, and generated images which could be converted into sounds!
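The sound-as-image step is just a magnitude spectrogram. A rough numpy sketch with assumed parameters (window 256, hop 128, a 440 Hz test tone):

```python
import numpy as np

# One second of a 440 Hz sine at an assumed 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)

# Short windowed FFTs over the waveform.
win, hop = 256, 128
frames = [wave[i:i + win] * np.hanning(win)
          for i in range(0, len(wave) - win, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)

# Quantize to 0..255 so it really is an 8-bit grayscale image.
img = (255 * spec / spec.max()).astype(np.uint8)
print(img.shape)  # (61, 129)
```

Going back the other way needs phase reconstruction (e.g. Griffin-Lim), since the magnitude image alone drops the phase, but that's a solved, off-the-shelf step.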
So in fact it was so easy! I could not understand why people needed to diverge to a CLIP or LLaVA model! (very memory intensive)
Personally I don't know what people do with their models, as for me each model is an investment, since training is not free! (So I train on top of my past training (it's just the most enjoyable thing!)) So in fact the model does become more trained than the givens! Even ChatGPT!
So why should I even consider jumping ship to the latest lovely llama which also has image in it (in fact it's a LLaVA model!!) … when I have spent so much on uncensoring and training! (Questions I can still ask my model are now banned in the original! (Did they change it, or is it the various guardrails in all the apps and libraries?))
So I realised my tensor stack is very versatile, and all those 7B params could never all be trained! (We have only touched the surface of training the capabilities of a 7B model!)
In fact they cannot be compared to the larger models, which still have not proved their necessity? Why so many parameters? Where is the metric to say, oh yes, I have trained all my params to their limits and now I must up my params!
When they did not even train all the paradigms or possibilities, they just did a FineWeb! Crawl! TRAIN!
lol!
They are monkeys playing with neural networks!! My personal belief was that AGI was going to come from a coded model and not a cheating model like a neural net! Obviously not agentic either!
So now I'm a convert to the transformer arch!