Multimodal models ( DO WE REALLY NEED THEM? ) Can a causal LM suffice?

HA!

Indeed, yes!

LeroyDyer/SpydazWeb_AI_LCARS_TextVision_002

My current model has begun training to handle images and audio: both generation and description!

HOW?

( IMAGES )
I have found that by converting an image into base64, you can present that image to the model as text!

Hence, by training the model to handle base64, it can obviously handle images!
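A minimal sketch of the image-to-text step (the prompt template and stand-in bytes are my own illustration, not the author's exact format):

```python
import base64

# Stand-in bytes; a real pipeline would read actual file contents,
# e.g. open("cat.png", "rb").read().
image_bytes = b"stand-in PNG bytes"
b64_text = base64.b64encode(image_bytes).decode("ascii")

# The base64 string is plain text, so it can sit inside an ordinary prompt.
prompt = (
    "### Instruction:\nDescribe this image.\n"
    f"### Image (base64):\n{b64_text}\n"
    "### Response:\n"
)
```

The key point is that base64 is lossless: decoding the string recovers the exact image bytes, so nothing is lost by treating the image as text.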

( SOUNDS )

By converting the sound to a spectrographic image, we can then convert that image to base64 and process the image of the sound, generating a description etc.!
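A sketch of that sound-to-image-to-base64 chain, assuming NumPy and a synthetic tone in place of real audio (the framing and scaling choices here are illustrative, not the author's exact recipe):

```python
import base64
import numpy as np

# 1 second of a 440 Hz tone as a stand-in for real audio.
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Frame the signal and take magnitude FFTs: a simple spectrogram.
frame, hop = 256, 128
window = np.hanning(frame)
starts = range(0, len(audio) - frame, hop)
spec = np.abs(np.stack([np.fft.rfft(audio[i:i + frame] * window) for i in starts]))

# Log-scale and quantise to 8-bit "pixels", as if it were a grayscale image.
img = np.log1p(spec)
pixels = (255 * img / img.max()).astype(np.uint8)

# Base64-encode the pixel bytes so the "image of the sound" fits in a text prompt.
b64 = base64.b64encode(pixels.tobytes()).decode("ascii")
```

Going the other way (model output back to sound) would invert these steps: decode the base64, reshape the pixels, and resynthesise audio from the spectrogram.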

Here I have also been careful to train description and generation side by side!

So the model can generate sound as an image, and also generate images as images (base64),

as well as interpret these !

So with a normal causal LLM!

We can weave these elements in as a JSON object (or even just inside the prompt)!
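For example, a JSON object carrying text and base64 media together could look like this (the field names are my own illustration, not a fixed schema):

```python
import base64
import json

# Illustrative message schema: text and base64-encoded media in one object.
message = {
    "task": "describe",
    "text": "What is in this image?",
    "image_base64": base64.b64encode(b"stand-in image bytes").decode("ascii"),
}

# The whole object serialises to a string, so a plain causal LM can read it.
prompt = json.dumps(message)
```

Because the object round-trips through `json.dumps`/`json.loads`, the media survives intact inside an ordinary text prompt.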

Hmm, why are people creating diffusion models, when the model can do EVERYTHING?

It is a transformer, which is a neural network (hence it can be trained to do anything!); the causal LM is only the forward function!

Watch out for the models I am training from this one, as they will start to become multimodal without becoming a diffuser, a LLaVA, or a CLIP model!

( in fact those have become moot )

Question!

What are your thoughts on this?

I have trained only a 7B and am still training, so we can prompt-tune the model!

Using some of the Stable Diffusion data!
Let's see how it goes!

1 Like

Well, the multimodal language model is just evolving; it is at a different stage in its evolution than the SD-based model.

When technology reaches the practical stage, people often seek average performance, energy efficiency, cheapness, availability, and a full array of related resources rather than versatility and peak performance.
SD1.5 and SDXL are good examples of this, as there are a number of models at the same level of potential, but SD-based models have a base model that has been trained to meet the needs of many people, plus LoRA for specific people, objects, painting styles, compositions, and poses, and even tagging tools and scripts for training. Everything is in place, so it is used by everyone and enters a virtuous circle.
It’s also inexpensive to train only one UNET, and it’s a nice common language: 16-bit precision, 1.5GB for SD1.5, 5GB for SDXL.

Bicycles, Windows, and Python are useful, right? Not because they are superior, but because they are widespread. It’s a prank of history.
No one can swim with better energy efficiency than a fish. The specialized structure of the end result of evolution has that much advantage.
I don’t think the current SD models have fully reached these realms, but I think they are among the few that are already moving in that direction.

It would be fun to have a multimodal model that can evolve by eating SD-based UNETs.

1 Like

Really, I'm wondering why!

The true innovation is actually the transformer itself!

It's an any-input to any-output model!

It's only that with the LLMs we are using text-based models!
But as you know, an image can be represented as text, and we know that a sound can be represented as an image:
So we can take the text-based representation and deal with it!

I was also examining the Stable Diffusers of this world, as well as the image models, and in fact they are not actually doing anything that the LLM cannot do!
It's only a matter of how to frame the question and how to interpret the answer!

Previously, at uni 20 years back, we trained a network to perform maths tasks!
So we know that a neural network can do maths; so why are the models so bad at maths? (It's only because they were not trained to perform maths) …

There are a lot of repos on the Hub, as you're aware, and there is even a lot of good data (lots of traps too), but there is a model and data which can control a drone!
So in fact the guy did not create a custom model; he used a Llama!
Pixtral, Qwen, etc.: these models are just renditions of the LLaVA model,
which is comprised of a CLIP and an LLM (easily spun up!). This is just taking advantage of the dedicated pretrained models!
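The CLIP-plus-LLM composition can be sketched schematically; the function and toy components below are stand-ins to show the data flow, not a real LLaVA API:

```python
# Schematic of a LLaVA-style forward pass: a vision encoder produces patch
# features, a projector maps them into the LLM's token-embedding space, and
# the LLM attends over [image tokens + text tokens].
def llava_forward(image, text_tokens, vision_encoder, projector, llm):
    patch_feats = vision_encoder(image)                 # CLIP-like patch features
    image_tokens = [projector(f) for f in patch_feats]  # into LLM embedding space
    return llm(image_tokens + text_tokens)              # image tokens precede text

# Toy components just to make the sketch executable.
vision_encoder = lambda img: [[float(p)] for p in img]  # one feature per "patch"
projector = lambda feat: [2.0 * feat[0]]                # linear map, d_llm = 1
llm = lambda tokens: len(tokens)                        # pretend output: token count

out = llava_forward([1, 2, 3], [[0.5], [0.7]], vision_encoder, projector, llm)
```

The extra vision tower and projector are precisely the memory cost the base64 approach avoids, at the price of much longer text sequences.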

But in truth we can just train our LLMs to perform the task,
by correctly framing the question!

I have had some recent (EASY) success with training the model on base64!
It took well to images represented as base64; I was able to train for captions and descriptions as well as for generation!
When I expanded to include spectrographic images as audio (i.e. pictures of waves), it took to them as well, generating images which could be converted back into sounds!
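The side-by-side description/generation setup could look like this as training pairs (the prompt wording and field names are hypothetical, not the author's actual dataset format):

```python
import base64

img_b64 = base64.b64encode(b"stand-in image bytes").decode("ascii")

# Hypothetical training pairs: description (image -> text) and generation
# (text -> image) trained side by side, so the mapping is learned both ways.
pairs = [
    {"prompt": f"Describe this image:\n{img_b64}", "response": "a red square"},
    {"prompt": "Generate an image of: a red square", "response": img_b64},
]
```

Training both directions on the same data is what lets a single causal LM both caption and generate without any extra heads.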

So in fact it was so easy! I could not understand why people needed to diverge to a CLIP or LLaVA model (very memory intensive)!

Personally, I don't know what people do with their models; for me, each model is an investment, as training is not free! (So I train on top of my past training; it's just the most enjoyable thing!) So in fact the model does become more trained than the givens, even ChatGPT!

So why should I even consider jumping ship to the latest lovely Llama which also has image support (in fact it's a LLaVA model!) … when I have spent so much on uncensoring and training! (Questions I can still ask my model are now banned in the original! Did they change it, or is it the various guardrails in all the apps and libraries?)

So I realised my tensor stack is very versatile, and all those 7B params could never all be trained! (We have only touched the surface of the capabilities of a 7B model!)
In fact, they cannot be compared to the larger models, which still have not proved their necessity. Why so many parameters? Where is the metric to say, "yes, I have trained all my params to their limits and now I must increase my params"?
When they did not even train all the paradigms or possibilities; they just did a FineWeb! Crawl! Train!

Lol!
They are monkeys playing with neural networks! My personal belief was that AGI was going to come from a coded model, and not a cheating model like a neural net! Obviously not agentic either!

So now I'm a convert to the transformer architecture!