Cross-posting from here because I didn't realize there was a separate forum before posting there. Hopefully not a big deal.
Hello all. I’ve been playing with this for a few weeks now and formulating my thoughts into words as best as my brain will allow me to do. Boris suggested I post on these forums for input, so here I am.
DALL·E and dalle-mini inspired me in a big way. I've had an affinity for language for years, and DALL·E helped me connect some dots, I think. Let me start with some suggestions, since that's probably the most valuable thing I can offer, and I like to lead with offerings of value.
dalle-mini's understanding of the semantic content encoded in language can be readily improved in a couple of ways, and I'm going to try to explain why. One way is by feeding it video input. Video would let it approximate the functions associated with verbs and other words that have a higher dimensionality to them. Secondly, because it has been trained against image captions, it understands only a small subset of the grammatical constructs available in English. That is to say, it performs best when given input that reads like an image caption. If you emulate that caption-like grammar, you can successfully encode semantic meaning into language it can process.
Its understanding of English grammar could be improved by giving it a wider variety of language with a variety of cases; however, to learn the temporal aspects of language, it will need input with a temporal dimension to it: video. I don't see how past tense and future tense could be learned any other way. There's probably no good dataset yet on which to do such training, but that's not the end of the world. Improving its understanding of grammar makes it easier for us to give it prompts it will understand, but it does not increase the total number of possible scenes it can render. That number is limited by its understanding of the underlying concepts, the words, and how they combine coherently, which is what it is currently being trained to do. So how do we measure its understanding of a given concept?
Take the word 'red' for instance. Red is a color, but the word is connected to many abstract ideas, isn't it? Not just one shade, but a large swath of colors all fall under the idea 'red'. Red is a two-dimensional word in that sense. When you give the model the word red, if it has a good understanding of the word, its output should show an understanding of that complexity: a mixture of many different shades of red. When its understanding is poor, you get only a few shades. You can test that, and I have. Go ahead and try it. On some executions it demonstrates a good understanding of red, and on others a worse one.
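To make that test less eyeball-based, here's a minimal sketch of the kind of metric I mean (my own toy measure, not anything built into dalle-mini): quantize the output image's pixels into coarse color buckets and count how many distinct shades survive.

```python
# Toy diversity metric: given a list of (R, G, B) pixels from a generated
# image, count distinct coarse color buckets. An output that collapsed
# 'red' to one shade scores 1; a richer output scores higher.

def shade_diversity(pixels, bucket_size=32):
    """Count distinct coarse color buckets among the given pixels."""
    buckets = {(r // bucket_size, g // bucket_size, b // bucket_size)
               for r, g, b in pixels}
    return len(buckets)

# A "collapsed" output: every pixel the same red.
flat = [(200, 30, 30)] * 100
# A richer output: a spread of red shades.
varied = [(r, 20, 20) for r in range(64, 256, 8)]

print(shade_diversity(flat))    # → 1
print(shade_diversity(varied))  # → 6
```

The bucket size is arbitrary; the point is just to have a repeatable number for "how many shades of red did the model actually draw" instead of judging by eye.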
Words can have more than two dimensions, though. They can have three, like 'dog': a dog is a three-dimensional object. Words can also have four dimensions. 'Run', for instance, encodes information across four dimensions, the fourth being time. Some words go even further, like 'hypercube', which encodes information about four or more spatial dimensions.
We can use grammar to narrow down words with high ambiguity. Instead of red, we can say dark red, or very dark red. Each modifier narrows us closer to a specific idea. For words with higher dimensionality, ambiguity is very high, and we need to reduce it through training. Grok it?
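As a toy illustration of that narrowing (my own framing with made-up ranges, not how dalle-mini actually represents words), you can think of each modifier as shrinking the interval of colors a phrase can mean:

```python
# Treat bare "red" as a wide range of red-channel values, and each
# modifier as a constraint that shrinks that range. The ranges and
# halving rule are invented for illustration only.

RED = (150, 255)  # assumed plausible red-channel range for bare "red"

def apply_modifier(rng, modifier):
    """Shrink a (lo, hi) range toward its darker end per the modifier."""
    lo, hi = rng
    if modifier == "dark":
        return (lo, lo + (hi - lo) // 2)   # keep the darker half
    if modifier == "very dark":
        return (lo, lo + (hi - lo) // 4)   # keep the darkest quarter
    return rng

print(apply_modifier(RED, "dark"))       # → (150, 202)
print(apply_modifier(RED, "very dark"))  # → (150, 176)
```

Each added word leaves fewer possible referents, which is exactly what a caption-style prompt is doing for the model.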
We can already ask dalle-mini to output representations of abstract concepts, like red or blue, to verify whether it has a good understanding of the higher-dimensional components of those concepts. We should select for output that shows that understanding and reject output that collapses the semantic meaning into lower dimensions. A single shade of red for the word 'red' is wrong because it does not capture the dimensional complexity the word encodes.
One last thing: I would love it if someone could help me with an experiment. We should control the random seed and increment it progressively to see exactly how each frame differs from the prior one. I suspect each frame will be very similar to the last, with small changes, wibbly-wobbly like a dream interpolating between the possible representations of the total semantic meaning encoded in the language.
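The protocol itself is simple. Here's a sketch of the seed sweep with a stand-in `render_frame` function (hypothetical; a real run would call dalle-mini's generation code with the seed wired into its PRNG, which for the JAX implementation means building the key from the seed):

```python
import random

def render_frame(prompt, seed):
    # Stand-in for a dalle-mini call. The point being demonstrated is
    # only the protocol: with the seed fixed, the frame is fully
    # reproducible, so consecutive seeds give a repeatable sweep.
    rng = random.Random(seed)
    return [rng.random() for _ in range(4)]

# Sweep seeds 0, 1, 2, ... and keep every frame for comparison.
frames = [render_frame("a red cube", seed) for seed in range(3)]

# Re-running any seed reproduces its frame exactly:
assert render_frame("a red cube", 0) == frames[0]
```

Whether adjacent seeds really produce near-identical frames is the open question; the sweep above is just the harness that would let us check frame-to-frame differences systematically.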
Looking forward to input on this! Thank you.