We could train NLP-based models to generate sentence embeddings and then use these embeddings as the conditioning input to a GAN that generates images. It's a pretty simple idea, essentially the reverse of image captioning, but I think it can work quite well given the performance of transformer-based models.
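A minimal sketch of that conditioning step, using NumPy stand-ins (the encoder, dimensions, and function names here are all hypothetical, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 512    # sentence-embedding size (hypothetical, e.g. a transformer's pooled output)
NOISE_DIM = 128  # GAN latent-noise size

def encode_sentence(sentence: str) -> np.ndarray:
    """Stand-in for a real NLP encoder: returns a fixed-size sentence embedding.
    A real transformer encoder would replace this."""
    seed = abs(hash(sentence)) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def generator_input(sentence: str) -> np.ndarray:
    """The usual conditional-GAN trick: concatenate the text embedding with noise
    and feed the result to the generator's first layer."""
    z = rng.standard_normal(NOISE_DIM)
    return np.concatenate([encode_sentence(sentence), z])

x = generator_input("a red bird sitting on a branch")
print(x.shape)  # (640,) = EMB_DIM + NOISE_DIM
```

The same embedding can also be fed to the discriminator so it judges text-image match, not just realism, which is roughly what text-conditional GANs like XMC-GAN do.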
Also, regarding architecture, it seems a simple scaled-up GPT-2 as the LM can give good results (see CogView). By simple, I mean no sparse attention or row/column attention, etc., which I believe DALL-E uses. GPT-2 is already available in JAX, and JAX is way faster on TPU than PyTorch.
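To get a feel for what "scaled up" costs, here is a rough back-of-envelope parameter count (~12·d² per layer for attention + MLP, plus the embedding table; biases, layer norms, and position embeddings ignored; the "big" config below is hypothetical, not CogView's actual numbers):

```python
def gpt2_param_estimate(n_layer: int, d_model: int, vocab_size: int) -> int:
    """Rough transformer parameter count:
    per layer, attention QKV + output proj = 4*d^2 and the MLP = 8*d^2,
    plus the token-embedding table (tied with the output head)."""
    return 12 * n_layer * d_model**2 + vocab_size * d_model

# GPT-2 small config (12 layers, d=768, 50257-token vocab)
small = gpt2_param_estimate(n_layer=12, d_model=768, vocab_size=50257)
# A hypothetical scaled-up variant
big = gpt2_param_estimate(n_layer=24, d_model=2048, vocab_size=50257)
print(f"{small / 1e6:.0f}M -> {big / 1e6:.0f}M")  # 124M -> 1311M
```

The estimate for the small config lands near GPT-2's actual ~124M parameters, which is a sanity check that the formula is in the right ballpark.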
And if you look at this discussion, it seems an even smaller model could give good enough results on a domain-specific dataset.
Also, as suggested in the discussion above, using the VQGAN from taming-transformers as the image tokenizer can further reduce the complexity of training such models: the max image token length for these VQGAN models is 256 (way less than DALL-E's VQVAE, which uses 1024). So overall, 256 text tokens + 256 image tokens = 512, which should be manageable on a single v3-8.
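A sketch of the combined sequence the autoregressive LM would see under that setup (the packing layout here, text tokens followed by image codes with no separator, is an assumption for illustration):

```python
# Hypothetical token budgets for the text-to-image LM input.
TEXT_LEN = 256   # padded/truncated text-token budget
IMAGE_LEN = 256  # VQGAN codes: a 16x16 latent grid -> 16 * 16 = 256 tokens

text_tokens = [0] * TEXT_LEN    # placeholder text token ids
image_tokens = [0] * IMAGE_LEN  # placeholder VQGAN codebook indices
sequence = text_tokens + image_tokens

print(len(sequence))  # 512, vs 256 + 1024 = 1280 with a DALL-E-style 32x32 grid
```

Since attention cost grows quadratically with sequence length, halving the image side of the sequence from 1024 to 256 tokens is where most of the savings comes from.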
Option 1: build our own architecture as suggested by @Vaibhavbrkn (text embeddings → image GAN), maybe with inspiration from DALLE-pytorch and CogView. Could be difficult (I always have trouble training GANs), but a great learning experience too!
Option 2: train XMC-GAN. I imagine this one is actually fully complete, and we would just need to set up the dataset and plug everything in correctly.
I am working as a Computer Vision Research Consultant and can contribute to topics that require knowledge in Vision.
DALL-E is something that is quite new and I would like to be a part of this project that recreates it.
I am a fast learner and a team player.
I have experience working with Python (3 years), PyTorch (1+ years), and version control (2 years).
I have experience in public speaking, along with good communication and writing skills, so I would be able to contribute significantly toward creating the demo.
I am especially interested in this project because it democratizes DALL-E and lowers the barrier to using such models.
Please consider my request to join in this effort. Thanks.