Hey @boris
That’s a great idea, I’m super excited about this!
Regarding the dataset, here are a few that could be used for this:
- Conceptual Captions - ~3.3M image-text pairs
- Conceptual 12M - ~12M image-text pairs
- WIT - ~37M image-text pairs
Also, regarding the architecture: it seems a simple scaled-up GPT-2 as the LM can also give good results (see CogView). By simple, I mean no sparse attention or row/column attention, etc., which I guess DALL-E uses. GPT-2 is already available in JAX, and JAX is way faster than PyTorch on TPU.
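Just to make the idea concrete, here's a rough (untested) sketch of what I mean, using the Flax GPT-2 from `transformers` with the vocab extended to cover an image codebook. All the sizes here (codebook of 16384, 12 layers, etc.) are placeholders, not a proposal:

```python
# Rough sketch (assumptions: Flax GPT-2 from transformers, text vocab + VQGAN
# codebook sharing one embedding table; all sizes below are illustrative).
import jax.numpy as jnp
from transformers import FlaxGPT2LMHeadModel, GPT2Config

config = GPT2Config(
    vocab_size=50257 + 16384,   # text BPE tokens + image codebook tokens
    n_positions=512,            # 256 text + 256 image tokens
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = FlaxGPT2LMHeadModel(config, seed=0)

# Dummy forward pass over a batch of concatenated [text ids | image ids].
dummy_ids = jnp.zeros((2, 512), dtype=jnp.int32)
logits = model(dummy_ids).logits
print(logits.shape)  # (2, 512, vocab_size)
```

So training would just be standard next-token prediction over the concatenated sequence, no custom attention needed.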
And if you look at this discussion, it seems an even smaller model could give good-enough results on a domain-specific dataset.
Also, as suggested in the discussion above, using the VQGAN from taming-transformers as the image tokenizer could further reduce the complexity of training such models: the max image token length for these VQGAN models is 256 (way less than DALL-E's dVAE, which uses 1024). So overall, 256 text tokens + 256 image tokens = 512, which should be manageable on a single v3-8.
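The sequence-length math, just as a quick sketch (the f=16 VQGAN and the 256-token text cap are assumptions on my side):

```python
# Sequence-length budget sketch (assumptions: 256x256 images, an f=16 VQGAN
# from taming-transformers, text capped at 256 BPE tokens).
import jax.numpy as jnp

image_size = 256
downsample = 16                                  # f=16 VQGAN
image_tokens = (image_size // downsample) ** 2   # 16 * 16 = 256 codebook indices
text_tokens = 256                                # assumed cap on text tokens
seq_len = text_tokens + image_tokens             # 512 total, vs 256 + 1024 for DALL-E

# One training example = [text ids | image codebook ids], modeled left-to-right.
text_ids = jnp.zeros((text_tokens,), dtype=jnp.int32)    # placeholder ids
image_ids = jnp.zeros((image_tokens,), dtype=jnp.int32)  # placeholder ids
example = jnp.concatenate([text_ids, image_ids])
assert example.shape == (seq_len,)
```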