We could train NLP-based models to generate sentence embeddings and then use these embeddings as the conditioning input to a GAN that generates images. It's a pretty simple idea, essentially the reverse of image captioning, but I think it can work quite well given the performance of transformer-based models.
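A minimal sketch of that conditioning step, using NumPy stand-ins (the encoder, dimensions, and function names here are all hypothetical, not from any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 512    # sentence-embedding size (hypothetical, e.g. a transformer's pooled output)
NOISE_DIM = 128  # GAN latent-noise size

def encode_sentence(sentence: str) -> np.ndarray:
    """Stand-in for a real NLP encoder: returns a fixed-size sentence embedding.
    A real transformer encoder would replace this."""
    seed = abs(hash(sentence)) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def generator_input(sentence: str) -> np.ndarray:
    """The usual conditional-GAN trick: concatenate the text embedding with noise
    and feed the result to the generator's first layer."""
    z = rng.standard_normal(NOISE_DIM)
    return np.concatenate([encode_sentence(sentence), z])

x = generator_input("a red bird sitting on a branch")
print(x.shape)  # (640,) = EMB_DIM + NOISE_DIM
```

The same embedding can also be fed to the discriminator so it judges text-image match, not just realism, which is roughly what text-conditional GANs like XMC-GAN do.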
Also, regarding architecture, it seems a simple scaled-up GPT-2 as the LM can give good results (see CogView). By simple, I mean no sparse attention or row/column attention, etc., which I believe DALL-E uses. GPT-2 is already available in JAX, and JAX is way faster on TPU than PyTorch.
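To get a feel for what "scaled up" costs, here is a rough back-of-envelope parameter count (~12·d² per layer for attention + MLP, plus the embedding table; biases, layer norms, and position embeddings ignored; the "big" config below is hypothetical, not CogView's actual numbers):

```python
def gpt2_param_estimate(n_layer: int, d_model: int, vocab_size: int) -> int:
    """Rough transformer parameter count:
    per layer, attention QKV + output proj = 4*d^2 and the MLP = 8*d^2,
    plus the token-embedding table (tied with the output head)."""
    return 12 * n_layer * d_model**2 + vocab_size * d_model

# GPT-2 small config (12 layers, d=768, 50257-token vocab)
small = gpt2_param_estimate(n_layer=12, d_model=768, vocab_size=50257)
# A hypothetical scaled-up variant
big = gpt2_param_estimate(n_layer=24, d_model=2048, vocab_size=50257)
print(f"{small / 1e6:.0f}M -> {big / 1e6:.0f}M")  # 124M -> 1311M
```

The estimate for the small config lands near GPT-2's actual ~124M parameters, which is a sanity check that the formula is in the right ballpark.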
And if you look at this discussion, it seems an even smaller model could give good enough results on a domain-specific dataset.
Also, as suggested in the discussion above, using the VQGAN from taming-transformers as the image tokenizer can further reduce the complexity of training such models: the max image token length for these VQGAN models is 256 (way less than DALL-E's VQVAE, which uses 1024). So overall, 256 text tokens + 256 image tokens = 512, which should be manageable on a single v3-8.
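A sketch of the combined sequence the autoregressive LM would see under that setup (the packing layout here, text tokens followed by image codes with no separator, is an assumption for illustration):

```python
# Hypothetical token budgets for the text-to-image LM input.
TEXT_LEN = 256   # padded/truncated text-token budget
IMAGE_LEN = 256  # VQGAN codes: a 16x16 latent grid -> 16 * 16 = 256 tokens

text_tokens = [0] * TEXT_LEN    # placeholder text token ids
image_tokens = [0] * IMAGE_LEN  # placeholder VQGAN codebook indices
sequence = text_tokens + image_tokens

print(len(sequence))  # 512, vs 256 + 1024 = 1280 with a DALL-E-style 32x32 grid
```

Since attention cost grows quadratically with sequence length, halving the image side of the sequence from 1024 to 256 tokens is where most of the savings comes from.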
Option 1: build our own architecture as suggested by @Vaibhavbrkn (text embeddings → image GAN), maybe with inspiration from DALLE-pytorch and CogView. Could be difficult (I always have trouble training GANs), but a great learning experience too!
Option 2: train XMC-GAN. I imagine this one is actually fully complete, and we would just need to set up the dataset and plug everything in correctly.
I am working as a Computer Vision Research Consultant and can contribute to topics that require knowledge in Vision.
DALL-E is something that is quite new and I would like to be a part of this project that recreates it.
I am a fast learner and a team player.
I have experience working with Python (3 years), PyTorch (1+ years), and version control (2 years).
I have experience in public speaking, along with good communication and writing skills, so I would be able to contribute significantly toward creating the demo.
I am especially interested in this project because it democratizes DALL-E and lowers the barrier to using such models.
Please consider my request to join in this effort. Thanks.