DALL-E - mini version

Project: Recreate a mini-version of DALL-E

Useful links:

Original post:

It would be cool to recreate something like DALL-E.

I can see some challenges though:

  • difficult paper though the main concepts could be simplified
  • availability/collection of dataset
  • the model used was GPT-3, not sure if interesting results can be obtained with smaller models
  • may require more time to implement

Also, I don’t believe a public version is available, so maybe it will be easier to just do it in PyTorch (I have no experience yet with Flax/JAX).


We could train NLP-based models to generate embeddings for sentences and then use these embeddings as the input embedding for a GAN model to generate images. This is a pretty simple idea, just image captioning in reverse, but I think it can work pretty well given the performance of transformer-based models.
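To make the idea concrete, here's a tiny toy sketch (plain NumPy; all names, dimensions, and the "encoder" are made up) of conditioning a generator on a sentence embedding plus noise, the way a text-conditional GAN would:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, NOISE_DIM, IMG_SIDE = 512, 64, 16  # made-up sizes

# Stand-in for a pretrained sentence encoder; in practice this would
# be e.g. a transformer's pooled output, not a toy hash of the text.
def encode_sentence(sentence: str) -> np.ndarray:
    seed = sum(ord(c) for c in sentence)  # deterministic toy "embedding"
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

# Toy "generator": one random linear layer mapping
# [sentence embedding ; noise] -> a flattened image.
W = rng.standard_normal((EMB_DIM + NOISE_DIM, IMG_SIDE * IMG_SIDE))

def generate(sentence: str) -> np.ndarray:
    z = rng.standard_normal(NOISE_DIM)            # GAN latent noise
    cond = np.concatenate([encode_sentence(sentence), z])
    img = np.tanh(cond @ W)                       # pixel values in [-1, 1]
    return img.reshape(IMG_SIDE, IMG_SIDE)

img = generate("an armchair in the shape of an avocado")
print(img.shape)  # (16, 16)
```

A real version would train the generator adversarially against a discriminator that also sees the sentence embedding; this only shows the conditioning plumbing.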


That’s a great idea. Can I join?


Hey @boris

That’s a great idea, I’m super excited about this!

Regarding the dataset, here are a few that can be used for this

Also regarding architecture, it seems using a simple scaled-up GPT-2 as the LM can also give good results (see CogView). By simple, I mean no sparse attention or row/column attention, etc., which is used in DALL-E, I guess. And GPT-2 is already available in JAX. And JAX is way faster on TPU than PT :wink:

And if you look at this discussion, it seems an even smaller model could give good enough results on a domain-specific dataset.

Also, as suggested in the discussion above, using the VQGAN from taming-transformers as the image tokenizer can further reduce the complexity of training such models, since the max image token length for these VQGAN models is 256 (way less than DALL-E’s VQVAE, which uses 1024). So overall, 256 text tokens + 256 image tokens = 512, which should be manageable on a single v3-8.
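A quick back-of-envelope check of those numbers (a sketch; the downsampling factors are my assumption of where the 256 vs. 1024 token counts come from):

```python
# Back-of-envelope token counts from the discussion above.
# Assumption: VQGAN (taming-transformers, downsampling factor 16)
# turns a 256x256 image into a 16x16 grid of codes, while DALL-E's
# tokenizer (factor 8) produces a 32x32 grid.
vqgan_image_tokens = (256 // 16) ** 2   # 256
dalle_image_tokens = (256 // 8) ** 2    # 1024
text_tokens = 256

seq_len = text_tokens + vqgan_image_tokens
print(seq_len)  # 512

# Self-attention cost grows roughly quadratically with sequence
# length, so the shorter image code cuts attention FLOPs by about:
print((text_tokens + dalle_image_tokens) ** 2 / seq_len ** 2)  # 6.25
```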


Those are really great ideas @Vaibhavbrkn @valhalla (and thanks for the datasets)!
@raghav66 Anybody can join!

Also I noticed lucidrains has a pretty awesome repo that should help so we can pull some items from it as well: DALLE-pytorch


Also I noticed lucidrains has a pretty awesome repo that should help so we can pull some items from it as well: DALLE-pytorch

Yeah, the linked discussion and the VQGAN idea are from the DALLE-pytorch repo :grin:

Also, this just dropped: GitHub - google-research/xmcgan_image_generation


I would love to try this as well.


Let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

Great idea, would love to join!


Also, we could try using GPT-J, a 6B-parameter version of GPT-3. There is a PR for an HF port open at the moment, too.

Thank you @boris for accepting more people into the project. I am very interested in joining as well. Thanks.


Added you to the team :slight_smile:

I would love to be involved with this!

Yeah, we could use GPT-J, but we might need to choose a smaller model (maybe 1B?) due to time and compute constraints.

The HF PR is for PT; adding a JAX version will take some time. But it’s somewhat similar to GPT-2, which is already available in Flax, so we could use that.

I wrote a JAX version of VQGAN which can be used as the image tokenizer.
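For anyone unfamiliar with what the VQGAN tokenizer does, here's a toy NumPy sketch of the core step (all sizes made up, no real model involved): the encoder's feature grid is snapped to the nearest codebook entries, and those indices are the image tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 1024-entry codebook, 8-dim codes, 16x16 feature grid.
CODEBOOK_SIZE, CODE_DIM, GRID = 1024, 8, 16

codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))

# Stand-in for the CNN encoder's output on one image.
features = rng.standard_normal((GRID * GRID, CODE_DIM))

# Nearest-neighbour lookup: one discrete token id per grid cell.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)

print(tokens.shape)  # (256,) -> the 256 image tokens mentioned above
```

The real VQGAN also has a decoder that maps tokens back to pixels, which is what lets the LM's sampled tokens become an image.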


Nice! Yeah, I was unsure about the speed of GPT-J; at least it’s already in JAX. Curious to see how much a TPU v3-8 can handle!

I created a Discord channel here.

Let’s try to finalize the details. I tried to summarize all the ideas from the discussion:

  • task: image generation from text

  • model

    • image encoder: JAX version of VQGAN (@valhalla I assume it needs to be trained?)
    • text encoder: JAX GPT2
    • model/GAN structure
      • Option 1: build our own structure as suggested by @Vaibhavbrkn (text embeddings → image GAN) + maybe inspiration from DALLE-pytorch and CogView. Can be difficult (I always have trouble training GANs) but a great learning experience too!
      • Option 2: train XMC-GAN, I imagine this one is actually fully complete and we would just need to set up the dataset and plug everything in correctly.
  • dataset (thanks @valhalla):

  • training script: TODO

  • outcome: a model that accepts a sentence and generates a realistic image from it
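To spell out the training setup this summary implies, here's a toy sketch of how one training example could be laid out (NumPy; the vocabulary sizes are assumptions: GPT-2's 50257 BPE tokens and a 1024-entry VQGAN codebook):

```python
import numpy as np

# The LM sees [text tokens][image tokens] as one sequence and is
# trained on next-token prediction, so at inference the image tokens
# are sampled autoregressively after the text prompt.
TEXT_LEN, IMAGE_LEN = 256, 256
TEXT_VOCAB, IMAGE_VOCAB = 50257, 1024  # assumed vocabulary sizes

rng = np.random.default_rng(0)
text = rng.integers(0, TEXT_VOCAB, TEXT_LEN)    # placeholder caption ids
image = rng.integers(0, IMAGE_VOCAB, IMAGE_LEN) # placeholder VQGAN ids

# Shift image ids past the text vocabulary so the two token types
# can share one embedding table without colliding.
seq = np.concatenate([text, image + TEXT_VOCAB])

# Standard LM targets: inputs are seq[:-1], labels are seq[1:].
inputs, labels = seq[:-1], seq[1:]
print(len(seq), inputs.shape, labels.shape)  # 512 (511,) (511,)
```

This matches the 512-token budget discussed above; whether to share one embedding table or use separate ones is a design choice to settle later.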

Also feel free to pick a specific area you would want to focus on!

On my side I can also help instrument the training with W&B so we can visualize metrics and some sample images generated live.


I would really like to join this project.

  • I am working as a Computer Vision Research Consultant and can contribute to topics that require knowledge of vision.
  • DALL-E is quite new, and I would like to be part of this project that recreates it.
  • I am a fast learner and a team player.
  • I have experience working with Python (3 years), PyTorch (1+ year), and version control (2 years).
  • I have experience in public speaking, and good communication and writing skills. With this, I will be able to contribute significantly towards creating the demo.

I am especially interested because this democratizes DALL-E and lowers the barrier to using the model; I have a special interest in such projects.

Please consider my request to join in this effort. Thanks. :slight_smile:


I would love to participate in this project as well. Please let me know what I can do to help :slightly_smiling_face:


@patrickvonplaten will you please officially add me? :hugs: (and also add to the Sheets document)

Received confirmation from @boris on Discord.


@patrickvonplaten I would also like to be officially added, thanks!


I would like to join this project as well, if there’s still an open slot. @valhalla @boris