DALL-E - mini version

Current Status Summary

Repo

  • on github
  • on huggingface - we’ll push from github at the end + add models
  • Workflow: I’m adding everyone as a collaborator on the GitHub repo (send me your username). As we need to be fast, I suggest “PR + 1 approval from anybody = merge to main branch”. Small updates (typos, quick bug fixes, readme…) may not even need approval; just notify on the Discord

General Architecture

  • Seq2Seq
  • input is tokenized text (with a text encoder)
  • output is a tokenized image (with VQGAN); a minimal sketch of the flow is just below
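
A minimal sketch of that data flow; the tokenizer choice, codebook size, and grid size are placeholders, not decisions:

```python
# Minimal sketch of the seq2seq data flow; names and sizes below are placeholders.
from transformers import AutoTokenizer
import numpy as np

# Text side: a pretrained tokenizer turns the caption into input ids.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder choice
text_ids = tokenizer("a white cat on a red sofa", return_tensors="np").input_ids

# Image side: the VQGAN encoder maps a 256x256 image to a small grid of discrete
# codebook indices, e.g. 16x16 = 256 tokens from a 16384-entry codebook (sizes TBD).
image_tokens = np.random.randint(0, 16384, size=(1, 256))  # stand-in for real codes

# The seq2seq model learns to predict `image_tokens` from `text_ids`; at inference
# the predicted codes are decoded back to pixels with the VQGAN decoder.
```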

Datasets

  • :white_check_mark: Conceptual 12M data prepared by @greeneggsandyaml
  • :white_check_mark: Conceptual 3M data prepared by @khalidsaifullaah :partying_face:
  • :black_square_button: YFCC100M: I’m working on creating the OpenAI subset on my local machine (looking good so far, I expect 2TB max). If it works I’ll try to upload it to datasets for streaming (see the sketch after this list); I created a post to see if it’s feasible
  • :black_square_button: Can somebody prepare a mini dataset that can be easily shared with others and used for colab prototyping of the different tasks?
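
For the streaming idea, this is the kind of usage I have in mind (the repo name and column names are placeholders until something is actually uploaded):

```python
# Sketch of streaming a prepared captions dataset from the hub without downloading
# the whole thing; the dataset name and fields are placeholders.
from datasets import load_dataset

ds = load_dataset("our-org/yfcc100m-openai-subset", split="train", streaming=True)
for example in ds:
    print(example)  # e.g. {"caption": ..., "image": ...}
    break
```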

VQGAN

  • :information_source: there is an existing jax model
  • needs to be finetuned on our dataset
    • :black_square_button: @lkhphuc is trying to make a jax training script (no existing one available)
    • :black_square_button: alternatively we can use taming-transformers to train on a custom dataset and convert the weights to jax; I may be able to try it but any volunteer would be appreciated (on their local GPU or on our TPU VM)
  • ideally we need to finish by Friday at the latest so we have at least a week of training for our full model (which will give us time to finalize our scripts in parallel)
  • for people working on other tasks, just use the pre-trained model for now (refer to Suraj’s model; see the sketch below). This will be our VQGAN if we don’t manage to fine-tune it in time
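
For anyone who just needs image tokens, the flow would look roughly like this; the checkpoint name and the `encode()` call are assumptions on my side, so check Suraj’s repo for the actual interface:

```python
# Rough sketch of using the pretrained jax VQGAN as an image tokenizer.
# The commented import/encode lines are assumptions about the repo's interface.
import numpy as np
from PIL import Image

def load_image(path, size=256):
    # Resize to the VQGAN's training resolution; taming-transformers style models
    # usually expect pixels scaled to [-1, 1].
    img = Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
    return (np.asarray(img, dtype=np.float32) / 127.5 - 1.0)[None]  # (1, 256, 256, 3)

# Assumed interface, to be confirmed against Suraj's model:
# from vqgan_jax.modeling_flax_vqgan import VQModel
# vqgan = VQModel.from_pretrained("<pretrained-vqgan-checkpoint>")
# _, image_token_ids = vqgan.encode(load_image("cat.jpg"))  # codebook indices
```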

Text encoder

  • :black_square_button: select a base model (non-autoregressive) + check it handles positional information properly
  • :black_square_button: can we find a good pre-trained model that does not need fine-tuning? (I imagine we would freeze it; see the sketch below)
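
As an illustration of the frozen-encoder idea (BERT is only an example here, not the selected model):

```python
# Sketch of a frozen pretrained text encoder; BERT is just a placeholder choice
# of a non-autoregressive encoder with learned position embeddings.
import jax
from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = FlaxBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a white cat on a red sofa", return_tensors="np")
hidden = text_encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

# "Freezing" here just means the encoder params never go into the optimizer state;
# stop_gradient makes that explicit if the encoder is called inside the train step.
hidden = jax.lax.stop_gradient(hidden)
```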

Seq2Seq

  • :information_source: Maybe we can adapt jax/hybrid-clip scripts - Suraj mentioned their efficient data loading
  • :black_square_button: data loading logic
  • :black_square_button: loss definition + hyperparameters (research similar papers; a sketch of the loss is below)
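
What I expect the loss to look like (a sketch with placeholder shapes; the codebook size depends on the VQGAN we end up with):

```python
# Sketch of the training loss: standard cross-entropy over the predicted VQGAN
# code indices. Shapes and the codebook size are placeholders.
import jax
import jax.numpy as jnp
import optax

def loss_fn(logits, image_token_ids, codebook_size=16384):
    # logits: (batch, seq_len, codebook_size); image_token_ids: (batch, seq_len)
    one_hot = jax.nn.one_hot(image_token_ids, codebook_size)
    return optax.softmax_cross_entropy(logits, one_hot).mean()

# quick shape check with dummy data
logits = jax.random.normal(jax.random.PRNGKey(0), (2, 256, 16384))
labels = jnp.zeros((2, 256), dtype=jnp.int32)
print(loss_fn(logits, labels))
```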

Demo

  • :black_square_button: depending on how long it takes to generate images, we could sample a few candidates and re-rank them with the existing OpenAI CLIP (see the sketch after this list)
  • :black_square_button: create inference function
  • :black_square_button: it would be cool for our demo to work with huggingface widgets (PR in progress)
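
The re-ranking step could be as simple as this (a sketch using the pretrained CLIP from transformers; `candidate_images` would be the PIL images decoded from our model’s samples):

```python
# Sketch of CLIP re-ranking for the demo: score every sampled image against the
# prompt with pretrained OpenAI CLIP and keep the best ones.
import numpy as np
from transformers import CLIPProcessor, FlaxCLIPModel

clip = FlaxCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(prompt, candidate_images, top_k=4):
    # candidate_images: list of PIL images decoded by the VQGAN
    inputs = processor(text=[prompt], images=candidate_images,
                       return_tensors="np", padding=True)
    scores = np.asarray(clip(**inputs).logits_per_image[:, 0])  # one score per image
    best = np.argsort(-scores)[:top_k]
    return [candidate_images[int(i)] for i in best]
```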

As usual, feel free to choose where you want to help!

Finally, let’s schedule a call with Suraj.
From his calendar, the best for me would be anytime after 8AM Pacific Time. What would work for you?
