Current Status Summary
Repo
- on GitHub
- on Hugging Face - we’ll push from GitHub at the end and add the models
- Workflow: I’m adding everyone as a collaborator on the GitHub repo (send me your username). Since we need to move fast, I suggest “PR + 1 approval from anybody = merge to main”. Small updates (typos, quick bug fixes, README…) may not even need approval, just a heads-up on the Discord
General Architecture
- Seq2Seq
- input is tokenized text (encoded with a text encoder)
- output is a sequence of image tokens (with VQGAN) - rough sketch of the flow below
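To make sure we’re all picturing the same flow, here is a minimal sketch (not final code). BART is just an example seq2seq backbone, the decoder vocab would actually be the VQGAN codebook, and the VQGAN decode step is a placeholder until we settle on the JAX VQGAN class:

```python
# Minimal sketch of the text -> image-token flow (assumptions: BART-style seq2seq,
# decoder vocab = VQGAN codebook, VQGAN decode left as a placeholder).
from transformers import BartTokenizer, FlaxBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("a white cat sitting on a red couch", return_tensors="np")
# In our setup `generate` would produce a fixed-length sequence of image-token ids,
# e.g. 256 ids for a 16x16 grid of VQGAN codes (here it just emits BART text ids).
image_token_ids = model.generate(inputs["input_ids"], max_length=257).sequences

# placeholder: vqgan.decode(image_token_ids) -> (256, 256, 3) pixels
```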
Datasets
- Conceptual 12M data prepared by @greeneggsandyaml
- Conceptual 3M data prepared by @khalidsaifullaah
- YFCC100M: I’m working on creating the OpenAI subset on my local machine (looking good so far, I expect 2 TB max). If it works I’ll try to upload it to datasets for streaming; I created a post to see if that’s feasible
- Can somebody prepare a mini dataset that can easily be shared with others and used for Colab prototyping of the different tasks? (see the sketch below for one possible format)
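For the mini dataset, one possible format (just a sketch, names and URLs are made up): a few hundred (image URL, caption) pairs in a datasets.Dataset that we can save to disk or push to the Hub and pull into Colab in one line.

```python
# Sketch of a small shareable caption dataset (example URLs/captions are placeholders).
from datasets import Dataset

mini = Dataset.from_dict({
    "image_url": ["https://example.com/cat.jpg", "https://example.com/dog.jpg"],
    "caption": ["a cat on a couch", "a dog in the park"],
})
mini.save_to_disk("mini_captions")  # share as a folder, or push to the Hub
# mini.push_to_hub("our-org/mini-captions")  # hypothetical repo name, if our datasets version supports it

# In Colab:
# from datasets import load_from_disk
# ds = load_from_disk("mini_captions")
```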
VQGAN
- there is an existing JAX model
- it needs to be fine-tuned on our dataset
- @lkhphuc is trying to write a JAX training script (no existing one is available)
- alternatively we can use taming-transformers to train on a custom dataset and convert the weights to JAX: I may be able to try it, but any volunteer would be appreciated (on their local GPU or on our TPU VM)
- ideally we need to finish by Friday at the latest so we have at least a week of training for our full model (which gives us time to finalize our scripts in parallel)
- for people working on other tasks, just use a pre-trained model for now (refer to Suraj’s model). This will be our VQGAN if we don’t manage to fine-tune ours in time
Text encoder
- select a base model: non-autoregressive, and check that it handles positional information
- can we find a good pre-trained model that does not need fine-tuning? (I imagine we would freeze it - see the sketch below)
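For illustration, a sketch of what a frozen pre-trained encoder could look like (BERT is just an example candidate, not a decision; freezing here means not updating the encoder params and stopping gradients on its output):

```python
# Sketch: frozen pre-trained text encoder feeding the seq2seq decoder.
import jax
from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = FlaxBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["a white cat on a red couch"], return_tensors="np", padding=True)
hidden = encoder(**inputs).last_hidden_state   # (batch, seq_len, 768)
hidden = jax.lax.stop_gradient(hidden)         # keep the encoder frozen
# `hidden` would then go to the decoder as cross-attention input.
```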
Seq2Seq
- Maybe we can adapt the jax/hybrid-clip scripts - Suraj mentioned their efficient data loading
- data loading logic
- loss definition + hyperparameters (research similar papers) - rough loss sketch below
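My assumption for the loss is the standard seq2seq cross-entropy over the ground-truth VQGAN code indices, same as any LM loss; a rough sketch (shapes and names are assumptions):

```python
# Sketch of the seq2seq loss: cross-entropy of predicted image-token distribution
# vs. ground-truth VQGAN code indices.
import jax
import optax

def loss_fn(logits, labels):
    """logits: (batch, seq_len, codebook_size) decoder outputs
    labels: (batch, seq_len) ground-truth VQGAN code indices"""
    onehot = jax.nn.one_hot(labels, logits.shape[-1])
    per_token = optax.softmax_cross_entropy(logits, onehot)  # (batch, seq_len)
    return per_token.mean()
```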
Demo
- based on how long it takes to generate images, we could sample a few candidates and re-rank them with the existing OpenAI CLIP (sketch after this list)
- create inference function
- it would be cool for our demo to work with Hugging Face widgets (PR in progress)
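For the CLIP re-ranking idea, a sketch using the transformers CLIP port (function name and image format are assumptions):

```python
# Sketch: generate N candidate images for a prompt, score them with OpenAI CLIP,
# keep the best few for the demo.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(prompt, images, top_k=4):
    """`images`: list of PIL images produced by our generator (assumption)."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(-1)  # (num_images,)
    best = scores.argsort(descending=True)[:top_k]
    return [images[i] for i in best.tolist()]
```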
As usual, feel free to choose where you want to help!
Finally, let’s schedule a call with Suraj.
Looking at his calendar, anything after 8 AM Pacific Time would work for me. What would work for you?