Current Status Summary
Repo
- on GitHub
- on Hugging Face - we’ll push from GitHub at the end and add the models
- Workflow: I’m adding everyone as a collaborator on the GitHub repo (send me your username). Since we need to move fast, I suggest “PR + 1 approval from anybody = merge to main”. Small updates (typos, quick bug fixes, README…) may not even need approval, just a heads-up on the Discord
General Architecture
- Seq2Seq
- input is tokenized text (encoded with a text encoder)
- output is a sequence of image tokens (with VQGAN) - rough sketch of the flow below
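To make sure we’re all picturing the same flow, here is a minimal sketch (not final code). BART is just an example seq2seq backbone, the decoder vocab would actually be the VQGAN codebook, and the VQGAN decode step is a placeholder until we settle on the JAX VQGAN class:

```python
# Minimal sketch of the text -> image-token flow (assumptions: BART-style seq2seq,
# decoder vocab = VQGAN codebook, VQGAN decode left as a placeholder).
from transformers import BartTokenizer, FlaxBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = FlaxBartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("a white cat sitting on a red couch", return_tensors="np")
# In our setup `generate` would produce a fixed-length sequence of image-token ids,
# e.g. 256 ids for a 16x16 grid of VQGAN codes (here it just emits BART text ids).
image_token_ids = model.generate(inputs["input_ids"], max_length=257).sequences

# placeholder: vqgan.decode(image_token_ids) -> (256, 256, 3) pixels
```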
Datasets
- Conceptual 12M data prepared by @greeneggsandyaml
- Conceptual 3M data prepared by @khalidsaifullaah
- YFCC100M: I’m working on creating the OpenAI subset on my local machine (looking good so far, I expect 2 TB max). If it works I’ll try to upload it to datasets for streaming; I created a post to see if that’s feasible
- Can somebody prepare a mini dataset that can easily be shared with others and used for Colab prototyping of the different tasks? (see the sketch below for one possible format)
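For the mini dataset, one possible format (just a sketch, names and URLs are made up): a few hundred (image URL, caption) pairs in a datasets.Dataset that we can save to disk or push to the Hub and pull into Colab in one line.

```python
# Sketch of a small shareable caption dataset (example URLs/captions are placeholders).
from datasets import Dataset

mini = Dataset.from_dict({
    "image_url": ["https://example.com/cat.jpg", "https://example.com/dog.jpg"],
    "caption": ["a cat on a couch", "a dog in the park"],
})
mini.save_to_disk("mini_captions")  # share as a folder, or push to the Hub
# mini.push_to_hub("our-org/mini-captions")  # hypothetical repo name, if our datasets version supports it

# In Colab:
# from datasets import load_from_disk
# ds = load_from_disk("mini_captions")
```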
VQGAN
- there is an existing JAX model
- it needs to be fine-tuned on our dataset
- @lkhphuc is trying to write a JAX training script (no existing one is available)
- alternatively we can use taming-transformers to train on a custom dataset and convert the weights to JAX: I may be able to try it, but any volunteer would be appreciated (on their local GPU or on our TPU VM)
- ideally we need to finish by Friday at the latest so we have at least a week of training for our full model (which gives us time to finalize our scripts in parallel)
- for people working on other tasks, just use a pre-trained model for now (refer to Suraj’s model). This will be our VQGAN if we don’t manage to fine-tune ours in time
Text encoder
- select a base model: non-autoregressive, and check that it handles positional information
- can we find a good pre-trained model that does not need fine-tuning? (I imagine we would freeze it - see the sketch below)
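For illustration, a sketch of what a frozen pre-trained encoder could look like (BERT is just an example candidate, not a decision; freezing here means not updating the encoder params and stopping gradients on its output):

```python
# Sketch: frozen pre-trained text encoder feeding the seq2seq decoder.
import jax
from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = FlaxBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["a white cat on a red couch"], return_tensors="np", padding=True)
hidden = encoder(**inputs).last_hidden_state   # (batch, seq_len, 768)
hidden = jax.lax.stop_gradient(hidden)         # keep the encoder frozen
# `hidden` would then go to the decoder as cross-attention input.
```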
Seq2Seq
- Maybe we can adapt the jax/hybrid-clip scripts - Suraj mentioned their efficient data loading
- data loading logic
- loss definition + hyperparameters (research similar papers) - rough loss sketch below
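My assumption for the loss is the standard seq2seq cross-entropy over the ground-truth VQGAN code indices, same as any LM loss; a rough sketch (shapes and names are assumptions):

```python
# Sketch of the seq2seq loss: cross-entropy of predicted image-token distribution
# vs. ground-truth VQGAN code indices.
import jax
import optax

def loss_fn(logits, labels):
    """logits: (batch, seq_len, codebook_size) decoder outputs
    labels: (batch, seq_len) ground-truth VQGAN code indices"""
    onehot = jax.nn.one_hot(labels, logits.shape[-1])
    per_token = optax.softmax_cross_entropy(logits, onehot)  # (batch, seq_len)
    return per_token.mean()
```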
Demo
- based on how long it takes to generate images, we could sample a few candidates and re-rank them with the existing OpenAI CLIP (sketch after this list)
- create inference function
- it would be cool for our demo to work with Hugging Face widgets (PR in progress)
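For the CLIP re-ranking idea, a sketch using the transformers CLIP port (function name and image format are assumptions):

```python
# Sketch: generate N candidate images for a prompt, score them with OpenAI CLIP,
# keep the best few for the demo.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(prompt, images, top_k=4):
    """`images`: list of PIL images produced by our generator (assumption)."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(-1)  # (num_images,)
    best = scores.argsort(descending=True)[:top_k]
    return [images[i] for i in best.tolist()]
```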
As usual, feel free to choose where you want to help!
Finally, let’s schedule a call with Suraj.
Looking at his calendar, anything after 8 AM Pacific Time would work for me. What would work for you?