Notes from meeting with Suraj:
- we should try training the VQGAN on a TPU VM; no need for the full YFCC100M subset - let’s just use our existing dataset + the part of YFCC100M that fits on the VM
- the script for the VQGAN uses PyTorch Lightning; if we can update it, we could take advantage of wandb for automatically pushing checkpoints (a recent feature of the callback) - sketched below
- for the full model, it may be better not to freeze the text encoder since the pretrained encoder was trained on a different type of data
- there is some existing Seq2Seq script that we should be able to directly adapt
- we give input ids (raw text) + output ids (image encoded by VQGAN)
- since the output is different from the pretrained model's, the weights will be initialized randomly, so we need to manually reload the encoder part from a pretrained model - sketched below
- we should build the dataset with preprocessed images (already encoded with the VQGAN) so that data loading is faster - sketched below
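
Rough sketch of what the Lightning update could look like, assuming the existing script already builds the VQGAN module and dataloader; the project name, checkpoint interval and max_steps are placeholders:

```python
# Hedged sketch: push VQGAN checkpoints to W&B from the PyTorch Lightning script.
# log_model="all" uploads every checkpoint saved by ModelCheckpoint as a W&B artifact.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="vqgan-training", log_model="all")  # placeholder project name
checkpoint_cb = ModelCheckpoint(every_n_train_steps=1000, save_top_k=-1)  # keep all checkpoints

trainer = pl.Trainer(
    logger=wandb_logger,
    callbacks=[checkpoint_cb],
    max_steps=100_000,  # placeholder
)
# trainer.fit(vqgan_module, train_dataloader)  # provided by the existing script
```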
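Minimal sketch of the Seq2Seq adaptation, assuming a BART-style model from transformers; the model name, max length, and the exact way the decoder vocab gets resized to the VQGAN codebook are assumptions, not decisions from the meeting:

```python
# Hedged sketch, not the actual script: text -> encoder input ids, VQGAN codes -> decoder labels,
# plus a manual copy of the pretrained encoder weights into a freshly initialized model.
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizerFast

MODEL_NAME = "facebook/bart-large"  # assumption

tokenizer = BartTokenizerFast.from_pretrained(MODEL_NAME)

def make_example(caption, vqgan_codes):
    """caption: str; vqgan_codes: list[int] from the VQGAN encoder (e.g. 256 ids for a 16x16 grid)."""
    enc = tokenizer(caption, truncation=True, max_length=64)
    return {
        "input_ids": enc["input_ids"],        # raw text for the encoder
        "attention_mask": enc["attention_mask"],
        "labels": vqgan_codes,                # image token ids the decoder must predict
    }

# The output vocabulary differs from the pretrained LM head, so the model is created with
# random weights; only the encoder is then reloaded from the pretrained checkpoint.
# (Resizing the decoder vocab to the VQGAN codebook is a separate step, not shown here.)
pretrained = BartForConditionalGeneration.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration(BartConfig.from_pretrained(MODEL_NAME))  # random init
model.model.encoder.load_state_dict(pretrained.model.encoder.state_dict())
```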
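And a sketch of the pre-encoding step using datasets.map; `vqgan_encode` is a hypothetical wrapper around the trained VQGAN's encode step, and the "imagefolder" layout / column names are assumptions:

```python
# Hedged sketch: pre-encode all images once so training only reads token ids, not pixels.
from datasets import load_dataset

def vqgan_encode(image):
    # hypothetical: run the image through the trained VQGAN encoder + quantizer
    # and return the flat list of codebook indices (e.g. 256 ids for a 16x16 grid)
    raise NotImplementedError

def add_encoding(example):
    example["encoding"] = vqgan_encode(example["image"])
    return example

ds = load_dataset("imagefolder", data_dir="images/")["train"]  # placeholder layout
ds = ds.map(add_encoding, remove_columns=["image"])            # keep only the token ids
ds.save_to_disk("encoded_dataset")
```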
Things I forgot to ask:
- text preprocessing
- data has title + description + usertags
- should we concatenate it all or just keep description or title (need to explore)
- I lean towards either keeping just the description, since that is what a user would typically input, or using a random mix of all fields - sketched below
- See example field here
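
Possible sketch of the "random mix" option to explore, using the field names above; the 50/50 split is an arbitrary placeholder:

```python
# Placeholder sketch for caption construction: half the time just the description,
# half the time a concatenation of whatever fields are present.
import random

def build_caption(example):
    fields = [example.get("title"), example.get("description"), example.get("usertags")]
    fields = [f for f in fields if f]
    if not fields:
        return ""
    if example.get("description") and random.random() < 0.5:
        return example["description"]
    return " ".join(fields)
```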