DALL-E - mini version

Notes from meeting with Suraj:

  • we should try training the VQGAN on a TPU VM; there's no need for the full YFCC100M subset - let's just use our existing dataset plus the part of YFCC100M that fits on the VM
  • the VQGAN training script uses PyTorch Lightning; if we can update it, we could take advantage of wandb for automatically pushing checkpoints (a recent feature of the callback) - see the logger sketch after this list
  • for the full model, it may be better not to freeze the text encoder, as the pretrained encoder was trained on a different type of data
  • there is an existing Seq2Seq script that we should be able to adapt directly (sketched after this list)
    • we give input ids (the raw text) + output ids (the image encoded by the VQGAN)
    • since the output vocabulary differs from the pretrained model's, the script will initialize the model with random weights, so we need to manually reload the encoder part from a pretrained model
    • we should build the dataset with preprocessed images (already encoded by the VQGAN) so data loading is faster (also sketched below)
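
For the wandb point, here's a minimal sketch of wiring up Lightning's `WandbLogger` so checkpoints get pushed automatically; the project name, monitored metric, and trainer settings are placeholders, not what the VQGAN script actually uses:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

# log_model="all" uploads every checkpoint saved by ModelCheckpoint as a
# W&B artifact during training; log_model=True only uploads at the end.
wandb_logger = WandbLogger(project="dalle-mini-vqgan", log_model="all")
checkpoint_cb = ModelCheckpoint(monitor="val/rec_loss", save_top_k=2)

trainer = pl.Trainer(logger=wandb_logger, callbacks=[checkpoint_cb])
# trainer.fit(model, datamodule=dm)  # model / dm come from the VQGAN codebase
```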
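
For the Seq2Seq adaptation, a rough sketch of what "random weights + manually reload the encoder" could look like with a BART-style model from transformers; the checkpoint name, codebook size, and attribute paths are assumptions on my side:

```python
import torch
import torch.nn as nn
from transformers import BartConfig, BartForConditionalGeneration

TEXT_MODEL = "facebook/bart-base"  # assumed pretrained checkpoint
IMAGE_VOCAB = 16384                # assumed VQGAN codebook size

config = BartConfig.from_pretrained(TEXT_MODEL)
config.tie_word_embeddings = False  # decoder vocab will differ from the encoder's

model = BartForConditionalGeneration(config)  # fresh, randomly initialized

# Manually reload just the encoder from the pretrained model.
pretrained = BartForConditionalGeneration.from_pretrained(TEXT_MODEL)
model.model.encoder.load_state_dict(pretrained.model.encoder.state_dict())

# Point the decoder side at the VQGAN code vocabulary.
model.model.decoder.embed_tokens = nn.Embedding(IMAGE_VOCAB, config.d_model)
model.lm_head = nn.Linear(config.d_model, IMAGE_VOCAB, bias=False)
# The logits bias buffer must match the new vocab size as well.
model.register_buffer("final_logits_bias", torch.zeros((1, IMAGE_VOCAB)))
```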
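
And for the preprocessed-images point, roughly what pre-encoding could look like; `vqgan` stands in for the trained taming-transformers `VQModel`, and the `encode()` return convention is taken from that repo (worth double-checking):

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                       # [0, 1]
    transforms.Lambda(lambda t: 2.0 * t - 1.0),  # VQGAN expects [-1, 1]
])

@torch.no_grad()
def encode_image(vqgan, path):
    """Return the flattened VQGAN codebook indices for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    # taming-transformers VQModel.encode returns (quant, emb_loss, info),
    # where info[2] holds the codebook indices.
    _, _, (_, _, indices) = vqgan.encode(x)
    return indices.reshape(-1).tolist()  # e.g. 16x16 = 256 codes at 256px
```

Storing these code lists alongside the captions (jsonl/arrow) would mean the Seq2Seq dataloader never touches raw pixels.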

Things I forgot to ask:

  • text preprocessing
    • the data has title + description + usertags
    • should we concatenate them all, or keep just the description or the title? (needs exploring)
    • I lean toward keeping just the description, since that's closest to what a user would actually type, or maybe a random mix of all fields (see the sketch below)
    • See example field here
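
On the random-mix idea, a toy sketch of what caption building could look like; the field names and the 80/20 split are made up:

```python
import random

def make_caption(sample):
    """Build one training caption from YFCC100M metadata (assumed field names)."""
    # Mostly keep the description alone (closest to what a user would type).
    if sample.get("description") and random.random() < 0.8:
        return sample["description"]
    # Otherwise, a shuffled concatenation of whatever fields are present.
    parts = [sample.get("title"), sample.get("description")]
    if sample.get("usertags"):
        parts.append(sample["usertags"].replace(",", " "))  # tags are comma-separated
    parts = [p for p in parts if p]
    random.shuffle(parts)
    return " ".join(parts)
```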