Notes from meeting with Suraj:
- we should try training the VQGAN on a TPU VM; no need for the full YFCC100M subset - let’s just use our existing dataset + the part of YFCC100M that fits on the VM
- the script for the VQGAN uses PyTorch Lightning; if we can update it, we could take advantage of wandb for automatically pushing checkpoints (a recent feature of the callback) - sketched below
- for the full model, it may be better not to freeze the text encoder since the pretrained encoder was trained on a different type of data
- there is some existing Seq2Seq script that we should be able to directly adapt
- we give input ids (raw text) + output ids (image encoded by VQGAN)
- since the output is different from the pretrained model's, the weights will be initialized randomly, so we need to manually reload the encoder part from a pretrained model - sketched below
- we should build the dataset with preprocessed images (already encoded with the VQGAN) so that data loading is faster - sketched below
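
Rough sketch of what the Lightning update could look like, assuming the existing script already builds the VQGAN module and dataloader; the project name, checkpoint interval and max_steps are placeholders:

```python
# Hedged sketch: push VQGAN checkpoints to W&B from the PyTorch Lightning script.
# log_model="all" uploads every checkpoint saved by ModelCheckpoint as a W&B artifact.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="vqgan-training", log_model="all")  # placeholder project name
checkpoint_cb = ModelCheckpoint(every_n_train_steps=1000, save_top_k=-1)  # keep all checkpoints

trainer = pl.Trainer(
    logger=wandb_logger,
    callbacks=[checkpoint_cb],
    max_steps=100_000,  # placeholder
)
# trainer.fit(vqgan_module, train_dataloader)  # provided by the existing script
```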
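Minimal sketch of the Seq2Seq adaptation, assuming a BART-style model from transformers; the model name, max length, and the exact way the decoder vocab gets resized to the VQGAN codebook are assumptions, not decisions from the meeting:

```python
# Hedged sketch, not the actual script: text -> encoder input ids, VQGAN codes -> decoder labels,
# plus a manual copy of the pretrained encoder weights into a freshly initialized model.
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizerFast

MODEL_NAME = "facebook/bart-large"  # assumption

tokenizer = BartTokenizerFast.from_pretrained(MODEL_NAME)

def make_example(caption, vqgan_codes):
    """caption: str; vqgan_codes: list[int] from the VQGAN encoder (e.g. 256 ids for a 16x16 grid)."""
    enc = tokenizer(caption, truncation=True, max_length=64)
    return {
        "input_ids": enc["input_ids"],        # raw text for the encoder
        "attention_mask": enc["attention_mask"],
        "labels": vqgan_codes,                # image token ids the decoder must predict
    }

# The output vocabulary differs from the pretrained LM head, so the model is created with
# random weights; only the encoder is then reloaded from the pretrained checkpoint.
# (Resizing the decoder vocab to the VQGAN codebook is a separate step, not shown here.)
pretrained = BartForConditionalGeneration.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration(BartConfig.from_pretrained(MODEL_NAME))  # random init
model.model.encoder.load_state_dict(pretrained.model.encoder.state_dict())
```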
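And a sketch of the pre-encoding step using datasets.map; `vqgan_encode` is a hypothetical wrapper around the trained VQGAN's encode step, and the "imagefolder" layout / column names are assumptions:

```python
# Hedged sketch: pre-encode all images once so training only reads token ids, not pixels.
from datasets import load_dataset

def vqgan_encode(image):
    # hypothetical: run the image through the trained VQGAN encoder + quantizer
    # and return the flat list of codebook indices (e.g. 256 ids for a 16x16 grid)
    raise NotImplementedError

def add_encoding(example):
    example["encoding"] = vqgan_encode(example["image"])
    return example

ds = load_dataset("imagefolder", data_dir="images/")["train"]  # placeholder layout
ds = ds.map(add_encoding, remove_columns=["image"])            # keep only the token ids
ds.save_to_disk("encoded_dataset")
```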
Things I forgot to ask:
- text preprocessing
- data has title + description + usertags
- should we concatenate it all or just keep description or title (need to explore)
- I lean towards either keeping just the description, since that is what a user would typically input, or using a random mix of all fields - sketched below
- See example field here
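
Possible sketch of the "random mix" option to explore, using the field names above; the 50/50 split is an arbitrary placeholder:

```python
# Placeholder sketch for caption construction: half the time just the description,
# half the time a concatenation of whatever fields are present.
import random

def build_caption(example):
    fields = [example.get("title"), example.get("description"), example.get("usertags")]
    fields = [f for f in fields if f]
    if not fields:
        return ""
    if example.get("description") and random.random() < 0.5:
        return example["description"]
    return " ".join(fields)
```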