OWL-ViT training on a custom dataset from scratch

Do you have a full OWL-ViT training example with a custom dataset from scratch? I don't understand what to do from the manual:

python -m scenic.projects.owl_vit.main \
  --alsologtostderr=true \
  --workdir=/tmp/training \
  --config=scenic/projects/owl_vit/configs/clip_b32_finetune.py

I have a dataset with 30k images. Do I need a description of each image instead of a simple label? Do I need to do some kind of conversion? Do I need to train a CLIP model first? Could someone explain all the steps needed to train on a custom dataset? Thanks.