How do I create a multi-label image classification dataset?

I want to fine-tune a Vision Transformer on my custom dataset. The model I have in mind is something like this: Multiple Object Detector PASCAL 2007 - a Hugging Face Space by archietram
I already have the images, but I don't understand how to turn them into a dataset that can be pushed to the Hub. Can someone please help me figure this out?
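
For reference, one possible layout is the `imagefolder` convention from the `datasets` library: the images sit in one folder next to a `metadata.jsonl` file with one row per image, a `file_name` field, and any extra columns. The sketch below only writes that file using the standard library. The class names, file names, the `labels` column name, and the multi-hot encoding are all assumptions for illustration, not a confirmed recipe for this use case.

```python
import json
from pathlib import Path

# Hypothetical class names and per-image tags; replace with your own.
CLASSES = ["cat", "dog", "person"]
ANNOTATIONS = {
    "img_001.jpg": ["cat", "person"],
    "img_002.jpg": ["dog"],
}


def multi_hot(tags, classes):
    """Return a 0/1 vector with a 1.0 for every class present in tags."""
    present = set(tags)
    return [1.0 if name in present else 0.0 for name in classes]


def write_metadata(image_dir, annotations, classes):
    """Write metadata.jsonl into image_dir, one JSON object per image."""
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    out_path = image_dir / "metadata.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for file_name, tags in sorted(annotations.items()):
            # "file_name" links the row to an image in the same folder;
            # "labels" is an assumed column name for the multi-hot vector.
            row = {"file_name": file_name, "labels": multi_hot(tags, classes)}
            f.write(json.dumps(row) + "\n")
    return out_path


# Afterwards, with `datasets` installed and after logging in to the Hub:
#   from datasets import load_dataset
#   ds = load_dataset("imagefolder", data_dir="my_images")
#   ds.push_to_hub("your-username/your-dataset")  # hypothetical repo id
```

Calling `write_metadata("my_images", ANNOTATIONS, CLASSES)` would put the file next to the images in a hypothetical `my_images` folder. Storing the labels as a fixed-length 0/1 vector is one choice among several; a list of class names per image would also work if the encoding is done later during preprocessing.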