Creating an object detection dataset from one folder of video frames

Hi all,

I have spent the last few weeks trying to get my head around Hugging Face and its possibilities. I have been unable to figure out how to create a dataset from one folder containing just under 18,000 video frames taken from several videos. I exported the frames using CVAT for videos.
I have one XML file with all the bounding box data, and I want to create a dataset that can then be split into train, validation, and test sets.

Then, hopefully, I can figure out how to use the pipeline() function to fine-tune a model. I am stuck and need some guidance on what to do next.

Thank you.

For others who get stuck: I have now reached the stage where I can begin fine-tuning a pre-trained model by following these steps.

  1. Find the annotation format required by the selected pre-trained model. In my case it was DETR, which needs a single JSON file with all the annotation data in a specific (COCO-style) format, found by looking at DETR (huggingface.co). The first sketch below shows the layout.

  2. Split the images into train, validation, and test datasets, since the Hugging Face Hub only allows 10,000 files per dataset. (The second sketch below shows the split.)

  3. For each split, create a JSON file in the correct format containing that split's frame data. I wrote a bespoke Python script to do this; it looks roughly like the first sketch below.

  4. Write a script to fine-tune the selected pre-trained model. (A minimal sketch is the last one below.)
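
For steps 1 and 3, this is roughly what my conversion script does. It is only a sketch under assumptions about my setup: the usual CVAT-for-videos XML layout of `<track>` elements containing per-frame `<box>` elements, frames all at the same resolution, and made-up names (`annotations.xml`, the `frame_000000.png` filename pattern). Adapt all of those to your own export.

```python
import json
import xml.etree.ElementTree as ET

def cvat_to_coco(xml_path, out_path, frame_ids, width, height):
    """Convert a CVAT-for-videos XML export into one COCO-style JSON
    file covering only the frames listed in frame_ids (one split)."""
    root = ET.parse(xml_path).getroot()

    # Collect the label names and assign each a COCO category id.
    labels = sorted({track.get("label") for track in root.iter("track")})
    cat_ids = {name: i for i, name in enumerate(labels)}

    images = [
        {"id": f, "file_name": f"frame_{f:06d}.png", "width": width, "height": height}
        for f in frame_ids
    ]

    annotations, ann_id = [], 0
    wanted = set(frame_ids)
    for track in root.iter("track"):
        cat = cat_ids[track.get("label")]
        for box in track.iter("box"):
            frame = int(box.get("frame"))
            # Skip frames outside this split and boxes marked not visible.
            if frame not in wanted or box.get("outside") == "1":
                continue
            xtl, ytl = float(box.get("xtl")), float(box.get("ytl"))
            xbr, ybr = float(box.get("xbr")), float(box.get("ybr"))
            w, h = xbr - xtl, ybr - ytl
            annotations.append({
                "id": ann_id,
                "image_id": frame,
                "category_id": cat,
                "bbox": [xtl, ytl, w, h],  # COCO uses [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1

    coco = {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    with open(out_path, "w") as f:
        json.dump(coco, f)
```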
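
For step 2, the split itself can be as simple as shuffling the frame indices and slicing them, then writing one JSON per split with the converter above. The frame count, seed, ratios, and 1920x1080 resolution here are placeholders for my setup:

```python
import random

# Total number of exported frames ("just under 18,000" in my case;
# replace with your actual count).
num_frames = 18000
frames = list(range(num_frames))
random.seed(0)  # make the split reproducible
random.shuffle(frames)

n_train = int(0.8 * num_frames)
n_val = int(0.1 * num_frames)
splits = {
    "train": frames[:n_train],                      # 80%
    "validation": frames[n_train:n_train + n_val],  # 10%
    "test": frames[n_train + n_val:],               # 10%
}

# Write one COCO JSON per split using the converter sketched above.
for name, ids in splits.items():
    cvat_to_coco("annotations.xml", f"{name}.json", ids, width=1920, height=1080)
```

Each split then stays under the 10,000-file limit; copy or move the image files into matching train/validation/test folders before uploading.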
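
For step 4, below is a minimal single-epoch sketch of the kind of fine-tuning script I mean, following the usual transformers recipe for DETR and reading the per-split JSON files produced above. The paths ("train.json", "frames"), label count, batch size, and learning rate are assumptions for my setup, not recommendations.

```python
import json
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import DetrForObjectDetection, DetrImageProcessor

CHECKPOINT = "facebook/detr-resnet-50"
NUM_LABELS = 2  # replace with the number of categories in your JSON

processor = DetrImageProcessor.from_pretrained(CHECKPOINT)

class CocoFrames(Dataset):
    """Serves (pixel_values, labels) pairs from one per-split COCO JSON."""

    def __init__(self, json_path, image_dir):
        with open(json_path) as f:
            coco = json.load(f)
        self.image_dir = image_dir
        self.images = coco["images"]
        # Group the annotations by image id, the shape the processor expects.
        self.anns = {img["id"]: [] for img in self.images}
        for ann in coco["annotations"]:
            self.anns[ann["image_id"]].append(ann)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        info = self.images[idx]
        image = Image.open(f"{self.image_dir}/{info['file_name']}").convert("RGB")
        target = {"image_id": info["id"], "annotations": self.anns[info["id"]]}
        enc = processor(images=image, annotations=target, return_tensors="pt")
        return enc["pixel_values"].squeeze(0), enc["labels"][0]

def collate(batch):
    # Images can end up different sizes after resizing, so pad to a
    # common shape and keep the per-image label dicts as a list.
    padded = processor.pad([pv for pv, _ in batch], return_tensors="pt")
    return {
        "pixel_values": padded["pixel_values"],
        "pixel_mask": padded["pixel_mask"],
        "labels": [labels for _, labels in batch],
    }

loader = DataLoader(
    CocoFrames("train.json", "frames"),
    batch_size=4, shuffle=True, collate_fn=collate,
)

model = DetrForObjectDetection.from_pretrained(
    CHECKPOINT, num_labels=NUM_LABELS, ignore_mismatched_sizes=True
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:  # one epoch; wrap in an outer loop for more
    loss = model(**batch).loss  # DETR computes its own loss from `labels`
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note this is a plain training loop; the same pieces drop into the Trainer API if you prefer it, and pipeline() is then the easy way to run inference with the fine-tuned checkpoint.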