Creating an object detection dataset from one folder of frames from several videos

Hi all,

I have spent the last few weeks trying to get my head around Hugging Face and its possibilities, but I have been unable to figure out how to create a dataset from one folder containing just under 18,000 video frames taken from several videos. I exported the frames using CVAT for video.
I have one XML file with all the bounding-box data, and I want to create a dataset that can then be split into train, test, and validation sets.
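For anyone with the same setup: CVAT's video export stores boxes inside `<track>` elements keyed by frame number. A minimal sketch of reading those into per-frame lists might look like the following (the element and attribute names assume CVAT's video XML layout, and the embedded XML here is a toy example standing in for the real export):

```python
# Sketch: parse a CVAT-for-video XML export into per-frame box lists.
# Assumes CVAT's <track>/<box> layout with xtl/ytl/xbr/ybr corner attributes.
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy stand-in for the real CVAT export file.
CVAT_XML = """<annotations>
  <track id="0" label="car">
    <box frame="0" xtl="10.0" ytl="20.0" xbr="110.0" ybr="80.0" outside="0"/>
    <box frame="1" xtl="12.0" ytl="21.0" xbr="112.0" ybr="81.0" outside="0"/>
  </track>
</annotations>"""

def boxes_per_frame(xml_text):
    root = ET.fromstring(xml_text)
    frames = defaultdict(list)
    for track in root.iter("track"):
        label = track.get("label")
        for box in track.iter("box"):
            if box.get("outside") == "1":  # track not visible in this frame
                continue
            frames[int(box.get("frame"))].append({
                "label": label,
                "bbox": [float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")],
            })
    return dict(frames)

print(boxes_per_frame(CVAT_XML))
```

For the real export you would read the file with `ET.parse(path).getroot()` instead of `ET.fromstring`.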

Then I hope to figure out how to use the pipeline() function to fine-tune a model. I am stuck and need some guidance on what to do next.

Thank you.

For others who get stuck: I have now reached the stage where I can begin fine-tuning a pre-trained model, by following these steps.

  1. Finding the annotation format required by the selected pre-trained model. In my case it was DETR, which required one JSON file with all the data in a specific format that I found by looking at the DETR documentation.

  2. Splitting the images into train, validation, and test datasets, since the Hugging Face Hub only allows 10,000 files per dataset.

  3. For each split, creating a JSON file in the correct format with that split's frame data. I wrote a bespoke Python script to do this.

  4. Writing a script to fine-tune the selected pre-trained model.
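Step 2 can be sketched as a deterministic 80/10/10 split of the frame filenames (the ratios and the fixed seed are my assumptions, not requirements). One caveat worth noting: frames from the same video are highly correlated, so splitting by source video rather than by individual frame avoids leakage between splits.

```python
# Sketch of step 2: reproducible 80/10/10 train/validation/test split.
# A fixed-seed shuffle keeps the split stable across runs.
import random

def split_frames(filenames, seed=0, train=0.8, val=0.1):
    files = sorted(filenames)           # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    return {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }

splits = split_frames([f"frame_{i:06d}.png" for i in range(100)])
print({k: len(v) for k, v in splits.items()})  # {'train': 80, 'validation': 10, 'test': 10}
```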
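For step 3, DETR-style models commonly take COCO-format annotations: one JSON object with `images`, `annotations`, and `categories` arrays, where each box is `[x, y, width, height]`. A hedged sketch of building that dict from per-frame boxes (the `frame_{idx:06d}.png` naming scheme and the label-to-id mapping are placeholder assumptions for illustration):

```python
# Sketch of step 3: convert per-frame corner boxes into a COCO-style dict.
# Input boxes are [xtl, ytl, xbr, ybr] corners; COCO wants [x, y, w, h].
import json

def to_coco(frames, categories):
    # frames: {frame_index: [{"label": str, "bbox": [xtl, ytl, xbr, ybr]}, ...]}
    # categories: {"car": 0, ...}  -- placeholder label-to-id mapping
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in categories.items()],
    }
    ann_id = 0
    for frame_idx, boxes in sorted(frames.items()):
        coco["images"].append({
            "id": frame_idx,
            "file_name": f"frame_{frame_idx:06d}.png",  # placeholder naming scheme
        })
        for box in boxes:
            xtl, ytl, xbr, ybr = box["bbox"]
            w, h = xbr - xtl, ybr - ytl
            coco["annotations"].append({
                "id": ann_id,
                "image_id": frame_idx,
                "category_id": categories[box["label"]],
                "bbox": [xtl, ytl, w, h],  # COCO convention: top-left + size
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco

frames = {0: [{"label": "car", "bbox": [10.0, 20.0, 110.0, 80.0]}]}
print(json.dumps(to_coco(frames, {"car": 0}), indent=2))
```

Writing one such JSON per split, next to that split's image folder, matched what my chosen model's preprocessing expected; check your model card for the exact fields it reads.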