Load a COCO-format dataset from disk for DETR

I have a COCO dataset on my disk (with a JSON file in the annotations folder that contains the image paths) and I would like to load it as a Hugging Face dataset in order to use CV models.

Is there a function that allows that?

Hmm… This?

There is no COCO loader in the datasets library, but it would be a welcome contribution in my opinion.

All the existing data modules are listed here
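
In the meantime, the generic json builder can at least read one top-level key of the nested COCO annotation file via its field argument. A minimal sketch (the path is a placeholder, adjust it to your disk):

from datasets import load_dataset

# Read only the "annotations" array of a COCO instances file; "field"
# selects one top-level key of the nested JSON ("images", "annotations", ...).
annotations = load_dataset(
    "json",
    data_files="/path/to/coco/annotations/instances_train.json",
    field="annotations",
)
print(annotations["train"][0])  # one raw COCO annotation record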

I wrote this code for loading COCO datasets into Hugging Face datasets; it works with DETR.

Things you will need to adapt:

  • the features of your COCO JSON file
  • the path to your local COCO folder
import json
import os

# Requires the datasets library (pip install datasets)
from datasets import DatasetDict, Dataset, Features, Value, Sequence, ClassLabel, Image

class CocoDatasetLoader:
    def __init__(self, coco_folder):
        self.coco_folder = coco_folder

    def group_by_key_id(self, data, key_id, category_id_to_index):
        """
        Groups data by a specified key and maps category IDs to indices.
        
        Args:
            data (list): List of dictionaries containing the data.
            key_id (str): The key to group by.
            category_id_to_index (dict): Mapping from category IDs to indices.
            
        Returns:
            dict: Grouped data.
        """
        grouped_data = {}
        for item in data:
            key_value = item[key_id]
            if key_value not in grouped_data:
                grouped_data[key_value] = {k: [] for k in item.keys() if k != key_id}
            for k, v in item.items():
                if k != key_id:
                    grouped_data[key_value][k].append(v)
        # Map COCO category IDs to contiguous label indices once per group
        for group in grouped_data.values():
            group['category'] = [category_id_to_index[x] for x in group['category_id']]
        return grouped_data
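    # Illustrative example (hypothetical values): given
    #   data = [{'image_id': 1, 'id': 10, 'category_id': 3, 'bbox': [0, 0, 5, 5]},
    #           {'image_id': 1, 'id': 11, 'category_id': 7, 'bbox': [1, 1, 2, 2]}]
    #   category_id_to_index = {3: 0, 7: 1}
    # group_by_key_id(data, 'image_id', category_id_to_index) returns
    #   {1: {'id': [10, 11], 'category_id': [3, 7],
    #        'bbox': [[0, 0, 5, 5], [1, 1, 2, 2]], 'category': [0, 1]}}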
    
    def load_coco_hf_dataset(self, split):
        """
        Loads COCO dataset and processes it into a format suitable for Hugging Face datasets.
        
        Args:
            split (str): Dataset split (e.g., 'Train', 'Test', 'Validation').
            
        Returns:
            Dataset: Hugging Face Dataset for the split, or None if the annotation file is missing.
        """
        # Load the annotation JSON for this split
        json_file_path = os.path.join(self.coco_folder, 'annotations', f'instances_{split}.json')
        try:
            with open(json_file_path, 'r') as f:
                coco_data = json.load(f)
        except FileNotFoundError:
            print(f"File not found: {json_file_path}")
            return None

        # Extract category names and create a mapping from category IDs to indices
        category_names = [cat['name'] for cat in coco_data['categories']]
        category_id_to_index = {cat['id']: idx for idx, cat in enumerate(coco_data['categories'])}

        # Group annotations by 'image_id'
        grouped_annotations = self.group_by_key_id(coco_data['annotations'], 'image_id', category_id_to_index)

        # Create a dictionary of images
        grouped_images = {item['id']: item for item in coco_data['images']}

        # Initialize 'objects' field for every image (empty lists by default)
        annotations_keys = next(iter(grouped_annotations.values())).keys()
        for image in grouped_images.values():
            image['objects'] = {key: [] for key in annotations_keys}

        # Populate 'objects' field with the grouped annotations
        for image_id, annotations in grouped_annotations.items():
            grouped_images[image_id]['objects'] = annotations

        # Add image paths and IDs
        for image in grouped_images.values():
            image['image'] = os.path.join(self.coco_folder, 'images', split, image['file_name'])
            image['image_id'] = image['id']

        # Create a Hugging Face dataset from the custom data using from_list for efficiency
        hf_dataset = Dataset.from_list(list(grouped_images.values()))

        # Define the features for the main dataset
        features = Features({
            'id': Value('int64'),
            'image_id': Value('int64'),
            'image': Image(),
            'file_name': Value('string'),
            'license': Value('string'),
            'flickr_url': Value('string'),
            'coco_url': Value('string'),
            'date_captured': Value('string'),
            'width': Value('int64'),
            'height': Value('int64'),
            'objects': Sequence({
                'id': Value('int64'),
                'area': Value('float32'),
                'bbox': Sequence(Value('float32')),
                'category': ClassLabel(names=category_names),
                'attributes': {'occluded': Value('bool')},
                'category_id': Value('int64'),
                'iscrowd': Value('int64'),
                'segmentation': {
                    'counts': Sequence(Value('int64')),
                    'size': Sequence(Value('int64'))
                }
            })
        })

        # Cast the features for the Hugging Face dataset
        hf_dataset = hf_dataset.cast(features)

        return hf_dataset

# Initialize the CocoDatasetLoader class
coco_loader = CocoDatasetLoader('/path/to/coco/folder/')

hf_dataset_dict = DatasetDict()
for split in ['Train', 'Test', 'Validation']:
    # Load the COCO dataset for each split
    hf_dataset = coco_loader.load_coco_hf_dataset(split)
    if hf_dataset is None:
        continue  # skip splits whose annotation file is missing

    # Print the dataset
    print(f"Dataset for {split} split:")
    print(hf_dataset)
    
    # Create a DatasetDict with the split
    hf_dataset_dict[split.lower()] = hf_dataset
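
Once the splits are loaded, the grouped objects still need to be repacked into the per-image annotation dicts that DETR's image processor expects. A minimal sketch, assuming the facebook/detr-resnet-50 checkpoint and the hf_dataset_dict built above (format_for_detr is a hypothetical helper, not a library function):

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")

def format_for_detr(examples):
    # Repack each image's grouped annotation lists into COCO-style dicts
    targets = []
    for image_id, objects in zip(examples["image_id"], examples["objects"]):
        annotations = [
            {"image_id": image_id, "bbox": bbox, "category_id": category,
             "area": area, "iscrowd": iscrowd}
            for bbox, category, area, iscrowd in zip(
                objects["bbox"], objects["category"],
                objects["area"], objects["iscrowd"])
        ]
        targets.append({"image_id": image_id, "annotations": annotations})
    # The processor resizes and normalizes the images and converts the
    # boxes into the format DETR trains on
    return processor(images=examples["image"], annotations=targets, return_tensors="pt")

train_dataset = hf_dataset_dict["train"].with_transform(format_for_detr)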
