Dataset features: dictionary of lists vs. list of dictionaries

I was looking at existing example datasets and tutorials, and most datasets use a dictionary of lists for their annotations. In the object detection case, for example, most datasets use the following format for each image:

* `image`: PIL.Image.Image object containing the image.
* `image_id`: The image ID.
* `height`: The image height.
* `width`: The image width.
* `objects`: A dictionary containing bounding box metadata for the objects in the image:
  * `id`: The annotation id.
  * `area`: The area of the bounding box.
  * `bbox`: The object’s bounding box (in the [coco](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/#coco) format).
  * `category`: The object’s category.
{
    'image': <PIL.Image.Image>,
    'image_id': 1,
    'height': 480,
    'width': 640,
    'objects': {
        'id': [1, 2],
        'area': [100, 200],
        'bbox': [[10, 10, 50, 50], [60, 60, 80, 80]],
        'category': [0, 1]
    }
}

Here the members of “objects” are parallel lists, with each index corresponding to one annotation.
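
For reference, if I have the datasets feature syntax right, this dict-of-lists layout is what a Sequence feature wrapping a dict of sub-features produces. A minimal sketch of such a schema (the value types are my guess):

import datasets

# Sketch of a Features schema for the "dictionary of lists" layout (Format A).
# Note: Sequence({...}) is stored and returned as a dict of lists, not a list of dicts.
features_a = datasets.Features({
    "image": datasets.Image(),
    "image_id": datasets.Value("int64"),
    "height": datasets.Value("int64"),
    "width": datasets.Value("int64"),
    "objects": datasets.Sequence({
        "id": datasets.Value("int64"),
        "area": datasets.Value("int64"),
        "bbox": datasets.Sequence(datasets.Value("float32"), length=4),
        "category": datasets.Value("int64"),  # or datasets.ClassLabel(names=[...])
    }),
})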

But I have chosen to make my dataset format:

* `image`: PIL.Image.Image object containing the image.
* `image_id`: The image ID.
* `height`: The image height.
* `width`: The image width.
* `objects`: A **list of dictionaries** containing bounding box metadata for the objects in the image:
  * `id`: The annotation id.
  * `area`: The area of the bounding box.
  * `bbox`: The object’s bounding box (in the [coco](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/#coco) format).
  * `category`: The object’s category.
{
    'image': <PIL.Image.Image>,
    'image_id': 1,
    'height': 480,
    'width': 640,
    'objects': [
        {'id': 1, 'area': 100, 'bbox': [10, 10, 50, 50], 'category': 0},
        {'id': 2, 'area': 200, 'bbox': [60, 60, 80, 80], 'category': 1}
    ]
}

because it is more easily digestible by the training algorithms (no need to change the format) and more easily converted to COCO JSON.
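
If I understand the feature syntax correctly, this list-of-dicts layout can be declared by using a plain Python list around the annotation dict instead of Sequence (which would silently turn it back into a dict of lists). A minimal sketch, again with guessed value types:

import datasets

# Sketch of a Features schema for the "list of dictionaries" layout (Format B).
# A plain Python list around the dict keeps each annotation as its own dict.
features_b = datasets.Features({
    "image": datasets.Image(),
    "image_id": datasets.Value("int64"),
    "height": datasets.Value("int64"),
    "width": datasets.Value("int64"),
    "objects": [
        {
            "id": datasets.Value("int64"),
            "area": datasets.Value("int64"),
            "bbox": datasets.Sequence(datasets.Value("float32"), length=4),
            "category": datasets.Value("int64"),
        }
    ],
})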

Q1) Are there any potential downsides to using this “list of dictionaries” format that I might be missing?
Q2) Is there a recommended “best practice” for custom object detection datasets, or is it flexible as long as the data can be correctly interpreted?

Note: I have mostly looked at object detection datasets, so my perspective on recommended practices is limited.
Note: I am using the terms “list” and “dictionary” in the Pythonic sense; the Arrow backend that the datasets library uses may use different terms.


Q1

When using Hugging Face’s datasets library, you can convert to the native format simply by using `Dataset.from_list(list_of_dicts)`, so I don’t think there is much overhead.
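
A minimal sketch of what I mean, using the example values from the question and leaving out the image column:

from datasets import Dataset

# Build a Dataset from per-image records in the "list of dicts" style.
records = [
    {
        "image_id": 1,
        "height": 480,
        "width": 640,
        "objects": [
            {"id": 1, "area": 100, "bbox": [10, 10, 50, 50], "category": 0},
            {"id": 2, "area": 200, "bbox": [60, 60, 80, 80], "category": 1},
        ],
    },
]

ds = Dataset.from_list(records)
print(ds.features)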

I wrote a simple benchmark to compare the two formats:

import datasets
import time
import random

# Paths to the datasets
DATASET_PREFIX = "./data_engine/annotated_ds"
DATASET_A_PATH = f"{DATASET_PREFIX}_A"
DATASET_B_PATH = f"{DATASET_PREFIX}_B"

RANDOM_ACCESSES = 10_000


def benchmark_all_annotations(dataset, format_type):
    start_time = time.time()
    total_annotations = 0
    if format_type == "A":
        for item in dataset:
            # Access all lists within the 'objects' dictionary
            if "objects" in item and item["objects"]:
                _ = item["objects"]
                total_annotations += len(item["objects"]["bbox"])
    elif format_type == "B":
        for item in dataset:
            # Access the list of annotation dicts under "annotations"
            if "annotations" in item and item["annotations"]:
                _ = item["annotations"]
                total_annotations += len(item["annotations"])
    end_time = time.time()
    print(f"Time to access all annotations in {format_type} format: {end_time - start_time:.4f} seconds for {total_annotations} annotations")


def benchmark_random_annotations(dataset, format_type, num_accesses=RANDOM_ACCESSES):
    start_time = time.time()
    accessed_annotations = 0
    if format_type == "A":
        # For format A, pick a random image row; accessing the row decodes the
        # example, including the parallel annotation lists under "objects"
        image_indices = list(range(len(dataset)))
        for _ in range(num_accesses):
            if not image_indices:
                break
            random_image_idx = random.choice(image_indices)
            _ = dataset[random_image_idx]
            accessed_annotations += 1
    elif format_type == "B":
        # For format B, pick a random image row; accessing the row decodes the
        # example, including its list of annotation dicts under "annotations"
        image_indices = list(range(len(dataset)))
        for _ in range(num_accesses):
            if not image_indices:
                break
            random_image_idx = random.choice(image_indices)
            _ = dataset[random_image_idx]
            accessed_annotations += 1
    end_time = time.time()
    print(f"Time to access {accessed_annotations} random annotations in {format_type} format: {end_time - start_time:.4f} seconds")


def main():
    print("Loading datasets...")
    try:
        ds_a = datasets.load_from_disk(DATASET_A_PATH)
        print(f"Dataset A loaded from {DATASET_A_PATH}")
    except Exception as e:
        print(f"Could not load Dataset A: {e}")
        ds_a = None

    try:
        ds_b = datasets.load_from_disk(DATASET_B_PATH)
        print(f"Dataset B loaded from {DATASET_B_PATH}")
    except Exception as e:
        print(f"Could not load Dataset B: {e}")
        ds_b = None

    if ds_a:
        print("\nBenchmarking Dataset A (Format A):")
        benchmark_all_annotations(ds_a, "A")
        benchmark_random_annotations(ds_a, "A")

    if ds_b:
        print("\nBenchmarking Dataset B (Format B):")
        benchmark_all_annotations(ds_b, "B")
        benchmark_random_annotations(ds_b, "B")


if __name__ == "__main__":
    main()

With a small dataset of 1159 images with bounding boxes and no segmentations:

$ py benchmark_dataset_formats.py
Loading datasets...
Dataset A loaded from ./data_engine/annotated_ds_A
Dataset B loaded from ./data_engine/annotated_ds_B

Benchmarking Dataset A (Format A):
Time to access all annotations in A format: 3.0149 seconds for 1566 annotations
Time to access 10000 random annotations in A format: 23.5742 seconds

Benchmarking Dataset B (Format B):
Time to access all annotations in B format: 2.9000 seconds for 1566 annotations
Time to access 10000 random annotations in B format: 23.5720 seconds

With a big dataset of 122218 images with bounding boxes and no segmentations:

$ py tests/benchmark_dataset_formats.py
Loading datasets...
Dataset A loaded from ./data_engine/coco_ds_A
Loading dataset from disk: 100%|█████████████████████████████████████| 18/18 [00:00<00:00, 62.18it/s]
Dataset B loaded from ./data_engine/coco_ds_B

Benchmarking Dataset A (Format A):
Time to access all annotations in A format: 176.1176 seconds for 878943 annotations
Time to access 100000 random annotations in A format: 131.2404 seconds

Benchmarking Dataset B (Format B):
Time to access all annotations in B format: 643.0321 seconds for 28124768 annotations
Time to access 100000 random annotations in B format: 503.7582 seconds

Conclusion of benchmark

There is definitely a performance difference when loading annotations. To what extent roughly 1 s per 1000 images (Format A) vs. roughly 5 s per 1000 images (Format B) matters is questionable, though. Still, Format A is the established practice, so I would recommend Format A, based on my limited knowledge.
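
If you already have data in Format B, converting it to Format A should be a one-off map. A rough sketch, assuming the annotations live under an "objects" column as in the question (my Format B dataset uses "annotations" instead, so adjust the key accordingly):

def to_dict_of_lists(example):
    # Repack the per-image list of annotation dicts (Format B)
    # into a dict of parallel lists (Format A).
    anns = example["objects"]
    return {
        "objects_a": {
            "id": [a["id"] for a in anns],
            "area": [a["area"] for a in anns],
            "bbox": [a["bbox"] for a in anns],
            "category": [a["category"] for a in anns],
        }
    }

# ds_a = ds_b.map(to_dict_of_lists, remove_columns=["objects"])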

Note: adding or changing annotations is not benchmarked here.

Note: the benchmarks were run 3 times to ensure consistency; only the first result is shown.
