List of `size_categories`

Aurelien-Morgan · December 21, 2024, 11:49am

Hello,

Is there a non-hacky way to programmatically retrieve a list of lower/upper bound pairs for the different accepted dataset size categories?

I found that parsing the docstring from the huggingface_hub library gets the information, but…

import requests
import re

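def get_size_categories():
    # Fetch the source file whose docstring lists the accepted values.
    url = "https://raw.githubusercontent.com/huggingface/huggingface_hub/main/src/huggingface_hub/repocard_data.py"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    content = response.text

    # Capture the text between "Options are:" and the next period
    # in the size_categories docstring entry.
    pattern = r"size_categories.*?Options are: (.*?)\."
    match = re.search(pattern, content, re.DOTALL)
    if not match:
        return []

    # Each option is single-quoted in the docstring.
    return re.findall(r"'([^']*)'", match.group(1))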

size_categories = get_size_categories()
print(size_categories)

This returns:

['n<1K', '1K<n<10K', '10K<n<100K', '100K<n<1M', '1M<n<10M', '10M<n<100M', '100M<n<1B', '1B<n<10B', '10B<n<100B', '100B<n<1T', 'n>1T', 'other']

It is super dirty for two main reasons:

- it relies on the docstring being right (and maintained)
- it requires further parsing to get the numerical bound values (from human-friendly abbreviations with ‘K’, ‘T’, etc…)
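
For the second point, the further parsing could look roughly like the sketch below. It is only an illustration, not part of huggingface_hub: the helper names `_to_int` and `parse_size_category` are made up here, the suffixes are assumed to be decimal (K = 10**3, M = 10**6, B = 10**9, T = 10**12), the lower bound of `n<1K` is assumed to be 0, and an open upper bound is represented as `None`.

```python
import re

# Assumed decimal multipliers for the suffixes used in the labels.
_SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def _to_int(token):
    """Convert a token such as '10K' or '1T' to an integer (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([KMBT]?)", token)
    if match is None:
        raise ValueError(f"Unrecognized size token: {token!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


def parse_size_category(label):
    """Return (lower, upper) for a size label; None marks an open upper end.

    Labels that are not ranges (e.g. 'other') return None.
    """
    if m := re.fullmatch(r"n<(\w+)", label):
        return (0, _to_int(m.group(1)))  # lower bound of 0 is an assumption
    if m := re.fullmatch(r"n>(\w+)", label):
        return (_to_int(m.group(1)), None)
    if m := re.fullmatch(r"(\w+)<n<(\w+)", label):
        return (_to_int(m.group(1)), _to_int(m.group(2)))
    return None


# Usage with a few of the labels returned above.
for label in ["n<1K", "1K<n<10K", "100M<n<1B", "n>1T", "other"]:
    print(label, parse_size_category(label))
```

This still depends on the label format staying stable, so it does not address the first point.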
