List of `size_categories`

Hello,

Is there a non-hacky way to programmatically retrieve the lower/upper bound pairs for each of the accepted dataset size categories?

I found that parsing the docstring from the huggingface_hub library gets me the info, but…

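import re

import requests

def get_size_categories():
    # Fetch the source file whose docstring lists the accepted size categories.
    url = "https://raw.githubusercontent.com/huggingface/huggingface_hub/main/src/huggingface_hub/repocard_data.py"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    content = response.text

    # Capture the text between "Options are: " and the next period in the size_categories entry.
    pattern = r"size_categories.*?Options are: (.*?)\."
    match = re.search(pattern, content, re.DOTALL)

    if match:
        categories_str = match.group(1)
        # Each category is wrapped in single quotes.
        categories = re.findall(r"'([^']*)'", categories_str)
        return categories
    else:
        return []

size_categories = get_size_categories()
print(size_categories)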

This returns:

['n<1K', '1K<n<10K', '10K<n<100K', '100K<n<1M', '1M<n<10M', '10M<n<100M', '100M<n<1B', '1B<n<10B', '10B<n<100B', '100B<n<1T', 'n>1T', 'other']

It is super dirty for two main reasons:

  • it relies on the docstring being correct (and kept up to date)
  • it requires further parsing to get the numerical bound values (from human-friendly abbreviations such as ‘K’, ‘T’, etc.)
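
To illustrate the second point, here is a minimal sketch of the extra parsing that would be needed. It is only an assumption of how the labels should be read: the names `SUFFIXES`, `to_int` and `parse_bounds` are made up for this example, the suffixes are assumed to be decimal (K = 10^3, M = 10^6, B = 10^9, T = 10^12), and the labels do not say whether the bounds are inclusive.

```python
import math
import re

# Assumed decimal multipliers for the suffixes that appear in the labels.
SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
NUMBER = re.compile(r"(\d+)([KMBT]?)")


def to_int(token):
    """Convert an abbreviated number such as '10K' or '1T' to an int."""
    match = NUMBER.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognised number: {token!r}")
    digits, suffix = match.groups()
    return int(digits) * SUFFIXES[suffix]


def parse_bounds(label):
    """Return (lower, upper) for a size label, or None if it has no numeric range."""
    if label.startswith("n<"):
        return (0, to_int(label[2:]))
    if label.startswith("n>"):
        return (to_int(label[2:]), math.inf)
    parts = label.split("<n<")
    if len(parts) == 2:
        return (to_int(parts[0]), to_int(parts[1]))
    return None  # e.g. 'other'


labels = ["n<1K", "1K<n<10K", "100B<n<1T", "n>1T", "other"]
bounds = {label: parse_bounds(label) for label in labels}
print(bounds)
```

This still depends on the label format staying stable, so it does not solve the first point.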