Hello,
Is there a non-hacky way to programmatically retrieve the lower/upper bounds of each accepted dataset size category, as a list of pairs?
I found that parsing the docstring in the huggingface_hub library gets me the information, but…
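```python
import re

import requests


def get_size_categories():
    """Scrape the size category labels from the huggingface_hub docstring."""
    url = "https://raw.githubusercontent.com/huggingface/huggingface_hub/main/src/huggingface_hub/repocard_data.py"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    content = response.text
    # Capture the text between "Options are:" and the next period
    # in the size_categories docstring.
    pattern = r"size_categories.*?Options are: (.*?)\."
    match = re.search(pattern, content, re.DOTALL)
    if match:
        categories_str = match.group(1)
        # Each label is wrapped in single quotes.
        categories = re.findall(r"'([^']*)'", categories_str)
        return categories
    else:
        return []


size_categories = get_size_categories()
print(size_categories)
```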
This returns:
['n<1K', '1K<n<10K', '10K<n<100K', '100K<n<1M', '1M<n<10M', '10M<n<100M', '100M<n<1B', '1B<n<10B', '10B<n<100B', '100B<n<1T', 'n>1T', 'other']
It is super dirty for two main reasons:
- it relies on the docstring being correct (and kept up to date)
- it requires further parsing to get the numerical bounds (from human-friendly abbreviations such as 'K', 'M', 'T', etc.)
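
To make the second point concrete, here is a rough sketch of what that extra parsing could look like. It does not solve the first problem (it still depends on wherever the labels come from), and everything in it is my own assumption rather than anything provided by huggingface_hub: the helper names `parse_size_category` and `_to_int`, the suffix table, treating the lower bound of 'n<1K' as 0, and returning None for labels without numeric bounds such as 'other'. Whether each bound is inclusive or exclusive is not stated in the labels, so the sketch does not try to decide that.

```python
import math
import re

# Assumed multipliers for the human-friendly suffixes used in the labels.
SUFFIXES = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def _to_int(token):
    """Convert a token such as '10K' or '1T' to an integer (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([KMBT]?)", token)
    if match is None:
        raise ValueError(f"unrecognized size token: {token!r}")
    return int(match.group(1)) * SUFFIXES[match.group(2)]


def parse_size_category(label):
    """Return (lower, upper) for a label, or None if it has no numeric bounds."""
    if m := re.fullmatch(r"n<(\w+)", label):
        # Open-ended on the low side: assume 0 as the lower bound.
        return (0, _to_int(m.group(1)))
    if m := re.fullmatch(r"n>(\w+)", label):
        # Open-ended on the high side: use infinity as the upper bound.
        return (_to_int(m.group(1)), math.inf)
    if m := re.fullmatch(r"(\w+)<n<(\w+)", label):
        return (_to_int(m.group(1)), _to_int(m.group(2)))
    return None  # e.g. 'other'


labels = ['n<1K', '1K<n<10K', '10K<n<100K', '100K<n<1M', '1M<n<10M',
          '10M<n<100M', '100M<n<1B', '1B<n<10B', '10B<n<100B',
          '100B<n<1T', 'n>1T', 'other']

# Keep only the labels that map to a numeric range.
bounds = {label: parse_size_category(label) for label in labels}
bounds = {label: pair for label, pair in bounds.items() if pair is not None}
print(bounds)
```

This works on the labels listed above, but it would silently need updating if a new suffix or label shape were ever introduced, which is exactly why I would prefer a source that exposes the bounds directly.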