Huggingface_hub list_datasets call

Hi everyone,

I’m working with the huggingface_hub client library, which works really smoothly!

The reason for this post is that I noticed that when calling the function list_datasets with the parameter full=True, the siblings field (which holds the names of the repository files) is always None.
However, when calling the function list_repo_files with the parameter repo_type set to “dataset”, we can retrieve all the files in the repository.

The entries of the siblings attribute are of the class ModelFile. Is a DatasetFile implementation in progress, or is there another way to retrieve the filenames of a dataset repository with the list_datasets call?

Thank you in advance

Hello there,

The list_datasets function is used to filter all datasets on the Hub with a given filter, whereas list_repo_files lists the files of a single repository, which is why it gives you the siblings. Since they serve different purposes at different scopes, I’d suggest using list_repo_files to list the siblings of a given repository.
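As a rough sketch of that suggestion (not an official recipe), the two calls can be combined: list_datasets to pick the repositories, then list_repo_files with repo_type="dataset" for each one. The helper names collect_files and dataset_files_from_hub are made up for this example, and the search and limit arguments are assumed to be available in your installed version of huggingface_hub.

```python
from typing import Callable, Dict, Iterable, List


def collect_files(
    repo_ids: Iterable[str],
    list_files: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each repo id to its filenames using the supplied lister."""
    return {repo_id: list(list_files(repo_id)) for repo_id in repo_ids}


def dataset_files_from_hub(search: str, limit: int = 3) -> Dict[str, List[str]]:
    """List a few datasets, then fetch each one's files (one request per repo)."""
    # Imported lazily so collect_files stays usable without the library installed.
    from huggingface_hub import list_datasets, list_repo_files

    # Assumption: list_datasets accepts `search` and `limit` in your version.
    repo_ids = [info.id for info in list_datasets(search=search, limit=limit)]
    return collect_files(
        repo_ids,
        lambda repo_id: list_repo_files(repo_id, repo_type="dataset"),
    )
```

Calling dataset_files_from_hub("squad") would hit the Hub once for the listing and once per repository, so it is worth keeping limit small.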

Hi @merve,

I understand the idea here, but what I was wondering is why list_datasets has a field that never gives back the file list.

Hi! I agree we should either fetch this info with full=True or remove the field. cc @Wauplin

The idea is that we have in huggingface_hub a ModelInfo object to describe a model. Depending on the use case, not all information is fetched from the server (in particular, the files of each repo are not listed when listing all repos). If a piece of information is not fetched, it is set to None. Maybe not optimal, but I’m not sure we want to change this anytime soon. One way to change this would be to have different classes ModelInfoXXX depending on the context, but I’m not sure that would be easier to use from a user’s perspective.
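On the user side, one way to live with this is to treat None as "not fetched" rather than "no files" and fall back to a separate request. The snippet below is only a sketch of that pattern: RepoInfoSketch is a hypothetical stand-in, not the library's ModelInfo or DatasetInfo, and filenames is a made-up helper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepoInfoSketch:
    """Hypothetical stand-in for an info object; not the library's class."""

    id: str
    # None means "not fetched from the server", which differs from an empty list.
    siblings: Optional[List[str]] = None


def filenames(info: RepoInfoSketch, fetch: Callable[[str], List[str]]) -> List[str]:
    """Return the known filenames, calling `fetch` only if they were not fetched."""
    if info.siblings is not None:
        return list(info.siblings)
    return list(fetch(info.id))
```

With the real library, fetch could be something like lambda repo_id: list_repo_files(repo_id, repo_type="dataset").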
