As you may have noticed, each dataset uploaded to the Hugging Face Hub is parsed automatically, and the relevant information is available at a URL like this:
https://datasets-server.huggingface.co/info?dataset=lhoestq/demo1
I’m interested in understanding how this parsing is performed, but I don’t know where to look for the source code.
It doesn’t seem to be in the datasets library. Maybe it’s in the huggingface_hub library? Or maybe it’s not open-source?
Any pointer would be appreciated
I don’t know much about datasets, but there are a few projects on GitHub that seem to be related.
Not all of the services inside the HF server are open source, though. I guess that’s just how it has to be for security reasons.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
Thanks for the reply.
It looks like it is somewhere in dataset-viewer indeed:
# Get dataset information
The dataset viewer provides an `/info` endpoint for exploring general information about a dataset, including fields such as the description, citation, homepage, license, and features.
The `/info` endpoint accepts two query parameters:
- `dataset`: the dataset name
- `config`: the subset name
```python
import requests

# A User Access Token is only required for gated or private datasets;
# for public datasets the Authorization header can be omitted.
API_TOKEN = "hf_xxx"  # placeholder: your token from the Hub settings
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/info?dataset=ibm/duorc&config=SelfRC"

def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
```
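For anyone else landing here: the metadata comes back under a top-level `dataset_info` key. A minimal sketch of reading it, using a hypothetical trimmed sample rather than real API output:

```python
# Hypothetical, trimmed example of the /info JSON shape (the real
# payload nests everything under "dataset_info"; the exact fields and
# values here are illustrative, not real output).
sample = {
    "dataset_info": {
        "description": "",
        "license": "",
        "features": {
            "plot": {"dtype": "string", "_type": "Value"},
            "title": {"dtype": "string", "_type": "Value"},
        },
        "splits": {"train": {"name": "train"}},
    }
}

info = sample["dataset_info"]
feature_names = sorted(info["features"])
print(feature_names)         # ['plot', 'title']
print(list(info["splits"]))  # ['train']
```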