Does the REST API work with private repos?

I was experimenting with the REST API on a private repo. Despite providing my user access token in the request header, I receive an error:

import os
import requests
from dotenv import load_dotenv
load_dotenv()
per_token = os.getenv('API_PER_TOKEN')
headers = {"Authorization": f"Bearer {per_token}"}
API_URL = "https://datasets-server.huggingface.co/is-valid?dataset=sl02/np-datasets"
def query():
    response = requests.request("GET", API_URL, headers=headers)
    return response.json()
data = query()

{'error': 'The dataset does not exist, or is not accessible without authentication (private or gated). Please retry with authentication.'}
However, when I make the repository public, it returns {'valid': True}. But when I call the first-rows endpoint, I get the following message:

import os
import requests
from dotenv import load_dotenv
load_dotenv()
per_token = os.getenv('API_PER_TOKEN')
headers = {"Authorization": f"Bearer {per_token}"}
API_URL = "https://datasets-server.huggingface.co/first-rows?dataset=sl02/np-datasets&config=default&split=train"
def query():
    response = requests.request("GET", API_URL, headers=headers)
    return response.json()
data = query()

{'error': 'The response is not ready yet. Please retry later.'}

load_dataset() works with the private repo when I set the use_auth_token argument. Any clue what I am missing here?


Maybe @severo knows more, but IIRC the REST API is not available yet for private repos.

Hi @sl02. The REST API follows the same rule as the dataset viewer (see The Dataset Preview has been disabled on this dataset - #6 by severo): it’s not available at all for private datasets for now.

Re “The response is not ready yet. Please retry later”: the responses to the API endpoints are pre-computed asynchronously and can take some time to be processed, depending on the dataset itself and on the load of the servers.
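Since the responses are computed asynchronously, one way to handle the "not ready" message is to poll with increasing delays. This is only a minimal sketch, not an official client: the helper name `wait_until_ready`, the attempt count, and the delays are all assumptions chosen for illustration, and the error string is simply the one quoted in the post above.

```python
import time

# Error text copied from the post above; the server may word it differently.
NOT_READY = "The response is not ready yet. Please retry later."


def wait_until_ready(fetch, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it stops returning the 'not ready' error.

    fetch is a zero-argument callable that returns the decoded JSON payload
    (a dict). The first payload that is not the 'not ready' error is returned,
    even if it is another kind of error. TimeoutError is raised if every
    attempt is still not ready.
    """
    for attempt in range(max_attempts):
        payload = fetch()
        if payload.get("error") != NOT_READY:
            return payload
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise TimeoutError(f"still not ready after {max_attempts} attempts")
```

With the code from the original post, it could be called as `wait_until_ready(lambda: requests.get(API_URL, headers=headers).json())`. Large datasets may need more attempts or longer delays than these defaults.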

Hello! The dataset preview is now available for Pro accounts. Shouldn't that also be the case for the API? I cannot do something as simple as retrieving the Parquet URLs. Thanks!

import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}

response = requests.get(f"https://datasets-server.huggingface.co/parquet?dataset={dataset_name}", headers=headers)
json_data = response.json()

urls = [f['url'] for f in json_data['parquet_files'] if f['split'] == 'test']

Update

So now this works:

from datasets import load_dataset
import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = f"https://huggingface.co/api/datasets/{dataset_name}/parquet"

def query():
    response = requests.get(API_URL, headers=headers)
    json_data = response.json()["default"]
    return json_data

urls = query()
print(urls)

However, passing the retrieved URLs to load_dataset() does not work; it raises a FileNotFoundError:

test_dataset = load_dataset("parquet",
                            data_files={"test": urls["test"]},
                            split="test",
                            token=API_TOKEN
                            )

The only solution I have found so far is to manually download the retrieved URLs, something like:

# Manually download the files
import shutil
from tqdm.auto import tqdm

parquet_files = []

for n, url in tqdm(enumerate(urls["test"]), total=len(urls["test"])):
    # Stream each file with the auth header so the private repo is readable
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()  # fail early instead of saving an error page as .parquet
    with open(f"{n}.parquet", "wb") as f:
        shutil.copyfileobj(response.raw, f)
    parquet_files.append(f"{n}.parquet")

# Load dataset
test_dataset = load_dataset("parquet", data_files=parquet_files)

print(test_dataset)

Hi! You can load the Parquet files from the repo directly:

load_dataset(dataset_name, revision="refs/convert/parquet")

And if you want to load specific files, you can pass data_files=[...] (btw, it accepts glob patterns).
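To make the data_files suggestion concrete, here is a small sketch that combines the Parquet branch with an optional glob pattern. The helper names `parquet_branch_kwargs` and `load_parquet_branch` are hypothetical, and it assumes a `datasets` version that accepts the `token` argument (older releases used `use_auth_token`, as in the original post).

```python
def parquet_branch_kwargs(dataset_name, token, pattern=None):
    """Build keyword arguments for datasets.load_dataset on the Parquet branch."""
    kwargs = {
        "path": dataset_name,
        "revision": "refs/convert/parquet",  # branch holding the converted files
        "token": token,                       # needed for a private repo
    }
    if pattern is not None:
        # data_files accepts a list of paths or glob patterns
        kwargs["data_files"] = [pattern]
    return kwargs


def load_parquet_branch(dataset_name, token, pattern=None):
    """Load the auto-converted Parquet files, optionally filtered by a glob."""
    # Imported lazily so the kwargs helper is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**parquet_branch_kwargs(dataset_name, token, pattern))
```

For example, `load_parquet_branch(dataset_name, API_TOKEN, "**/test/*.parquet")` would try to select only the test files. That pattern is an assumption about how the branch is laid out, so check the actual file names on the refs/convert/parquet branch first.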


Thanks! I still receive FileNotFoundError. The issue, as in the original post, is that the repository is private. It is my repository, and I am logged in with an access token.


Can you check that your token has the right permissions? I just tried on my side, and I couldn’t reproduce the FileNotFoundError on the Parquet branch of a private repo with a token.
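One possible way to check this is to ask the Hub for the repo metadata at the Parquet revision using the same token. The sketch below is only a diagnostic under assumptions: `can_read_parquet_branch` is a hypothetical helper, and it expects an object with a `dataset_info(repo_id, revision=...)` method, such as `huggingface_hub.HfApi(token=API_TOKEN)`.

```python
def can_read_parquet_branch(api, repo_id, revision="refs/convert/parquet"):
    """Return (ok, detail) for a read attempt on the Parquet branch.

    On success, detail is the list of file names visible at that revision.
    On failure, detail is a short description of the exception raised.
    """
    try:
        info = api.dataset_info(repo_id, revision=revision)
    except Exception as exc:
        # Broad catch on purpose: this is a diagnostic, and the error type
        # (repo not found, revision not found, unauthorized) is the useful part.
        return False, f"{type(exc).__name__}: {exc}"
    # siblings may be missing on some responses, so fall back to an empty list
    return True, [sibling.rfilename for sibling in (info.siblings or [])]
```

Usage would look like `can_read_parquet_branch(HfApi(token=API_TOKEN), dataset_name)`. If this fails while the same call without `revision` succeeds, the branch may not exist yet rather than the token lacking permissions.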
