Access to gated repositories

jerome-white · January 8, 2025, 12:06pm

I would like to read a file from a repository:

In [1]: import pandas as pd

In [2]: url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
In [3]: df = pd.read_json(url)
[ ... clipped ... ]
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json

The above exception was the direct cause of the following exception:
[ ... clipped ... ]
GatedRepoError: 403 Client Error. (Request ID: Root=1-677e5a26-2b5127be18a8627d7ade2b28;1bb7097f-b2b1-4e2d-bb9f-fe47b4b0b984)

Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json.
Access to dataset open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details to ask for access.

Is there a programmatic way around this error? I can of course manually visit the website suggested and click the “accept” button, but it would be more convenient to do the same thing via an API – is that possible?

I’m logged in (huggingface-cli login) and my token is in my environment.

leomaurodesenv · January 8, 2025, 12:18pm

I guess that is happening because access is restricted, as you can see:
https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/blob/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json

I suggest you download the dataset using datasets.load_dataset() and navigate through the files.

jerome-white · January 8, 2025, 1:40pm

That’s right. What I’m after is a programmatic way to accept the agreement, rather than having to visit the Hugging Face website (or using Selenium to do it for me).

I’ve found datasets.load_dataset difficult to add to automated workflows. Accessing files directly has been much more straightforward for me.

leomaurodesenv · January 8, 2025, 2:32pm

I see, it is easier to create your own logic. Anyway, I would say this file can only be accessed via the load_dataset method.

Additionally, you can filter the data you are looking for using the data_files or data_dir parameters; for example:

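from datasets import load_dataset

# Load only the files that match a glob pattern
subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

# Or load everything under one directory of the repository
subset = load_dataset("allenai/c4", data_dir="en")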

jerome-white · January 9, 2025, 5:08am

It looks like manually requesting access is by design. From the Gated datasets documentation:

Requesting access can only be done from your browser.

I’ve made a feature request for API access here.

jerome-white · January 10, 2025, 5:47am

The current suggestion from the Hugging Face team is to use the method outlined here. There’s an undocumented endpoint from which we can “ask access.”
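For anyone who wants to try this, here is a minimal sketch of what such a request might look like. The `/ask-access` path is an assumption inferred from the "ask access" wording above; it is undocumented, so it may differ or change without notice. `build_access_request` is a hypothetical helper name, and the sketch only builds the request rather than sending it.

```python
import urllib.request


def build_access_request(repo_id: str, token: str,
                         repo_type: str = "datasets") -> urllib.request.Request:
    """Build (but do not send) a POST request asking for access to a gated repo.

    ASSUMPTION: the "/ask-access" path is a guess at the undocumented
    endpoint mentioned in this thread; it is not a documented API.
    """
    url = f"https://huggingface.co/{repo_type}/{repo_id}/ask-access"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; gates that ask extra questions may need form fields
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# Usage (left commented out so nothing is sent by accident):
# import os
# req = build_access_request(
#     "open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details",
#     os.environ["HF_TOKEN"],
# )
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Repositories whose gate asks for extra information, or where the owner approves requests manually, would probably need more than this.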

Alanturner2 · January 10, 2025, 8:10am

Hi there!
If you encounter the error GatedRepoError while trying to access a gated dataset on Hugging Face, it indicates that you don’t yet have access to the dataset, or you haven’t accepted its terms and conditions.

Steps to Resolve:

Manually Accept Access:
Visit the dataset page, e.g., open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details, and click the “Access Repository” button. Once you’ve done this, you should be able to access the dataset programmatically.
Programmatic Access:
Hugging Face currently doesn’t provide a direct API to accept dataset terms programmatically. However, here’s a way you can simplify the process:
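- Ensure Your Token is Set: Log in with huggingface-cli login, or manually set the HF_TOKEN environment variable.
- Check Access:
  You can use the requests library in Python to confirm your access programmatically. Request a file inside the repository rather than the dataset page, because the page itself loads even when access has not been granted:
```
import os
import requests

# Read the token from the environment instead of hard-coding it
token = os.environ["HF_TOKEN"]

# A file inside the gated repo (the resolve URL from the traceback above)
url = 'https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
headers = {'Authorization': f'Bearer {token}'}
# stream=True fetches only the status and headers until the body is read
response = requests.get(url, headers=headers, stream=True, timeout=30)
if response.status_code == 200:
    print("Access granted")
else:
    print(f"Access denied (HTTP {response.status_code}). Visit the page to request access.")
response.close()
```
  The token is read from the HF_TOKEN environment variable, so set it before running the snippet.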
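The traceback earlier in the thread shows how an hf:// path maps onto an https "resolve" URL, which is the URL that actually returns the 403. A small sketch of that mapping, in case you want to check file-level access for any hf:// dataset path: `hf_path_to_resolve_url` is a hypothetical helper, and it assumes the path has no @revision suffix and that the repo id has the usual owner/name form.

```python
def hf_path_to_resolve_url(hf_path: str, revision: str = "main") -> str:
    """Turn 'hf://datasets/<owner>/<name>/<file path>' into its https resolve URL.

    Assumes a dataset repo with an owner/name id and no '@revision' in the path.
    """
    prefix = "hf://datasets/"
    if not hf_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    parts = hf_path[len(prefix):].split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError("expected <owner>/<name>/<file path>")
    repo_id = "/".join(parts[:2])    # owner/name
    file_path = "/".join(parts[2:])  # path of the file inside the repo
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{file_path}"


# Hypothetical example path, for illustration only
print(hf_path_to_resolve_url("hf://datasets/some-owner/some-dataset/subdir/file.json"))
```

The resulting URL can then be requested with your token in the Authorization header; a 403 response means the gate has not been accepted yet.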

Access After Authorization:
Once you’ve been granted access (manually or after approval), you can proceed with:

import pandas as pd
url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
df = pd.read_json(url)
print(df.head())

Hope this helps!
