I would like to read a file from a repository:
In [1]: import pandas as pd
In [2]: url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
In [3]: df = pd.read_json(url)
[ ... clipped ... ]
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json
The above exception was the direct cause of the following exception:
[ ... clipped ... ]
GatedRepoError: 403 Client Error. (Request ID: Root=1-677e5a26-2b5127be18a8627d7ade2b28;1bb7097f-b2b1-4e2d-bb9f-fe47b4b0b984)
Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json.
Access to dataset open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details to ask for access.
Is there a programmatic way around this error? I can of course manually visit the website suggested and click the “accept” button, but it would be more convenient to do the same thing via an API – is that possible?
I’m logged in (huggingface-cli login) and my token is in my environment.
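For reference, the traceback shows how the hf:// path is turned into an ordinary https "resolve" URL before the request is made. Below is a minimal sketch of that mapping, assuming the default main branch and a path without an explicit revision; hf_path_to_resolve_url is a hypothetical helper written for illustration, not a function from pandas or huggingface_hub.

```python
def hf_path_to_resolve_url(hf_path, revision="main"):
    """Translate an hf://datasets/... path into its https resolve URL.

    Hypothetical helper: assumes no revision is embedded in the path.
    """
    prefix = "hf://datasets/"
    if not hf_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    # The first two components are the owner and the repository name;
    # everything after them is the file path inside the repository.
    owner, repo, file_path = hf_path[len(prefix):].split("/", 2)
    return (
        f"https://huggingface.co/datasets/{owner}/{repo}"
        f"/resolve/{revision}/{file_path}"
    )


hf_path = (
    "hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/"
    "Qwen__Qwen2-7B-Instruct/"
    "samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json"
)
print(hf_path_to_resolve_url(hf_path))
```

The printed URL should match the one reported in the 403 error, which is the URL the gate is actually applied to.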
That’s right. What I’m after is a programmatic way to accept the agreement, rather than having to visit the HuggingFace website (or using Selenium to do it for me).
I’ve found datasets.load_dataset difficult to add to automated workflows. Accessing files directly has been much more straightforward for me.
I see, it is easier to write your own logic. Still, I would say this file can only be accessed through the load_dataset method.
Additionally, you can select just the files you are looking for with the data_files parameter, or a whole directory with data_dir; for example:
from datasets import load_dataset
subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
subset = load_dataset("allenai/c4", data_dir="en")
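To see which files a data_files glob like the one above would pick up, here is a small sketch using Python's standard fnmatch module as a stand-in. This is an assumption for illustration only: datasets resolves patterns with its own machinery, and the shard names below are hypothetical examples that follow the naming scheme implied by the pattern.

```python
from fnmatch import fnmatch

pattern = "en/c4-train.0000*-of-01024.json.gz"

# Hypothetical shard names following the naming scheme implied by the pattern.
candidates = [
    "en/c4-train.00000-of-01024.json.gz",
    "en/c4-train.00009-of-01024.json.gz",
    "en/c4-train.00010-of-01024.json.gz",
    "en/c4-validation.00000-of-00008.json.gz",
]

# Keep only the names that the glob matches.
selected = [name for name in candidates if fnmatch(name, pattern)]
print(selected)
```

In this sketch only the first two names are selected, because "0000*" requires the shard number to start with four zeros.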
It looks like manually requesting access is by design. From the Gated datasets documentation:
Requesting access can only be done from your browser.
I’ve made a feature request for API access here.
The current suggestion from the Hugging Face team is to use the method outlined here. There’s an undocumented endpoint from which we can “ask access.”
Hi there!
If you encounter a GatedRepoError while trying to access a gated dataset on Hugging Face, it means that you don’t yet have access to the dataset or haven’t accepted its terms and conditions.
Steps to Resolve:
- Manually Accept Access: Visit the dataset page, e.g., open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details, and click the “Access Repository” button. Once you’ve done this, you should be able to access the dataset programmatically.
- Programmatic Access: Hugging Face currently doesn’t provide a documented API to accept dataset terms programmatically. However, here’s a way you can simplify the process:
  - Ensure Your Token is Set: Log in with huggingface-cli login, or manually set the HF_TOKEN environment variable.
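  - Check Access: You can use the requests library in Python to confirm your access programmatically. Request a file inside the repository rather than the dataset page, since the page itself can load even when the files are gated:
import requests
# A file inside the gated repo (the URL from the traceback above)
url = 'https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
headers = {'Authorization': f'Bearer {your_huggingface_token}'}
# HEAD avoids downloading the file; requests.head does not follow redirects by default
response = requests.head(url, headers=headers)
if response.ok:  # any status below 400, including a redirect to file storage
    print("Access granted")
else:
    print("Access denied. Visit the page to request access.")
Replace your_huggingface_token with your actual Hugging Face token.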
- Access After Authorization: Once you’ve been granted access (manually or after approval), you can proceed with:
import pandas as pd
url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
df = pd.read_json(url)
print(df.head())
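If you want the access check to say why it failed rather than just "denied", the sketch below may help. It is only a rough illustration: the mapping from status codes to causes is an assumption based on the 403 in the traceback and ordinary HTTP semantics, the helper names (auth_headers, interpret_status, check_file_access) are made up for this example, and the token is assumed to live in the HF_TOKEN environment variable.

```python
import os


def auth_headers(token):
    """Build the bearer-token header; empty when no token is available."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def interpret_status(status_code):
    """Map an HTTP status from a file request to a short diagnosis.

    The causes listed here are assumptions, not documented Hub behavior.
    """
    if status_code < 400:
        return "access granted"
    if status_code == 401:
        return "not authenticated: token missing or invalid"
    if status_code == 403:
        return "forbidden: accept the dataset terms in the browser first"
    if status_code == 404:
        return "file or repository not found"
    return f"unexpected status {status_code}"


def check_file_access(url, token=None):
    """Send a HEAD request for one file and describe the outcome."""
    import requests  # third-party; imported lazily so the helpers above work without it

    token = token or os.environ.get("HF_TOKEN")
    # requests.head does not follow redirects by default, so a redirect to
    # file storage is reported as a status below 400.
    response = requests.head(url, headers=auth_headers(token), timeout=30)
    return interpret_status(response.status_code)
```

Calling check_file_access with the https resolve URL from the traceback would then return one of the messages above, which can be logged by an automated workflow before it attempts pd.read_json.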
Hope this helps!