Access to gated repositories

I would like to read a file from a repository:

In [1]: import pandas as pd

In [2]: url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
In [3]: df = pd.read_json(url)
[ ... clipped ... ]
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json

The above exception was the direct cause of the following exception:
[ ... clipped ... ]
GatedRepoError: 403 Client Error. (Request ID: Root=1-677e5a26-2b5127be18a8627d7ade2b28;1bb7097f-b2b1-4e2d-bb9f-fe47b4b0b984)

Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json.
Access to dataset open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details to ask for access.

Is there a programmatic way around this error? I can of course manually visit the website suggested and click the “accept” button, but it would be more convenient to do the same thing via an API – is that possible?

I’m logged in (huggingface-cli login) and my token is in my environment.
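
For reference, the relationship between the hf:// path and the HTTPS URL in the traceback can be sketched as below. This is only an illustration inferred from the error message above, not the actual fsspec implementation; the helper name hf_to_resolve_url is hypothetical, and the revision is assumed to be main.

```python
def hf_to_resolve_url(hf_path: str, revision: str = "main") -> str:
    """Map hf://datasets/<org>/<repo>/<file> to the Hub URL seen in the traceback."""
    prefix = "hf://datasets/"
    if not hf_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    # The first two components name the repository; the rest is the file path.
    org, repo, filename = hf_path[len(prefix):].split("/", 2)
    return f"https://huggingface.co/datasets/{org}/{repo}/resolve/{revision}/{filename}"


url = hf_to_resolve_url(
    "hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/"
    "Qwen__Qwen2-7B-Instruct/"
    "samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json"
)
print(url)
```

The printed URL is the one the 403 was raised for, so it is the file request, not the token setup, that the gate is rejecting.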


I guess that is happening because access is restricted, as you can see:
https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/blob/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json

I suggest you download the dataset using datasets.load_dataset() and navigate through the files.


That’s right. What I’m after is a programmatic way to accept the agreement, rather than having to visit the Hugging Face website (or using Selenium to do it for me).

I’ve found datasets.load_dataset difficult to add to automated workflows. Accessing files directly has been much more straightforward for me.


I see, it is easier to write your own logic. Still, I think the only way to access this file is via the load_dataset method.

Additionally, you can restrict which files are loaded using the data_files or data_dir parameter; for example:

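from datasets import load_dataset

# Load only the shards whose names match a glob pattern
subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

# Or load everything under one directory of the repository
subset = load_dataset("allenai/c4", data_dir="en")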
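
To see which shards a pattern like the one above would select, here is a rough sketch using the standard library's fnmatch. It assumes the C4 "en" training shards are numbered 00000 to 01023, and it only approximates the glob matching that datasets performs internally.

```python
from fnmatch import fnmatchcase

# Assumed shard naming: 1024 files, numbered with five digits.
shards = [f"en/c4-train.{i:05d}-of-01024.json.gz" for i in range(1024)]

pattern = "en/c4-train.0000*-of-01024.json.gz"
matched = [name for name in shards if fnmatchcase(name, pattern)]

print(len(matched))              # number of shards selected
print(matched[0], matched[-1])   # first and last selected shard
```

Under that naming assumption the pattern picks out the first ten shards (00000 through 00009), which is a convenient way to pull a small sample of a large dataset.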

It looks like manually requesting access is by design. From the Gated datasets documentation:

Requesting access can only be done from your browser.

I’ve made a feature request for API access here.


The current suggestion from the Hugging Face team is to use the method outlined here. There’s an undocumented endpoint from which we can “ask access.”


Hi there!
If you encounter the error GatedRepoError while trying to access a gated dataset on Hugging Face, it indicates that you don’t yet have access to the dataset, or you haven’t accepted its terms and conditions.

Steps to Resolve:

  1. Manually Accept Access:
    Visit the dataset page, e.g., open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details, and click the “Access Repository” button. Once you’ve done this, you should be able to access the dataset programmatically.

  2. Programmatic Access:
    Hugging Face currently doesn’t provide a direct API to accept dataset terms programmatically. However, here’s a way you can simplify the process:

    • Ensure Your Token is Set: Log in with huggingface-cli login, or manually set the HF_TOKEN environment variable.
    • Check Access:
      You can use the requests library in Python to confirm your access programmatically:
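      import os
      import requests
      
      # Request a file rather than the dataset page: the page can load even without access.
      url = 'https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
      headers = {'Authorization': f"Bearer {os.environ['HF_TOKEN']}"}
      # stream=True avoids downloading the body just to read the status code
      response = requests.get(url, headers=headers, stream=True)
      if response.status_code == 200:
          print("Access granted")
      else:
          print("Access denied. Visit the page to request access.")
      
      This assumes your Hugging Face token is stored in the HF_TOKEN environment variable.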
  3. Access After Authorization:
    Once you’ve been granted access (manually or after approval), you can proceed with:

    import pandas as pd
    url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
    df = pd.read_json(url)
    print(df.head())
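
In an automated workflow, the steps above could be wrapped in two small helpers, sketched here under some assumptions. The names find_token and explain_status are hypothetical. The token is assumed to live in HF_TOKEN, with the older HUGGING_FACE_HUB_TOKEN as a fallback. The meaning given to 401 is my assumption; only the 403 case comes from the traceback in this thread.

```python
import os


def find_token(env=os.environ):
    """Return the first Hugging Face token found in the environment, or None."""
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(name)
        if token:
            return token
    return None


def explain_status(status_code: int) -> str:
    """Turn the status of a file request into a hint for the workflow log."""
    if status_code == 200:
        return "access granted"
    if status_code == 401:
        # Assumption: the request carried no valid token.
        return "not authenticated: check that a token is set"
    if status_code == 403:
        # Matches the GatedRepoError above: logged in, but terms not accepted.
        return "gated: accept the terms on the dataset page in a browser"
    return f"unexpected status {status_code}"


print(explain_status(403))
```

A script can call find_token() before making any request and fail early with a clear message, instead of surfacing a long traceback from pandas.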
    

Hope this helps!
