Hi,
I am trying to use the duckdb feature described here which allows you to read a table from a private dataset.
I use the following code:
import duckdb

with duckdb.connect() as con:
    # Register the HF token so hf:// URIs can be read, then load the dataset into a table.
    con.execute(f"""CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN '{hf_token}');""")
    con.execute("""CREATE TABLE IF NOT EXISTS documents (...)""")
    df = con.execute(f"""SELECT * FROM '{dataset_uri}';""").df()
    con.sql("INSERT INTO documents SELECT * FROM df")
I have tested it, and it runs in a Docker image with the exact same code, variables, and requirements. However, when running it in the Space it throws an HTTP 401 error:
HTTPException: HTTP Error: HTTP GET error on 'https://huggingface.co/datasets/user/dataset/data/*.parquet' (HTTP 401)
Could you please help me understand the issue, or, if I have missed a similar forum post, point me to it?
Thanks!
The contents of the dataset_uri variable are probably wrong. I think you were trying to use a wildcard specification, but the program is trying to access the file https://huggingface.co/datasets/user/dataset/data/*.parquet as if it actually exists.
It’s difficult to specify wildcards for HF datasets, but you can use the following function to get a list of the dataset files.
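For example, something along these lines. This is only a sketch: it assumes the huggingface_hub package, uses user/dataset as a placeholder repo id, and expects your read token in hf_token; it may not be exactly the helper you end up using.

import duckdb
from huggingface_hub import HfApi

# List every file in the (private) dataset repo and keep only the parquet files.
api = HfApi()
files = [
    f for f in api.list_repo_files("user/dataset", repo_type="dataset", token=hf_token)
    if f.endswith(".parquet")
]
uris = [f"hf://datasets/user/dataset/{f}" for f in files]

with duckdb.connect() as con:
    con.execute(f"CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN '{hf_token}');")
    # read_parquet accepts an explicit list of files, so no wildcard is needed.
    file_list = ", ".join(f"'{u}'" for u in uris)
    df = con.execute(f"SELECT * FROM read_parquet([{file_list}])").df()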
So, a bit more info that I should probably have provided in the initial issue. The dataset URI is of the form:
hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
I have tried pointing to a single file like so:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-parquet-1/data/train-00000-of-00001.parquet';
or to the whole set of parquet files of the dataset like so:
SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/*.parquet';
Indeed, I have tried this: using both a local setup and a Docker image, I can read from the dataset with the code provided in the initial post.
Therefore, my suspicion is that it is either an issue with the Space's privileges or with variable handling.
To be more precise, environment variables are handled with the python-dotenv library via load_dotenv(); dataset_uri is then initialized with:
import os
from dotenv import load_dotenv
load_dotenv()
dataset_uri = os.getenv("dataset_uri")
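For completeness, a quick sanity check along these lines (just a sketch, not part of the actual code) would show whether both values actually resolve inside the Space:

import os
from dotenv import load_dotenv

load_dotenv()
# Print only whether the values are present, not the values themselves, to avoid leaking the token in logs.
print("dataset_uri set:", os.getenv("dataset_uri") is not None)
print("HF_TOKEN set:", os.getenv("HF_TOKEN") is not None)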
Just to confirm, if you are using a private or gated repo, you will need your read token.
The following settings seem to be required for this library.
Some HF libraries may implicitly use the HF_TOKEN environment variable, but third-party libraries usually require it to be explicitly specified.
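One quick way to check that the token itself is valid inside the Space (a sketch, assuming huggingface_hub is installed and the token is exposed as HF_TOKEN):

import os
from huggingface_hub import whoami

hf_token = os.getenv("HF_TOKEN")
# Raises if the token is missing or invalid; otherwise returns the account info.
print(whoami(token=hf_token))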
Indeed, I am following the steps provided in the documentation you linked to (it was in my initial post, and all the examples were taken from there).
I am using a private dataset repo; hf_token (with read permissions) is a Space secret, which is then loaded with python-dotenv:
load_dotenv()
hf_token = os.getenv("HF_TOKEN")

with duckdb.connect() as con:
    con.execute(f"""CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN '{hf_token}');""")
Again, this approach works locally and with a Docker image, so my suspicion is that there might be some privilege issue in Spaces that blocks the request. Any ideas?
Spaces has a lot of restrictions that we users don’t really understand. If it’s a configurable item, you can generally set it from the following.