Hi,
I have this dataset:
https://huggingface.co/datasets/aborruso/openncup-focus-pnrr .
If I run this simple command
select * from read_parquet('https://huggingface.co/datasets/aborruso/openncup-focus-pnrr/blob/refs%2Fconvert%2Fparquet/aborruso--openncup-focus-pnrr/train/index.duckdb') limit 2;
I get this error:
Error: IO Error: HTTP GET error: Content-Length from server mismatches requested range, server may not support range requests.
What’s the right URL to access it using the DuckDB CLI and the httpfs extension?
Thank you
Hi! This doc page explains how to access the Parquet export of a dataset.
The dataset in question has a single Parquet file (for the train split):
https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr/parquet/aborruso--openncup-focus-pnrr/train/0.parquet
A quick test in the DuckDB CLI on my local machine works as expected.
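For reference, the same test can be sketched from Python. This is only a minimal sketch, assuming the `duckdb` Python package is installed and the machine has network access; the helper names `build_query` and `run_query` are hypothetical and not part of DuckDB. The key difference from the original command is that the URL points at the Parquet file itself rather than at a `blob` page for `index.duckdb`.

```python
# Parquet file URL taken from the reply above.
PARQUET_URL = (
    "https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr"
    "/parquet/aborruso--openncup-focus-pnrr/train/0.parquet"
)


def build_query(url: str, limit: int = 2) -> str:
    """Return a DuckDB query that reads `limit` rows from a remote Parquet file."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # Double any single quote so the URL stays a valid SQL string literal.
    escaped = url.replace("'", "''")
    return f"select * from read_parquet('{escaped}') limit {int(limit)};"


def run_query(url: str = PARQUET_URL, limit: int = 2):
    """Run the query through httpfs (assumes duckdb is installed and online)."""
    import duckdb  # imported lazily so build_query works without the package

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.execute(build_query(url, limit)).fetchall()
```

Calling `run_query()` should return the first two rows as a list of tuples; `build_query(PARQUET_URL)` alone gives a statement that can be pasted into the DuckDB CLI.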
Thank you very much. For the record, these are the steps I prefer:
open the dataset page and click on API;
copy the “List the Parquet files for this dataset” curl command;
run it to get the URL(s) of the dataset's Parquet file(s).
severo
August 18, 2023, 2:09pm
To be complete, a third way to get them is to edit the dataset URL:
https://huggingface.co/datasets/aborruso/openncup-focus-pnrr
by adding /api
at the start of the path, and /parquet
at the end:
https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr/parquet
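The URL edit described above can be sketched as a small Python helper. This is just an illustration of the string transformation, using only the standard library; the function name `parquet_api_url` is hypothetical, and the sketch assumes the input is a dataset page URL whose path starts with `/datasets/`.

```python
from urllib.parse import urlsplit, urlunsplit


def parquet_api_url(dataset_url: str) -> str:
    """Turn a dataset page URL into the URL that lists its Parquet files."""
    parts = urlsplit(dataset_url.strip())
    path = parts.path.rstrip("/")
    if not path.startswith("/datasets/"):
        raise ValueError(f"not a dataset page URL: {dataset_url!r}")
    # Add /api at the start of the path and /parquet at the end.
    return urlunsplit((parts.scheme, parts.netloc, "/api" + path + "/parquet", "", ""))


if __name__ == "__main__":
    print(parquet_api_url("https://huggingface.co/datasets/aborruso/openncup-focus-pnrr"))
```

The helper only rewrites the URL; it does not contact the server or check that the dataset exists.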
Really useful, thank you very much @severo