DuckDB CLI: what's the URL to access a dataset?

Hi,
I have this dataset:
https://huggingface.co/datasets/aborruso/openncup-focus-pnrr.

If I run this simple command

select * from read_parquet('https://huggingface.co/datasets/aborruso/openncup-focus-pnrr/blob/refs%2Fconvert%2Fparquet/aborruso--openncup-focus-pnrr/train/index.duckdb') limit 2;

I get this error:

Error: IO Error: HTTP GET error: Content-Length from server mismatches requested range, server may not support range requests.

What’s the right URL to access it, using the DuckDB CLI and the httpfs extension?

Thank you

Hi! This doc page explains how to access the Parquet export of a dataset.

The dataset in question has a single Parquet file (for the train split): https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr/parquet/aborruso--openncup-focus-pnrr/train/0.parquet (a quick test in DuckDB CLI on my local machine works as expected)
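For anyone who wants to repeat that quick test from a script instead of the CLI, here is a minimal sketch using the duckdb Python package. The helper names (build_preview_query, preview_rows) are made up for illustration, and it assumes the httpfs extension can be installed and that the machine has network access. The same SELECT statement can be pasted directly into the DuckDB CLI.

```python
def build_preview_query(parquet_url: str, limit: int = 2) -> str:
    """Build a DuckDB SELECT that reads the first rows of a remote Parquet file."""
    # Double any single quote so the URL stays a valid SQL string literal.
    escaped = parquet_url.replace("'", "''")
    return f"SELECT * FROM read_parquet('{escaped}') LIMIT {int(limit)};"


def preview_rows(parquet_url: str, limit: int = 2):
    """Run the query with the duckdb package (hypothetical helper, needs network)."""
    import duckdb  # imported here so building the query works without the package

    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL httpfs;")  # extension for reading over HTTP(S)
    con.execute("LOAD httpfs;")
    return con.execute(build_preview_query(parquet_url, limit)).fetchall()


# Parquet URL given in the reply above.
PARQUET_URL = (
    "https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr"
    "/parquet/aborruso--openncup-focus-pnrr/train/0.parquet"
)
print(build_preview_query(PARQUET_URL))
```

Calling preview_rows(PARQUET_URL) should return the first two rows as tuples, provided the file is still served at that address.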


Thank you very much. I'll note here the steps I prefer:

  • open the dataset page and click on API;
  • copy the “List the Parquet files for this dataset” curl command;
  • run it to get the URL(s) of your Parquet dataset file(s).
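The last step can also be scripted. The sketch below fetches a JSON listing and pulls out every string that looks like a Parquet URL. It deliberately does not assume a particular response layout, since the exact JSON shape is not shown in this thread; the function names are hypothetical and the real request needs network access.

```python
import json
from urllib.request import urlopen


def collect_parquet_urls(payload) -> list:
    """Walk a decoded JSON value and return every string that looks like a Parquet URL."""
    found = []
    if isinstance(payload, str):
        if payload.startswith("http") and payload.endswith(".parquet"):
            found.append(payload)
    elif isinstance(payload, dict):
        for value in payload.values():
            found.extend(collect_parquet_urls(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(collect_parquet_urls(item))
    return found


def list_parquet_files(api_url: str) -> list:
    """Fetch the JSON listing (hypothetical helper, needs network) and extract the URLs."""
    with urlopen(api_url) as response:
        return collect_parquet_urls(json.load(response))
```

Pass list_parquet_files the URL that appears in the copied curl command; each returned URL can then be given to read_parquet in DuckDB.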

For completeness, a third way to get them is to edit the dataset URL:

https://huggingface.co/datasets/aborruso/openncup-focus-pnrr

by adding /api before /datasets and /parquet at the end:

https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr/parquet
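The same URL edit can be expressed as a small helper, which is handy when working with several datasets. This is only a sketch: parquet_api_url is a hypothetical name, and it assumes the dataset page URL always has a path starting with /datasets/.

```python
from urllib.parse import urlsplit, urlunsplit


def parquet_api_url(dataset_url: str) -> str:
    """Turn a dataset page URL into the URL that lists its Parquet files."""
    parts = urlsplit(dataset_url)
    path = parts.path.rstrip("/")  # tolerate a trailing slash
    if not path.startswith("/datasets/"):
        raise ValueError(f"not a dataset page URL: {dataset_url}")
    # Add /api before /datasets and /parquet at the end of the path.
    new_path = "/api" + path + "/parquet"
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


print(parquet_api_url("https://huggingface.co/datasets/aborruso/openncup-focus-pnrr"))
# prints https://huggingface.co/api/datasets/aborruso/openncup-focus-pnrr/parquet
```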


Really useful, thank you very much @severo
