How to resolve file paths in a downloaded dataset?

I'm working with the gaia-benchmark/GAIA dataset on Hugging Face. The dataset has file_name and file_path columns. In the dataset viewer, the file paths show up as URLs, for example:

https://huggingface.co/datasets/gaia-benchmark/GAIA/resolve/main/2023/validation/32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx

However, once I've downloaded the dataset with load_dataset, these show up as local file paths. These file paths do not exist on my machine:

/storage/hf-datasets-cache/medium/datasets/60530074150638-config-parquet-and-info-gaia-benchmark-GAIA-6a4b2225/downloads/9962f64d6418e68fd02995f9f3b05a65dc562b07bf9dd2299beeef4e5801a411

How do I download and access these files? Do I have to do something outside of load_dataset?

This specific dataset is based on a loading script (GAIA.py in gaia-benchmark/GAIA at main), and the way it is implemented causes the paths to be either HTTP URLs or local paths, depending on whether the dataset is loaded in streaming mode or not.
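To check which kind of path a given row contains, something like the following could work. This is only a minimal sketch: is_remote_path is a hypothetical helper name, not part of the datasets library, and it assumes the remote form is always a plain http(s) URL like the one in the question.

```python
from urllib.parse import urlparse


def is_remote_path(path: str) -> bool:
    """Return True if the value looks like an http(s) URL rather than a local path.

    Hypothetical helper for illustration only.
    """
    return urlparse(path).scheme in ("http", "https")


# A URL such as the one shown in the Viewer counts as remote,
# while a path under /storage/hf-datasets-cache counts as local.
```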

This is what causes the mismatch in the Viewer: we show the first page using streaming, while the other pages are read from the Parquet export (computed on machines that have the /storage/hf-datasets-cache directory).

Anyway, this behavior is specific to this dataset and its loading script, so we will probably not fix it on the HF side. Instead, it would be great to convert this dataset to a no-code dataset, which is the recommended way.

If you wish to run the script yourself, you can download the GAIA dataset repository and call load_dataset("path/to_my_local/GAIA").
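As a rough sketch of that approach, the repository can be fetched with snapshot_download from huggingface_hub, and the attached files can then be located inside the local copy. Several things here are assumptions: the repository may be gated (so you may need to request access and log in first), the 2023/<split>/<file_name> layout is inferred from the URL in the question, and resolve_attachment is a hypothetical helper name. Depending on your datasets version, loading a script-based dataset may also require trust_remote_code=True or may not be supported at all.

```python
from pathlib import Path
from typing import Optional

REPO_ID = "gaia-benchmark/GAIA"  # dataset repository named in the question


def download_repo(repo_id: str = REPO_ID) -> Path:
    """Download the whole dataset repository and return the local folder."""
    # Imported lazily so the path helper below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    # Assumption: you already have access to the repository and are logged in.
    return Path(snapshot_download(repo_id=repo_id, repo_type="dataset"))


def resolve_attachment(
    repo_dir: Path, split: str, file_name: str, year: str = "2023"
) -> Optional[Path]:
    """Map a row's file_name to a path inside the local copy.

    Hypothetical helper; the <year>/<split>/<file_name> layout is assumed
    from the URL shown in the dataset viewer.
    """
    if not file_name:  # rows without an attachment have an empty file_name
        return None
    return Path(repo_dir) / year / split / file_name


# Example usage (needs network access, so it is left commented out):
# repo_dir = download_repo()
# path = resolve_attachment(
#     repo_dir, "validation", "32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx"
# )
# print(path, path.exists())
```

The local folder returned by download_repo is also what you would pass to load_dataset in place of "path/to_my_local/GAIA".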

Got it, thank you! Is there a dataset that does things the recommended way that I could have a look at?

Usually datasets that can be loaded by the datasets library and shown in the Dataset Viewer are self-contained and don't contain paths to external files. They are generally in Parquet, JSON, CSV, or image/audio formats.
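To give a feel for what "self-contained" means here, this is a minimal sketch that stores every field directly in a JSON Lines file and loads it with the generic "json" builder of the datasets library. The column names and the rows_to_jsonl / load_self_contained helpers are made up for illustration and are not taken from GAIA.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def rows_to_jsonl(rows: Iterable[Mapping]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def load_self_contained(path: Path):
    """Load a JSON Lines file without any loading script."""
    # Imported lazily so rows_to_jsonl can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=str(path), split="train")


# Example usage (left commented out because it writes a file):
# rows = [{"question": "What is 2 + 2?", "answer": "4"}]  # illustrative rows
# Path("data.jsonl").write_text(rows_to_jsonl(rows), encoding="utf-8")
# ds = load_self_contained(Path("data.jsonl"))
```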