How to resolve file paths in a downloaded dataset?

codezakh · March 9, 2024, 1:39am

I’m working with the gaia-benchmark/GAIA dataset on Hugging Face. This dataset has file_name and file_path columns. In the dataset viewer, the paths show up as URLs, for example:

https://huggingface.co/datasets/gaia-benchmark/GAIA/resolve/main/2023/validation/32102e3e-d12a-4209-9163-7b3a104efe5d.xlsx

However, once I’ve downloaded the dataset with load_dataset, these show up as file paths. These file paths do not exist on my machine:

/storage/hf-datasets-cache/medium/datasets/60530074150638-config-parquet-and-info-gaia-benchmark-GAIA-6a4b2225/downloads/9962f64d6418e68fd02995f9f3b05a65dc562b07bf9dd2299beeef4e5801a411

How do I download and access these files? Do I have to do something outside of load_dataset?

lhoestq · March 18, 2024, 11:23am

This specific dataset is based on a loading script (GAIA.py · gaia-benchmark/GAIA at main) and the way it is implemented causes the paths to be either HTTP paths or local paths based on whether the dataset is loaded in streaming mode or not.
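Since the file_path values depend on how the dataset was loaded, one workaround is to fetch an attachment directly by its path inside the repository. This is only a minimal sketch: it assumes the files live under 2023/<split>/<file_name>, as the viewer URL in the question suggests, and that you are logged in if the repository requires it. The helper names attachment_repo_path and fetch_attachment are made up for illustration.

```python
from typing import Optional

REPO_ID = "gaia-benchmark/GAIA"


def attachment_repo_path(
    file_name: str, split: str = "validation", year: str = "2023"
) -> Optional[str]:
    """Return the in-repo path of an attachment, or None if the row has no file.

    Assumption: the repo layout is <year>/<split>/<file_name>, inferred from
    the viewer URL; it is not confirmed by the loading script here.
    """
    if not file_name:
        return None
    return f"{year}/{split}/{file_name}"


def fetch_attachment(file_name: str, split: str = "validation") -> Optional[str]:
    """Download one attachment and return its path in the local HF cache."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    repo_path = attachment_repo_path(file_name, split)
    if repo_path is None:
        return None
    return hf_hub_download(repo_id=REPO_ID, filename=repo_path, repo_type="dataset")
```

Calling fetch_attachment(row["file_name"]) for a row would then return a path that exists on your own machine, regardless of what file_path contains.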

This causes this issue in the Viewer because we show the first page of the Viewer using streaming, while the other pages are read from the parquet export (computed on machines that have the /storage/hf-datasets-cache directory).

Anyway, this behavior is specific to this dataset and its loading script, so we will probably not fix this issue on the HF side. Instead, it would be great to convert this dataset to a no-code dataset, which is the recommended way.

If you wish to run the script yourself, you can download the GAIA dataset repository and call load_dataset("path/to_my_local/GAIA").
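The suggestion above could look roughly like the following. Treat it as a sketch under assumptions rather than a confirmed recipe: the config name "2023_all" and the need for trust_remote_code=True are assumptions (some datasets versions require that flag to run a loading script, and newer versions may not support loading scripts at all), and download_and_load and missing_attachments are hypothetical helper names.

```python
from pathlib import Path


def download_and_load(local_dir: str = "GAIA", config: str = "2023_all"):
    """Download the whole dataset repo, then run its loading script locally."""
    # Imported lazily so the checker below works without these packages.
    from huggingface_hub import snapshot_download
    from datasets import load_dataset

    repo_dir = snapshot_download(
        repo_id="gaia-benchmark/GAIA",
        repo_type="dataset",
        local_dir=local_dir,
    )
    # Assumption: the config name and trust_remote_code depend on the
    # dataset script and on your installed datasets version.
    return load_dataset(repo_dir, config, trust_remote_code=True)


def missing_attachments(rows):
    """Return file_path values that are set but do not exist on this machine."""
    return [
        row["file_path"]
        for row in rows
        if row.get("file_path") and not Path(row["file_path"]).exists()
    ]
```

After loading, something like missing_attachments(ds["validation"]) gives a quick check of whether the file paths now point at files on your machine.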

codezakh · March 18, 2024, 3:14pm

Got it, thank you! Is there a dataset that does things in the recommended way that I could have a look at?

system · March 19, 2024, 3:14am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.

lhoestq · March 20, 2024, 4:42pm

Usually, datasets that can be loaded by the datasets library and shown in the Dataset Viewer are self-contained and don’t contain paths to external files. They are generally in Parquet, JSON, CSV, or image/audio formats.
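To illustrate what "self-contained" can mean in practice, here is a small sketch that writes rows to a JSON Lines file and loads it with the generic json builder, with no loading script involved. The function names and the example file name are made up for illustration; this is not how GAIA itself is packaged.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write rows (dicts) as JSON Lines so every field lives in the data file."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def load_jsonl_dataset(path):
    """Load the file with the generic 'json' builder of the datasets library."""
    # Imported lazily so write_jsonl works without datasets installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=str(path), split="train")
```

A file written this way (for example write_jsonl(rows, "train.jsonl")) can be loaded on any machine because the rows carry their own content instead of pointing to paths elsewhere.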
