I'm working with gaia-benchmark/GAIA · Datasets at Hugging Face. There's a column in this dataset for file_name and file_path. On the dataset viewer, these show up as URLs, for example:
This specific dataset is based on a loading script (GAIA.py · gaia-benchmark/GAIA at main), and the way it is implemented causes the paths to be either HTTP paths or local paths, depending on whether the dataset is loaded in streaming mode or not.
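To make the difference concrete, here is a toy model of that behavior. It is not the code from GAIA.py: the function name `resolve_file_path`, the `CACHE_DIR` value, and the URL are all made up for illustration. It only assumes that a streamed file keeps its remote URL while a downloaded file is referenced by its location in a local cache.

```python
import posixpath

# Hypothetical cache directory, used only for this illustration.
CACHE_DIR = "/tmp/example-cache"


def resolve_file_path(url: str, streaming: bool) -> str:
    """Toy model of the path a loading script might store for one file."""
    if streaming:
        # Streaming mode: nothing is downloaded, so the remote URL is kept.
        return url
    # Non-streaming mode: the file is assumed to be copied into a local
    # cache first, so the stored path points at that local copy.
    return posixpath.join(CACHE_DIR, posixpath.basename(url))


url = "https://example.com/data/validation/sample.pdf"  # placeholder URL
print(resolve_file_path(url, streaming=True))
print(resolve_file_path(url, streaming=False))
```

The same row therefore ends up with two different values in file_path depending on how the dataset was loaded.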
This is what causes the issue in the Viewer: we show the first page of the Viewer using streaming, while the other pages are read from the Parquet export (computed on machines that have the /storage/hf-datasets-cache directory).
Anyway, this behavior is specific to this dataset and its loading script, so we will probably not fix this issue on the HF side. Instead, it would be great to convert this dataset to a no-code dataset, which is the recommended way.
If you wish to run the script yourself, you can download the GAIA dataset repository and call load_dataset("path/to_my_local/GAIA")
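A minimal sketch of those two steps, under a few assumptions: it uses `snapshot_download` from `huggingface_hub` to fetch the repository, it assumes a version of `datasets` that still supports loading scripts (hence `trust_remote_code=True`), and it assumes you are already authenticated if the repository requires it. The helper name `load_gaia_from_local_copy` and the `config_name` parameter are mine, not part of either library.

```python
def load_gaia_from_local_copy(local_dir: str, config_name: str):
    """Download the GAIA repository to local_dir, then load it from disk.

    config_name is whichever configuration the loading script defines;
    check the dataset card for the available names.
    """
    # Imported lazily so this file can be imported without either package.
    from huggingface_hub import snapshot_download
    from datasets import load_dataset

    # Fetch every file of the dataset repository into local_dir.
    snapshot_download(
        repo_id="gaia-benchmark/GAIA",
        repo_type="dataset",
        local_dir=local_dir,
    )
    # Run the loading script from the local copy instead of the Hub.
    return load_dataset(local_dir, config_name, trust_remote_code=True)


# Example call (needs network access; not executed here):
# ds = load_gaia_from_local_copy("path/to_my_local/GAIA", "<config name>")
```

Loaded this way without streaming, the file_path values should point at files on your own machine.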
Usually datasets that can be loaded by the datasets library and shown in the Dataset Viewer are self-contained and don't contain paths to external files. They are generally in Parquet, JSON, CSV, or image/audio formats.
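As a rough illustration of what "self-contained" can look like, the sketch below writes a couple of rows to a JSON Lines file using only the standard library. The column names and values are placeholders rather than real GAIA data, and `write_jsonl` is a hypothetical helper. The point is that file_name holds a name relative to the repository instead of a machine-specific absolute path or URL.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Placeholder rows: file_name is relative, so it does not depend on
# where or how the dataset was loaded.
rows = [
    {"task_id": "example-1", "question": "Placeholder question?", "file_name": "example-1.pdf"},
    {"task_id": "example-2", "question": "Another placeholder?", "file_name": ""},
]

out_path = Path(tempfile.mkdtemp()) / "validation.jsonl"
write_jsonl(rows, out_path)

# Read the file back to confirm the rows round-trip unchanged.
read_back = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(read_back == rows)
```

A file like this can be uploaded to a dataset repository as-is, with no loading script involved.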