I attempted to use my own dataset on Kaggle, which I created using save_to_disk and uploaded as a Kaggle dataset. I can load this dataset with load_from_disk('/kaggle/input/my_dataset_dir'), but I can't perform operations like train_test_split because of the read-only filesystem.
A completely read-only filesystem? So no possibility of any kind of disk-based caching in the whole environment? Not sure if it truly works, but logically this approach should:
# Disable datasets caching
from datasets import load_from_disk, set_caching_enabled
set_caching_enabled(False)

# Load from disk and keep the dataset fully in memory
dataset = load_from_disk('/kaggle/input/my_dataset_dir', keep_in_memory=True)
If you have to do extensive preparation/mapping and run into memory issues (besides doing the splitting), process/map the dataset to its final form and remove any columns you don't need BEFORE saving to disk and uploading to the read-only filesystem. If that doesn't work: split first, then save and upload, as in the sketch below.
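A rough sketch of what that could look like locally, before uploading; the source file, the columns ('text', 'label'), and the preprocessing step are just placeholders:

# Locally: map to the final form, drop unused columns, split, then save
from datasets import load_dataset

ds = load_dataset('csv', data_files='my_data.csv', split='train')   # placeholder source file
ds = ds.map(lambda ex: {'text': ex['text'].lower()})                 # placeholder preprocessing
ds = ds.remove_columns([c for c in ds.column_names if c not in ('text', 'label')])

splits = ds.train_test_split(test_size=0.1)
splits.save_to_disk('my_dataset_dir')  # upload this directory as the Kaggle dataset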
Using this format to store the dataset on a Kaggle drive was a bad idea. Loading is very slow. Using load_dataset() with data_files works much better.
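Presumably with the raw data files (e.g. CSV) uploaded instead of the Arrow directory; the paths here are hypothetical:

from datasets import load_dataset

# Point load_dataset at plain data files in the Kaggle input directory
dataset = load_dataset(
    'csv',
    data_files={
        'train': '/kaggle/input/my_dataset_dir/train.csv',  # hypothetical paths
        'test': '/kaggle/input/my_dataset_dir/test.csv',
    },
)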
You can control where the train/test row indices are saved on disk by passing the train_indices_cache_file_name and test_indices_cache_file_name arguments to train_test_split.
Feel free to choose a location where you have write access.
The same applies to other methods like map, filter, etc., where you can specify cache_file_name.
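For example, /kaggle/working is writable on Kaggle, so something along these lines should work (a sketch; my_function is a placeholder):

from datasets import load_from_disk

dataset = load_from_disk('/kaggle/input/my_dataset_dir')

# Redirect the split's index cache files to a writable location
splits = dataset.train_test_split(
    test_size=0.1,
    train_indices_cache_file_name='/kaggle/working/train_idx.arrow',
    test_indices_cache_file_name='/kaggle/working/test_idx.arrow',
)

# Same idea for map/filter
processed = splits['train'].map(
    my_function,  # placeholder preprocessing function
    cache_file_name='/kaggle/working/train_mapped.arrow',
)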