Load_from_disk and read-only filesystem

Hello,

I attempted to use my own dataset on Kaggle, which I created using save_to_disk and uploaded as a Kaggle dataset. I can load this dataset with load_from_disk('/kaggle/input/my_dataset_dir'), but I can’t perform operations like ‘train_test_split’ because of the read-only filesystem:

OSError: [Errno 30] Read-only file system: '/kaggle/input/my_dataset_dir/tmp5x3dj6no'

When I try loading it with load_dataset(streaming=True), I encounter an error:

ValueError: You are trying to load a dataset that was saved using `save_to_disk`. Please use `load_from_disk` instead.

What should I do to use my datasets on a read-only filesystem?

A completely read-only filesystem? So there is no possibility of any disk-based caching anywhere in the environment? I'm not sure it actually works, but logically this approach should:

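# Disable datasets caching
# (disable_caching() replaces set_caching_enabled(False), which newer releases of datasets no longer provide)
from datasets import disable_caching, load_from_disk
disable_caching()

# Load from disk and keep fully in memory
dataset = load_from_disk(path, keep_in_memory=True)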

If you need extensive preparation or mapping (besides the split) and run into memory issues, process/map the dataset into its final, usable form and remove any columns that are not required BEFORE saving to disk and uploading it to the read-only filesystem.

If that doesn't work, split first, then save and upload.

The entire filesystem is in read-write mode, while the datasets directory at /kaggle/input/*, which is likely mounted via NFS, is in read-only mode.

Try it with caching disabled and keep_in_memory=True. I can’t think of any other option besides loading an already fully processed/mapped dataset.

It was a bad idea to use this format for storing the dataset on a Kaggle drive. Loading is very slow. Using load_dataset() with data_files works much better.

You can specify where the train/test row indices are saved on disk by specifying the train_indices_cache_file_name and test_indices_cache_file_name arguments to train_test_split.
Feel free to choose a location where you have write access.

The same applies to other methods such as map and filter, where you can specify cache_file_name.