I attempted to use my own dataset on Kaggle, which I created using save_to_disk and uploaded as a Kaggle dataset. I can load this dataset with load_from_disk('/kaggle/input/my_dataset_dir'), but I can't perform operations like train_test_split because of the read-only filesystem.
A completely read-only filesystem? So no possibility of any kind of disk-based caching in the whole environment? Not sure if it truly works, but logically this approach should:
# Disable datasets caching
from datasets import load_from_disk, set_caching_enabled
set_caching_enabled(False)

# Load from disk and keep the dataset fully in memory
dataset = load_from_disk('/kaggle/input/my_dataset_dir', keep_in_memory=True)
If you have to do extensive preparation/mapping and run into memory issues (besides doing the splitting), process/map the dataset to its final form and remove any columns you don't need BEFORE saving to disk and uploading to the read-only filesystem. If that doesn't work: split first, then save and upload, as in the sketch below.
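A rough sketch of what that could look like locally, before uploading; the source file, the columns ('text', 'label'), and the preprocessing step are just placeholders:

# Locally: map to the final form, drop unused columns, split, then save
from datasets import load_dataset

ds = load_dataset('csv', data_files='my_data.csv', split='train')   # placeholder source file
ds = ds.map(lambda ex: {'text': ex['text'].lower()})                 # placeholder preprocessing
ds = ds.remove_columns([c for c in ds.column_names if c not in ('text', 'label')])

splits = ds.train_test_split(test_size=0.1)
splits.save_to_disk('my_dataset_dir')  # upload this directory as the Kaggle dataset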
Using this format to store the dataset on a Kaggle drive was a bad idea. Loading is very slow. Using load_dataset() with data_files works much better.
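Presumably with the raw data files (e.g. CSV) uploaded instead of the Arrow directory; the paths here are hypothetical:

from datasets import load_dataset

# Point load_dataset at plain data files in the Kaggle input directory
dataset = load_dataset(
    'csv',
    data_files={
        'train': '/kaggle/input/my_dataset_dir/train.csv',  # hypothetical paths
        'test': '/kaggle/input/my_dataset_dir/test.csv',
    },
)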
You can control where the train/test row indices are saved on disk by passing the train_indices_cache_file_name and test_indices_cache_file_name arguments to train_test_split.
Feel free to choose a location where you have write access.
The same applies to other methods like map, filter, etc., where you can specify cache_file_name.
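For example, /kaggle/working is writable on Kaggle, so something along these lines should work (a sketch; my_function is a placeholder):

from datasets import load_from_disk

dataset = load_from_disk('/kaggle/input/my_dataset_dir')

# Redirect the split's index cache files to a writable location
splits = dataset.train_test_split(
    test_size=0.1,
    train_indices_cache_file_name='/kaggle/working/train_idx.arrow',
    test_indices_cache_file_name='/kaggle/working/test_idx.arrow',
)

# Same idea for map/filter
processed = splits['train'].map(
    my_function,  # placeholder preprocessing function
    cache_file_name='/kaggle/working/train_mapped.arrow',
)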