I have been running a map function and saving the cache to a specific file using cache_file_name.
I ran out of memory on the device halfway through the mapping process, after accumulating more than 150GB of cached files. Each cache file starts with ‘tmp’, e.g. tmp_46cq0rk.
Is there a way to recover? I have load_from_cache_file=True and cache_file_name is still pointed at the same cache folder. However, if I re-run, the mapping runs at the same speed as before, so I assume it is not using the previous cache files and is recomputing instead.
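For reference, the setup I mean looks roughly like this; the dataset, the mapped function and the paths below are placeholders, not my actual script:

```python
from datasets import load_dataset

# Placeholder dataset and function; the real job is much larger.
dataset = load_dataset("json", data_files="corpus.jsonl", split="train")

def process(batch):
    # expensive transformation whose results should be cached
    return batch

mapped = dataset.map(
    process,
    batched=True,
    cache_file_name="cache/mapped.arrow",  # Arrow cache written here
    load_from_cache_file=True,             # reuse the cache on later runs
)
```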
Is it possible to reuse instead of recompute, and is there anything else that would prevent this from happening again? I.e. is there anything more elegant than splitting the dataset multiple times, mapping each part, and then concatenating the mapped outputs back together?
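The split-and-concatenate workaround I have in mind is roughly this sketch (the shard count and file names are arbitrary placeholders):

```python
from datasets import concatenate_datasets

# Map the dataset in contiguous shards, each with its own cache file,
# so a crash only loses the shard currently being processed.
num_shards = 10
mapped_shards = []
for index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
    mapped_shards.append(
        shard.map(
            process,
            batched=True,
            cache_file_name=f"cache/mapped_shard_{index}.arrow",
            load_from_cache_file=True,  # completed shards are reused on re-runs
        )
    )

mapped = concatenate_datasets(mapped_shards)
```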
Hi! You can use Dataset.from_file to load a cached Arrow file and check its content. Then, if you have multiple cache files to combine, you can use concatenate_datasets.
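For example, something along these lines (the file name is just a placeholder):

```python
from datasets import Dataset

# Memory-map a completed Arrow cache file and inspect it.
ds = Dataset.from_file("cache/mapped.arrow")
print(ds.num_rows)
print(ds.features)
```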
Thank you @lhoestq for your quick response.
Unfortunately, I was unable to recover the tmp files with Dataset.from_file, as it gave me the error:
Tried reading schema message, was null or length 0
File "corpus.py", line 142, in <module>
dataset = Dataset.from_file('cache/tmp2c2hggi_')
pyarrow.lib.ArrowInvalid: Tried reading schema message, was null or length 0
It looks like the mapping process needs to complete before the Arrow schema is valid/not corrupt.
Nonetheless, I ran a test, and I could recover the cache after the mapping had completed: the file no longer had the “tmp…” name format and instead had the name I set in cache_file_name. You can then concatenate multiple of these cache files (e.g. when num_proc is greater than 1) into one dataset, which is good to know if something fails on a downstream task.
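A sketch of that recovery step; the glob pattern is only a guess at the naming, so check the actual file names in the cache folder first:

```python
from glob import glob
from datasets import Dataset, concatenate_datasets

# With num_proc > 1, map writes one cache file per process. The exact
# names depend on the cache_file_name that was passed, so adjust the
# pattern to whatever is actually on disk.
cache_files = sorted(glob("cache/mapped_*.arrow"))

parts = [Dataset.from_file(path) for path in cache_files]
recovered = concatenate_datasets(parts)
print(recovered)
```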