I have been running a map function and saving the cache to a specific file using cache_file_name.
I ran out of memory on the device halfway through the mapping process, after accumulating more than 150GB of cached files. Each cache file starts with ‘tmp’, e.g. tmp_46cq0rk.
Is there a way to recover? I have load_from_cache_file=True and cache_file_name is still pointed at the same cache folder. However, if I re-run, the mapping runs at the same speed as before, so I assume it is not using the previous cache files and is recomputing instead.
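For reference, the setup I mean looks roughly like this; the dataset, the mapped function and the paths below are placeholders, not my actual script:

```python
from datasets import load_dataset

# Placeholder dataset and function; the real job is much larger.
dataset = load_dataset("json", data_files="corpus.jsonl", split="train")

def process(batch):
    # expensive transformation whose results should be cached
    return batch

mapped = dataset.map(
    process,
    batched=True,
    cache_file_name="cache/mapped.arrow",  # Arrow cache written here
    load_from_cache_file=True,             # reuse the cache on later runs
)
```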
Is it possible to reuse instead of recompute, and is there anything else that would prevent this from happening again? I.e. is there anything more elegant than splitting the dataset multiple times, mapping each part, and then concatenating the mapped outputs back together?
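The split-and-concatenate workaround I have in mind is roughly this sketch (the shard count and file names are arbitrary placeholders):

```python
from datasets import concatenate_datasets

# Map the dataset in contiguous shards, each with its own cache file,
# so a crash only loses the shard currently being processed.
num_shards = 10
mapped_shards = []
for index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
    mapped_shards.append(
        shard.map(
            process,
            batched=True,
            cache_file_name=f"cache/mapped_shard_{index}.arrow",
            load_from_cache_file=True,  # completed shards are reused on re-runs
        )
    )

mapped = concatenate_datasets(mapped_shards)
```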
Hi! You can use Dataset.from_file to load a cached Arrow file and check its content. Then, if you have multiple cache files to combine, you can use concatenate_datasets.
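For example, something along these lines (the file name is just a placeholder):

```python
from datasets import Dataset

# Memory-map a completed Arrow cache file and inspect it.
ds = Dataset.from_file("cache/mapped.arrow")
print(ds.num_rows)
print(ds.features)
```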
Thank you @lhoestq for your quick response.
Unfortunately, I was unable to recover the tmp files with Dataset.from_file, as it gave me the error:
Tried reading schema message, was null or length 0
File "corpus.py", line 142, in <module>
dataset = Dataset.from_file('cache/tmp2c2hggi_')
pyarrow.lib.ArrowInvalid: Tried reading schema message, was null or length 0
It looks like the mapping process needs to complete before the Arrow schema is valid/not corrupt.
Nonetheless, I ran a test, and I could recover the cache after the mapping had completed: the file no longer had the “tmp…” name format and instead had the name I set in cache_file_name. You can then concatenate multiple of these cache files (e.g. when num_proc is greater than 1) into one dataset, which is good to know if something fails on a downstream task.
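A sketch of that recovery step; the glob pattern is only a guess at the naming, so check the actual file names in the cache folder first:

```python
from glob import glob
from datasets import Dataset, concatenate_datasets

# With num_proc > 1, map writes one cache file per process. The exact
# names depend on the cache_file_name that was passed, so adjust the
# pattern to whatever is actually on disk.
cache_files = sorted(glob("cache/mapped_*.arrow"))

parts = [Dataset.from_file(path) for path in cache_files]
recovered = concatenate_datasets(parts)
print(recovered)
```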