I’m currently using an LLM to generate a synthetic dataset, a process that could take several weeks on a single GPU. I’m using the `.map` function to transform the existing dataset into the synthetic one. My question is about caching strategies that would let the process resume from the last checkpoint if it is interrupted or terminated unexpectedly.
Although I’ve tried the `cache_file_name` parameter of `.map`, it doesn’t resume from where it left off; it restarts from the beginning. Is there a more effective method or workaround for resumable processing, so that progress isn’t lost and computation time isn’t wasted?
Any advice or experiences shared would be greatly appreciated.
cc: @lhoestq