Resuming .map Transform with Intermediate Caching in Dataset Generation

I’m using an LLM to generate a synthetic dataset, a process that could take several weeks on a single GPU. I use the .map function to transform the existing dataset into the synthetic one. My question is about caching strategies that would let the process resume from the last checkpoint if it is interrupted or terminated unexpectedly.

Although I’ve tried the cache_file_name parameter of .map, it doesn’t resume from where it left off; instead, it restarts from the beginning. Is there a more effective method or workaround for resumable processing, so that progress isn’t lost and computation time isn’t wasted?

Any advice or experiences shared would be greatly appreciated.

cc: @lhoestq

Hi ! The cache can reload results once they’re completely processed; it can’t reload and resume a partial map() call.

There’s no native resuming in datasets, though you can e.g. chunk your dataset and process it chunk by chunk.
