I’m currently using an LLM to generate a synthetic dataset, a process that could take several weeks on a single GPU. I’m using the `.map` function to transform the existing dataset into the synthetic one. My question is about caching strategies that would let the process resume from the last checkpoint if it is interrupted or terminated unexpectedly.
Although I’ve tried the `cache_file_name` parameter of `.map`, it doesn’t resume from where it left off; it restarts from the beginning. Is there a more effective method or workaround for resumable processing, so that progress isn’t lost and computation time isn’t wasted?
Any advice or experiences shared would be greatly appreciated.
cc: @lhoestq