Restoring from a checkpoint when training on a large dataset with streaming

Hello!

I’m trying to find a reasonable solution for restoring LLM training when training on a large dataset in streaming mode. In a nutshell, what I need to do is skip to a batch given its index.

The only recommended solution I’ve found here so far is to just use `skip()` to get to the required batch. However, as far as I understand, that would require re-running the data pipeline for all previous batches. So if my training crashed after processing 100 GB of data, I’d need to re-download and re-process all 100 GB before I could continue training, which would be very time-consuming.
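For reference, this is the skip-based resume I mean (a minimal sketch; the dataset name and the `num_examples_seen` checkpoint field are just placeholders):

```python
from datasets import load_dataset

# Restored from my training checkpoint (placeholder value).
num_examples_seen = 1_000_000

# Stream the dataset and drop everything already processed.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
resumed = dataset.skip(num_examples_seen)

for example in resumed:
    # Continue training here. The problem: skip() still streams and
    # preprocesses every skipped example under the hood, so resuming after
    # 100 GB of data means re-downloading those 100 GB.
    ...
```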

For a sharded dataset, a more reasonable solution would be to manually iterate over the shards while streaming the data within each shard. In that case, when restoring from a checkpoint, I could just restore the current shard index and then run `skip()` within that shard, which should be much faster (see the sketch below). However, I cannot find any way to achieve this with the HF API.
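To make the desired behaviour concrete, the closest I can get is to build the file list myself and re-create the streaming dataset from the remaining shards. The file layout and the checkpoint fields (`shard_idx`, `examples_into_shard`) below are hypothetical, not an existing resume API:

```python
from datasets import load_dataset

# Hypothetical layout: one data file per shard.
shard_files = [f"data/shard-{i:05d}.jsonl" for i in range(1024)]

# Restored from my training checkpoint (placeholder values).
shard_idx = 42                # shard the run crashed in
examples_into_shard = 10_000  # examples already consumed from that shard

# Stream only from the current shard onward, then skip within that shard.
dataset = load_dataset(
    "json",
    data_files=shard_files[shard_idx:],
    split="train",
    streaming=True,
)
resumed = dataset.skip(examples_into_shard)

for example in resumed:
    # skip() now only re-reads part of a single shard instead of
    # everything processed before the crash.
    ...
```

This only works when I know and control the underlying data files myself; I don’t see how to do the equivalent for an arbitrary Hub dataset loaded in streaming mode.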

Is something like this currently possible? If so, how can I implement it, or something similar?

A similar question has been asked and answered here: Offer an alternative to Iterable Dataset that allows lazy loading and processing while skipping batches efficiently · Issue #5905 · huggingface/datasets · GitHub