Would it be possible to implement an IterableDataset with streaming and fast resume (no need to skip batches)?

Training an LLM on a big dataset can take a considerable amount of time, depending on the project. However, various kinds of interruptions may occur during training, after which it needs to be resumed.

When using an iterable dataset with streaming (which I suspect most people do), resuming makes Hugging Face run over the already-seen batches (i.e. skip them) until it reaches the right place to continue training. Running over those batches can take a lot of time.
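For illustration, here is roughly what skip-based resume looks like today with a streaming dataset (the dataset name and the number of consumed examples are placeholders, not from my actual setup):

from datasets import load_dataset

# Load a streaming dataset; examples are fetched on the fly, not downloaded upfront.
ds = load_dataset("c4", "en", split="train", streaming=True)

# Resuming after an interruption: skip the examples already consumed.
# .skip() still has to iterate over (and download) everything it skips,
# which is what makes resuming slow on large datasets.
num_examples_seen = 1_000_000  # placeholder: restored from a training checkpoint
ds_resumed = ds.skip(num_examples_seen)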

I wonder whether it would be possible to implement an iterable dataset with fast resume, at least when the training data consists of text files.

Consider the code below: one can jump (via the file.seek function) to the appropriate place in the file (where it stopped) instead of iterating from the beginning until reaching that place.

from typing import Any, Dict


class DatasetWithFastResume:

    def __init__(self, file_path: str):
        self._file_path = file_path
        self._f = open(file_path, "r")
        self._line_number = 0

    def __iter__(self):
        return self

    def __next__(self) -> str:
        if self._f.closed:
            raise StopIteration
        line = self._f.readline()
        if not line:  # readline() returns "" only at EOF
            self._f.close()
            raise StopIteration
        self._line_number += 1
        return line.rstrip("\n")  # exclude the newline character

    def get_read_position(self) -> int:
        """
        Returns the file object's current position in the file,
        as a number of bytes from the beginning of the file.
        """
        return self._f.tell()

    def set_read_position(self, position: int):
        self._f.seek(position, 0)

    def get_state(self) -> Dict[str, Any]:
        return {
            "position": self.get_read_position(),
            "line_number": self._line_number,
        }

    def load_state(self, state: Dict[str, Any]):
        if self._f.closed:  # allow resuming even after the file was exhausted
            self._f = open(self._file_path, "r")
        self.set_read_position(state["position"])
        self._line_number = state["line_number"]
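A minimal usage sketch (the file names and checkpoint interval are placeholders): iterate, periodically save the reader state alongside the model checkpoint, and on restart seek straight back to the saved byte offset instead of skipping:

import json

ds = DatasetWithFastResume("train.txt")
for i, line in enumerate(ds):
    ...  # tokenize / train on `line`
    if i % 10_000 == 0:  # checkpoint periodically
        with open("data_state.json", "w") as f:
            json.dump(ds.get_state(), f)

# After an interruption: resume instantly via seek, no batch skipping.
ds = DatasetWithFastResume("train.txt")
with open("data_state.json") as f:
    ds.load_state(json.load(f))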

In the most general case, what is suggested here may not guarantee that resumed training is equivalent to training without any interruption (for example, if shuffling depends on global iteration state). However, many applications don't need a 100% exact resume, so I think this option should be available: it is not worth spending the time waiting for the DataLoader to skip batches.

Hi! This is not implemented yet, but it would be awesome to have indeed.

Here is a related discussion: Save and resume the state of a DataLoader · Issue #5454 · huggingface/datasets · GitHub

I recently opened a draft PR that implements state_dict() for IterableDataset and enables resuming: [Resumable IterableDataset] Add IterableDataset state_dict by lhoestq · Pull Request #6658 · huggingface/datasets · GitHub

It's WIP/untested and relies on skipping shards and batches (no .seek()), but it should be a good starting point.
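Based on the naming in that draft PR, the API would presumably follow PyTorch's state_dict()/load_state_dict() convention, along these lines (a sketch only; the exact API may change since the PR is a draft):

from datasets import load_dataset

ds = load_dataset("c4", "en", split="train", streaming=True)

for i, example in enumerate(ds):
    ...  # train on `example`
    if i == 42:
        state = ds.state_dict()  # capture which shard/example we are at
        break

# On resume: recreate the dataset and restore its state; iteration
# restarts from the saved shard, skipping only within that shard.
ds = load_dataset("c4", "en", split="train", streaming=True)
ds.load_state_dict(state)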
