Problem with custom iterator of streaming dataset not returning anything

johnwee1 · June 28, 2024, 8:41am

I am following the tutorial "Fine-tuning a Code LLM on Custom Code on a single GPU" from the Hugging Face Open-Source AI Cookbook. I am using the provided code, but with a different code dataset.

Below is the code for my dataset:

from datasets import load_dataset

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=True
).filter(lambda example: example["language"] == "Python")


def form_template(example):
    """Forms a template for each example following the alpaca format for CommitPack"""
    example["content"] = (
        "### Human: " + example["instruction"] + " " + example["input"] + " ### Assistant: " + example["output"]
    )
    return example


dataset = commitpackft.map(
    form_template,
    # remove everything else, since it is all inside "content" now
    remove_columns=["id", "language", "license", "instruction", "input", "output"],
).shuffle(seed=42, buffer_size=10000)
validation_data = dataset.take(4000)
train_data = dataset.skip(4000)

I am following the tutorial as is, except that my training examples are already formatted with the Alpaca prompt, so I omit the FIM transforms. My custom iterator's __iter__() is as follows (largely copied from the tutorial):

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.max_buffer_size:
                    break
                try:
                    buffer.append(next(iterator)[self.content_field])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                    else:
                        more_examples = False
                        break
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []

            for tokenized_input in tokenized_inputs:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            examples = []
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)
            random.shuffle(examples)
            for example in examples:
                self.current_size += 1
                yield {
                    "input_ids": torch.LongTensor(example),
                    "labels": torch.LongTensor(example),
                }

However, I cannot for the life of me get past the second evaluation:

    return super()._batch_encode_plus(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/anaconda3/envs/finetune/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 544, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
               ~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
 10%|█         | 200/2000 [12:20<1:51:06,  3.70s/it]

Training itself runs fine, and the first round of evaluation appears to pass.

Upon investigating, when I lower eval_steps so that evaluation runs every 5 steps, the iterator created by iterator = iter(self.dataset) sometimes seems to yield nothing when next(iterator) is called, and the code loops forever because the buffer never reaches max_buffer_size. I don't understand how 4000 examples could be exhausted within the first few evaluation passes. What am I doing wrong here, or is there behavior under the hood of the HF IterableDataset that I'm not understanding?

It's really hard to debug because of all the layers of abstraction in the Trainer, on top of the HF IterableDataset sharing its name with PyTorch's torch.utils.data.IterableDataset, and because the code has some unused variables and imports that I'm not sure are meant to be unused (like self.current_size).

My hyperparameters and model (starcoderbase-1b) follow the tutorial almost exactly, except that my batch size is 16.

When eval_steps is 5, I also get this error:

File ~/anaconda3/envs/finetune/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:544, in PreTrainedTokenizerFast._batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    537 # Convert the output to have dict[list] from list[dict] and remove the additional overflows dimension
    538 # From (variable) shape (batch, overflows, sequence length) to ~ (batch * overflows, sequence length)
    539 # (we say ~ because the number of overflow varies with the example in the batch)
    540 #
    541 # To match each overflowing sample with the original sample in the batch
    542 # we add an overflow_to_sample_mapping array (see below)
    543 sanitized_tokens = {}
--> 544 for key in tokens_and_encodings[0][0].keys():
    545     stack = [e for item, _ in tokens_and_encodings for e in item[key]]
    546     sanitized_tokens[key] = stack

IndexError: list index out of range

Here is my code if you want to reproduce it: pastebin.com/sPkS8d4h
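
One way to test the empty-buffer hypothesis outside the Trainer: the failing line in the traceback indexes tokens_and_encodings[0], which is consistent with the tokenizer being handed an empty list. The sketch below is a simplified, hypothetical variant of the buffer-filling loop, written in plain Python with no tokenizer or streaming dataset involved. The helper names fill_buffer and packed_buffers are assumptions made for illustration and are not part of the tutorial. Unlike the loop above, it yields a partial buffer when the stream ends instead of topping it up after a restart, and it raises if a freshly created iterator produces nothing.

```python
def fill_buffer(iterator, content_field, max_buffer_size):
    """Collect text until max_buffer_size characters are reached or the
    iterator runs out. Returns (buffer, exhausted)."""
    buffer, buffer_len = [], 0
    while buffer_len < max_buffer_size:
        try:
            text = next(iterator)[content_field]
        except StopIteration:
            return buffer, True
        buffer.append(text)
        buffer_len += len(text)
    return buffer, False


def packed_buffers(dataset, content_field="content", max_buffer_size=20, infinite=False):
    """Yield only non-empty buffers, so an empty list never reaches the tokenizer."""
    iterator, fresh = iter(dataset), True
    while True:
        buffer, exhausted = fill_buffer(iterator, content_field, max_buffer_size)
        if not buffer and fresh:
            # A brand-new iterator produced nothing: fail loudly instead of looping forever.
            raise ValueError("dataset yielded no examples")
        if buffer:
            yield buffer
        if not exhausted:
            fresh = False
            continue
        if not infinite:
            return
        iterator, fresh = iter(dataset), True  # restart for the next pass


# Toy stand-in for the streamed examples (hypothetical data).
toy = [{"content": "a" * 10}, {"content": "b" * 10}, {"content": "cc"}]
for buf in packed_buffers(toy, max_buffer_size=20):
    print(len(buf), buf)
```

Swapping the toy list for validation_data and iterating it twice in a row would show whether the second pass over the streamed split really comes back empty.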
