I am following the tutorial "Fine-tuning a Code LLM on Custom Code on a single GPU" from the Hugging Face Open-Source AI Cookbook. I am using the provided code, but with a different code dataset.
Below is the code for my dataset:
from datasets import load_dataset

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=True
).filter(lambda example: example["language"] == "Python")

def form_template(example):
    """Forms a template for each example following the alpaca format for CommitPack"""
    example["content"] = (
        "### Human: " + example["instruction"] + " " + example["input"] + " ### Assistant: " + example["output"]
    )
    return example

dataset = commitpackft.map(
    form_template,
    remove_columns=["id", "language", "license", "instruction", "input", "output"],
).shuffle(
    seed=42, buffer_size=10000
)  # remove everything since it's all inside "content" now

validation_data = dataset.take(4000)
train_data = dataset.skip(4000)
I am following the tutorial as is, except that I am omitting the FIM transforms because my training examples are already formatted with the Alpaca prompt. My custom iterator's __iter__() is as follows (largely copied from the tutorial):
def __iter__(self):
    iterator = iter(self.dataset)
    more_examples = True
    while more_examples:
        buffer, buffer_len = [], 0
        while True:
            if buffer_len >= self.max_buffer_size:
                break
            try:
                buffer.append(next(iterator)[self.content_field])
                buffer_len += len(buffer[-1])
            except StopIteration:
                if self.infinite:
                    iterator = iter(self.dataset)
                else:
                    more_examples = False
                    break
        tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
        all_token_ids = []
        for tokenized_input in tokenized_inputs:
            all_token_ids.extend(tokenized_input + [self.concat_token_id])
        examples = []
        for i in range(0, len(all_token_ids), self.seq_length):
            input_ids = all_token_ids[i : i + self.seq_length]
            if len(input_ids) == self.seq_length:
                examples.append(input_ids)
        random.shuffle(examples)
        for example in examples:
            self.current_size += 1
            yield {
                "input_ids": torch.LongTensor(example),
                "labels": torch.LongTensor(example),
            }
However, I cannot for the life of me get past the 2nd evaluation:
return super()._batch_encode_plus(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/anaconda3/envs/finetune/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 544, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
10%|█ | 200/2000 [12:20<1:51:06, 3.70s/it]
Training itself runs fine, and the first round of evaluation seems to pass as well.
When I lower eval_steps so that evaluation runs every 5 steps, I see that iterator = iter(self.dataset) sometimes seems to return an iterator that yields nothing when next(iterator) is called. The inner loop then runs forever because the buffer never reaches its maximum length. What I don't understand is how 4000 examples could possibly be exhausted within the very first few evaluation passes. What am I doing wrong here, or is there behavior under the hood of the HF IterableDataset that I'm not understanding? It's really hard to debug because of all the abstractions in the Trainer, because the HF IterableDataset shares its name with torch.utils.data.IterableDataset, and because there are some unused variables / imports in the code that I'm not sure should be unused (like self.current_size).
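To check my reading of the loop, here is a minimal sketch of the buffer-filling step on its own, for the non-infinite case. It uses a plain Python list as a hypothetical stand-in for the streaming validation split, and fill_buffer is a name I made up for this illustration. It is not the Trainer's or the tutorial's actual code path, only the same control flow as the inner loop above.

```python
def fill_buffer(iterator, max_buffer_size, content_field="content"):
    """Mimic the inner `while True` loop of __iter__ when infinite=False.

    `fill_buffer` is a hypothetical helper written for this sketch only.
    Returns the collected strings and whether the iterator ran dry.
    """
    buffer, buffer_len = [], 0
    exhausted = False
    while buffer_len < max_buffer_size:
        try:
            buffer.append(next(iterator)[content_field])
            buffer_len += len(buffer[-1])
        except StopIteration:
            exhausted = True
            break
    return buffer, exhausted


# Stand-in data (an assumption): 5 tiny examples, 50 characters in total,
# far below the character budget of a single buffer.
examples = [{"content": "x" * 10} for _ in range(5)]
iterator = iter(examples)

# First fill: the whole "dataset" fits in one buffer and the iterator ends.
first, first_exhausted = fill_buffer(iterator, max_buffer_size=1000)
# Second fill on the same, already exhausted iterator: nothing is collected.
second, second_exhausted = fill_buffer(iterator, max_buffer_size=1000)

print(len(first), first_exhausted)    # 5 True
print(len(second), second_exhausted)  # 0 True
```

If the real iterator ever behaves like the second call, the buffer passed to self.tokenizer would be an empty list. I suspect, but have not confirmed, that this is what produces the IndexError at tokens_and_encodings[0][0]. With infinite=True the same situation would instead re-create the iterator over and over without ever filling the buffer.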
My hyperparameters and model (starcoderbase-1b) follow the tutorial pretty much exactly, except that my batch size is 16.
When eval_steps is 5, I also get this error:
File ~/anaconda3/envs/finetune/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:544, in PreTrainedTokenizerFast._batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
537 # Convert the output to have dict[list] from list[dict] and remove the additional overflows dimension
538 # From (variable) shape (batch, overflows, sequence length) to ~ (batch * overflows, sequence length)
539 # (we say ~ because the number of overflow varies with the example in the batch)
540 #
541 # To match each overflowing sample with the original sample in the batch
542 # we add an overflow_to_sample_mapping array (see below)
543 sanitized_tokens = {}
--> 544 for key in tokens_and_encodings[0][0].keys():
545 stack = [e for item, _ in tokens_and_encodings for e in item[key]]
546 sanitized_tokens[key] = stack
IndexError: list index out of range
Here is my code if you want to replicate it: pastebin.com/sPkS8d4h