Fine-tuning throws "index out of range in self"

I am totally new to ML and learning as I go for a work project, where we are attempting to fine-tune a pretrained LLM using the company’s data, which consists of magazine articles, podcast transcripts, and discussion threads. Our goal is to create a useful, custom chatbot for our online community.

It is my understanding that the Hugging Face datasets load_dataset function can handle fairly unstructured plain text, rather than requiring the text to be structured as a JSON object or JSONL file; however, when I pass in data of this type, I get the generic error "index out of range in self".

Below is a reduced version of the code. It runs without issue until the trainer.train() line executes, which throws the error fairly quickly, after about 10 seconds.

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModel, AutoTokenizer, IntervalStrategy,
                          Trainer, TrainingArguments)

base_model = "tiiuae/falcon-7b"  # I have tried numerous models, like mpt-7b, distilbert-base-uncased, and moe, but always get the same error.
number_of_threads = 4

tokenizer = AutoTokenizer.from_pretrained(base_model, cache_dir=hugging_face_cache_dir)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': padding_token})

train_dataset = load_dataset('text', data_files={'train': '/path/to/my/train/files',
    'test': '/path/to/my/test/files'},
    cache_dir=hugging_face_cache_dir, sample_by="paragraph")
tokenized_train_dataset = train_dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length",
                               truncation=True, return_tensors="np"),
    batched=True, num_proc=number_of_threads)

val_dataset = load_dataset('text', data_files={'validation': val_split_filename},
    cache_dir=hugging_face_cache_dir, sample_by="paragraph")
tokenized_val_dataset = val_dataset.map(
    lambda examples: tokenizer(examples["text"], padding="max_length",
                               truncation=True, return_tensors="np"),
    batched=True, num_proc=number_of_threads)

train_dataset = tokenized_train_dataset['train'].shuffle(seed=42)
eval_dataset = tokenized_val_dataset['validation']
model = AutoModel.from_pretrained(base_model,
    trust_remote_code=True,
    cache_dir=hugging_face_cache_dir)
training_args = TrainingArguments(
    output_dir=FileMgr.checkpoint_batch_dir,
    evaluation_strategy=IntervalStrategy.EPOCH,
    save_strategy=IntervalStrategy.EPOCH,
    num_train_epochs=3,
    save_total_limit=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir=FileMgr.checkpoint_batch_dir,
    eval_steps=500,
    load_best_model_at_end=True,
    save_steps=500,
    remove_unused_columns=True
)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

Here is an example of what our .txt file content looks like:

Some data on the first line.
Some data on the second line.
And this continues on and on.
We have tried putting entire magazine articles on this line, replacing newlines with [SEP].
We've also tried ensuring lines don't exceed the max seq length of a model, as explained below.

It may be worth noting that my cache directory points to a location off of the C: drive (this is Windows), but I am running PyCharm as an administrator and do not appear to have any issues reading or writing files.

Side questions: is it fine to have an entire article on one line even if it exceeds the model's sequence length? And if so, should I set sample_by to "document" instead of "paragraph"? Or is that more for reading a set of individually relevant files, rather than the conglomerate of articles I am building?

Initially, I read that each line could be very long (for example, an entire magazine article or an entire transcript on each line of the .txt file), so I replaced each newline character with "[SEP]" and then accounted for this special token as below.

if tokenizer.sep_token is None:
    tokenizer.add_special_tokens({'sep_token': '[SEP]'})
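
The replacement step itself was just simple string preprocessing, roughly like this (simplified; the real code processes many articles, and the file names here are placeholders):

with open("article.txt", "r", encoding="utf-8") as f:
    article_text = f.read()

# One article per line of the training file, with internal newlines
# replaced by the [SEP] token.
single_line = article_text.replace("\n", " [SEP] ").strip()

with open("train_data.txt", "a", encoding="utf-8") as out:
    out.write(single_line + "\n")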

But then I read that the "index out of range in self" error can be caused by training inputs that are too long, so I set up a process that first harvests the data as-is and then, for every unique maximum sequence length among the models we want to experiment with, creates/caches a new batch as necessary so that each line stays under that maximum token length.
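
In simplified form, that chunking step looks roughly like this (character-based here for brevity; the real version works from each model's max length, and the names are illustrative):

articles = ["...raw harvested article text..."]  # placeholder for the harvested data

def split_into_lines(text, max_chars=1024):
    # Break one long article into pieces of at most max_chars characters,
    # so that each output line stays under the model's sequence length.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

with open("train_split_falcon.txt", "w", encoding="utf-8") as out:
    for article in articles:
        for piece in split_into_lines(article):
            out.write(piece + "\n")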

To confirm that I was not exceeding the maximum token length, and to determine whether this was the issue, I ran a test where each line is only 1,024 characters, which should be well under the models' actual sequence lengths of 512, 2048, etc. tokens; however, after doing this I am still getting the same error.

I have also tried with and without a trailing blank line, to make sure the out-of-bounds error was not related to that, but it made no difference.

I have run large tests using our entire dataset: about 2.15 GB of data spread over 53 files of 7 MB to 50 MB each, which, once each line is kept under the sequence length, comes out to hundreds of thousands of training inputs. Same error.

I have also run small tests using just 12 files, each with only 4 lines, each line about 1,000 characters long and containing only alphanumeric characters, commas, and periods, with no [SEP] token. Same error.

I have tried per_device_train_batch_size and per_device_eval_batch_size values of 1, 8, and 500 to make sure this was not the issue, but no luck.

In the full version of the code, I cache the tokenized datasets (as below), but when the program tries to load them on subsequent runs, it fails with "An error occurred while generating the dataset". This suggests to me that even though we can tokenize the dataset without error, it is not actually in the correct format, which is likely where the issue lies.

Saving tokenized dataset:
tokenized_train_dataset.save_to_disk(tokenized_train_dataset_cache_path)

Loading tokenized dataset:
tokenized_train_dataset = load_dataset(tokenized_train_dataset_cache_path)

I realize that this training input won't necessarily produce the desired behavior for a true chatbot, but we want to get this running to establish a baseline before we look into formatting our data further to include input and output labels.

It is probably also important to point out that, for testing purposes, the test and validation files are just placeholders for now: each file is simply three sample inputs from our training data, as I am not yet sure how these should be formatted for the kind of text training input we are working with.

I would be very grateful to anybody who can shed some light or point me in the right direction. Thank you in advance.

I changed padding="max_length" to padding="longest" and am now getting:

“The model did not return a loss from the inputs, only the following keys: last_hidden_state,past_key_values. For reference, the inputs it received are input_ids,attention_mask.”

TYIA

I have realized I am doing quite a few things wrong, like using a compute_metrics/accuracy function; I guess that is for sequence classification? I am now emulating what I see in this Google Colab notebook, as it seems more aligned with what I need.

Hi @capnchat, indeed, these "index out of range in self" errors usually indicate that you're passing inputs longer than the context size of the model (2048 tokens in the case of Falcon). See e.g. this GitHub issue for a similar problem, where the solution is usually to specify max_length=2048, or whatever the model's max context size is.
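
For example, something along these lines in the tokenization step (untested, and 2048 assumes Falcon; use your model's own limit):

def tokenize(examples):
    # Truncate explicitly to the model's context window so no sequence
    # is longer than the positions the model was built with.
    return tokenizer(examples["text"], truncation=True, max_length=2048,
                     padding="max_length")

tokenized_train_dataset = train_dataset.map(tokenize, batched=True)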

Now, based on your initial post, it seems you would like to train a chatbot, right? For that you will likely need to use the AutoModelForCausalLM class, as that's the one that handles language modelling. A quick way to fine-tune models for chat can be found in the TRL library here: Supervised Fine-tuning Trainer
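
If you want to stay with the plain Trainer for now, a rough (untested) sketch reusing the variable names from your snippet would be:

from transformers import (AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer)

model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True)
# If you add a new pad/sep token, the embedding matrix has to be resized to
# match, otherwise the new token id itself triggers "index out of range in self".
model.resize_token_embeddings(len(tokenizer))

# mlm=False makes the collator copy input_ids into labels, so the model
# returns a causal language-modelling loss during training.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)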

Hope that helps!


Finally got my project to run in Colab, was able to train using a very simple test dataset, and am tokenizing my real dataset now!


Thanks @lewtun! I have changed to AutoModelForCausalLM and have made great progress. About to test with our real dataset now!


Hi @lewtun. I’m facing the same error when I try to use the model “hf-tiny-model-private_tiny-random-LayoutLMv3ForQuestionAnswering”.

IndexError                                Traceback (most recent call last)
<ipython-input-53-fc6c5d353309> in <cell line: 62>()
    119 
    120           with torch.no_grad():
--> 121               outputs = model(**encoding)
    122               start_logits = outputs.start_logits
    123               end_logits = outputs.end_logits

12 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2231         # remove once script supports set_grad_enabled
   2232         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2233     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2234 
   2235 

IndexError: index out of range in self

my encoding is:

encoding = processor(image, question, words, boxes=boxes, max_length=2048, padding="max_length", truncation=True, return_tensors="pt")

Also, here are my prints:

Length of Words: 190
Length of Boxes: 190
boxes: [[653, 24, 731, 49], [667, 43, 885, 75], [646, 93, 923, 111], [399, 121, 594, 142], [200, 184, 337, 199], [443, 184, 511, 196], [200, 201, 271, 213], [443, 202, 483, 212], [200, 218, 271, 230], [443, 219, 535, 231], [200, 235, 295, 247], [443, 235, 504, 247], [200, 252, 358, 268], [443, 252, 519, 264], [202, 271, 295, 282], [443, 269, 612, 285], [110, 346, 181, 355], [233, 345, 299, 357], [354, 345, 420, 357], [467, 345, 526, 355], [571, 345, 629, 355], [718, 345, 756, 355], [855, 346, 891, 355], [253, 357, 281, 366], [375, 357, 404, 366], [477, 357, 516, 368], [581, 357, 620, 368], [725, 357, 749, 369], [862, 357, 886, 369], [232, 369, 303, 378], [351, 367, 422, 381], [460, 369, 533, 378], [563, 367, 635, 381], [722, 371, 748, 378], [844, 369, 902, 378], [474, 381, 541, 390], [119, 392, 141, 402], [175, 393, 196, 402], [241, 392, 262, 402], [306, 393, 326, 402], [361, 392, 384, 402], [417, 393, 438, 402], [476, 392, 516, 402], [579, 392, 620, 402], [655, 393, 668, 401], [729, 392, 768, 402], [874, 392, 900, 402], [121, 404, 144, 413], [175, 405, 196, 413], [242, 404, 264, 413], [306, 405, 326, 414], [363, 404, 386, 413], [417, 405, 438, 413], [477, 404, 516, 414], [579, 404, 620, 414], [729, 404, 767, 413], [874, 404, 900, 413], [122, 416, 139, 426], [175, 417, 196, 425], [244, 416, 259, 425], [306, 417, 326, 425], [364, 416, 381, 425], [417, 416, 438, 425], [483, 416, 508, 425], [587, 416, 613, 426], [654, 417, 669, 424], [735, 416, 760, 426], [829, 416, 845, 425], [871, 416, 904, 426], [123, 428, 139, 437], [175, 428, 196, 437], [245, 428, 261, 437], [306, 428, 326, 437], [363, 428, 383, 437], [417, 428, 438, 437], [483, 428, 508, 437], [586, 428, 613, 438], [654, 428, 668, 436], [736, 428, 760, 437], [829, 428, 845, 436], [872, 428, 903, 437], [127, 441, 133, 448], [175, 440, 196, 448], [247, 441, 255, 448], [307, 440, 327, 449], [369, 441, 377, 448], [417, 440, 438, 448], [476, 439, 516, 449], [581, 439, 620, 449], [729, 439, 768, 449], [874, 439, 899, 449], [128, 452, 136, 460], [175, 452, 196, 460], [250, 452, 257, 460], [307, 452, 326, 461], [366, 451, 379, 460], [417, 452, 438, 460], [476, 451, 516, 461], [579, 451, 620, 461], [654, 452, 671, 460], [730, 451, 768, 461], [874, 451, 900, 461], [176, 463, 195, 475], [308, 463, 327, 475], [344, 463, 400, 473], [418, 464, 437, 473], [467, 463, 525, 473], [572, 463, 627, 473], [652, 463, 675, 472], [720, 463, 775, 473], [825, 463, 849, 472], [859, 463, 915, 473], [177, 475, 193, 484], [308, 475, 325, 484], [344, 474, 400, 485], [418, 475, 437, 484], [466, 474, 525, 485], [572, 474, 627, 485], [652, 475, 675, 483], [720, 474, 775, 485], [824, 475, 849, 483], [859, 474, 915, 485], [462, 485, 552, 498], [122, 498, 137, 508], [243, 498, 259, 508], [364, 498, 381, 508], [478, 498, 516, 508], [587, 498, 613, 508], [656, 500, 668, 506], [729, 498, 768, 508], [875, 498, 902, 508], [123, 509, 140, 520], [245, 509, 261, 520], [366, 510, 383, 520], [477, 510, 515, 520], [588, 510, 612, 520], [655, 511, 672, 519], [729, 510, 767, 520], [875, 510, 900, 520], [120, 521, 140, 532], [241, 522, 260, 531], [362, 521, 383, 532], [477, 521, 516, 532], [587, 521, 613, 532], [729, 521, 768, 532], [882, 522, 899, 531], [121, 534, 141, 543], [239, 534, 264, 544], [361, 534, 384, 544], [476, 534, 515, 543], [588, 534, 612, 543], [729, 534, 767, 543], [875, 534, 899, 543], [119, 546, 141, 555], [175, 546, 195, 555], [239, 546, 263, 555], [361, 546, 384, 555], [476, 545, 516, 555], [588, 546, 612, 555], [727, 545, 770, 556], [872, 545, 906, 556], [120, 
557, 144, 566], [175, 558, 195, 566], [242, 557, 264, 566], [306, 557, 327, 566], [362, 557, 386, 566], [417, 558, 438, 566], [476, 557, 515, 566], [587, 557, 612, 566], [729, 557, 767, 566], [871, 557, 904, 567], [88, 640, 600, 658], [89, 658, 715, 672], [719, 660, 778, 670], [90, 675, 494, 692], [89, 691, 801, 709], [89, 711, 753, 726], [750, 709, 840, 727], [90, 728, 165, 739], [89, 744, 458, 760], [89, 762, 235, 776], [237, 766, 245, 772], [246, 762, 350, 776], [356, 762, 580, 777], [598, 763, 638, 774], [91, 809, 170, 823], [200, 807, 276, 820], [331, 825, 441, 840], [305, 545, 328, 555], [418, 546, 439, 555]]
Image: 2022_02_25_PMT428.png
model.config.vocab_size: 1024
tokenizer: 1024

Do you have any idea about this error?
Thanks for your reply.