The issue happens because the llama2 tokenizer does not ship with `model_max_length` set, so it falls back to Hugging Face's `VERY_LARGE_INTEGER` default, 1000000000000000019884624838656 (i.e., int(1e30)). You can set the value manually if you look up the max sequence length for your model, e.g., 4096 for llama2-7b:
# - Get tokenized train data set
# Note: `batched=True` in `dataset.map` (Hugging Face datasets library) processes the data
# in batches rather than one example at a time, significantly speeding up tokenization.
tokenized_train_datasets = raw_train_datasets.map(tokenize_function, batched=True, remove_columns=remove_columns)
# block_size: int = tokenizer.model_max_length  # don't: defaults to ~1e30 for llama2
block_size: int = 4096  # llama2-7b's context length
That removes the issue because
total_length = (total_length // block_size) * block_size
will no longer be zero when concatenating all 1000 texts in the batch (assuming block_size is the model's max sequence length, or some smaller value you chose, rather than the ~1e30 default): the concatenated length now exceeds block_size, so the integer division keeps at least one full block instead of flooring the total to zero and silently dropping the whole batch.
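To see the failure mode concretely, here is a minimal, self-contained sketch of the standard `group_texts` step from the Hugging Face language-modeling examples, run on hypothetical toy data (the 1000-text batch and the token lengths are made up for illustration):

```python
def group_texts(examples: dict, block_size: int) -> dict:
    """Concatenate all tokenized texts in the batch, then split into
    chunks of block_size, dropping the remainder."""
    # Concatenate every column (e.g., input_ids) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(next(iter(concatenated.values())))
    # Floor to a multiple of block_size. If block_size > total_length
    # (as with the ~1e30 default), this is 0 and everything is dropped.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Hypothetical batch: 1000 texts of 10 tokens each -> 10,000 tokens total.
batch = {"input_ids": [[1] * 10 for _ in range(1000)]}

# With block_size = 4096 (llama2-7b), two full blocks survive.
ok = group_texts(batch, 4096)
print(len(ok["input_ids"]))  # -> 2, since 10_000 // 4096 == 2

# With the unset default, total_length floors to 0: the batch vanishes.
bad = group_texts(batch, 1000000000000000019884624838656)
print(len(bad["input_ids"]))  # -> 0
```

This is why setting block_size to the model's real context length (or any value below the concatenated batch length) makes the error go away.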