Why do I get `UnboundLocalError: local variable 'batch_idx' referenced before assignment` when using interleaved datasets with Hugging Face (HF)?
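
For context, the error surfaces in a training loop shaped roughly like the hypothetical sketch below (the variable and function names are illustrative, not from the real code): `batch_idx` is only ever assigned inside the `for` statement, so it does not exist if the DataLoader yields zero batches.

    # Hypothetical training loop shape (names are illustrative):
    for batch_idx, batch in enumerate(train_dataloader):
        loss = compute_loss(batch)
        loss.backward()
    # If the dataloader is empty, the loop body never runs, so this line raises
    # UnboundLocalError: local variable 'batch_idx' referenced before assignment.
    print(f"epoch finished after {batch_idx + 1} batches")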

The issue happens because the llama2 tokenizer does not have a max sequence length set, so `tokenizer.model_max_length` defaults to a huge sentinel value (1000000000000000019884624838656, roughly 1e30). You can set the block size manually if you look up the max sequence length for your model, e.g. 4096 for llama2-7b:

    # - Get the tokenized train dataset
    # Note: `batched=True` in `datasets.map` processes the data in batches rather than one example at a time, which significantly speeds up tokenization and preprocessing.
    tokenized_train_datasets = raw_train_datasets.map(tokenize_function, batched=True, remove_columns=remove_columns)
    # block_size: int = tokenizer.model_max_length  # don't use this here: with no max length set, it is the ~1e30 sentinel
    block_size: int = 4096  # llama2-7b's context length, set by hand
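
If you prefer to guard against the sentinel programmatically instead of hard-coding 4096, something like the following works. This is a minimal sketch: the checkpoint name and the 1e29 threshold are assumptions, so adapt them to your setup.

    from transformers import AutoTokenizer

    # Assumed checkpoint; swap in whatever model you are actually training.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # When the tokenizer config carries no max length, model_max_length is a
    # sentinel around 1e30 rather than the real context window.
    if tokenizer.model_max_length > int(1e29):
        block_size = 4096  # llama2-7b context length, set by hand
    else:
        block_size = tokenizer.model_max_length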

That removes the error because

    total_length = (total_length // block_size) * block_size

will no longer round down to zero when the ~1000 texts in each `map` batch are concatenated and split into blocks of `block_size` tokens (assuming `block_size` is the model's max sequence length or some value you chose). With the ~1e30 default, the integer division yields 0, every token is dropped, the grouped dataset ends up empty, the DataLoader yields no batches, and `batch_idx` is never assigned, hence the `UnboundLocalError`.
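
For reference, the line above comes from the text-grouping step used in Hugging Face's causal-LM examples (e.g. `run_clm.py`); the sketch below reproduces its typical shape, your preprocessing may differ slightly. It shows exactly where an oversized `block_size` empties the dataset.

    def group_texts(examples):
        # Concatenate every text in the batch into one long list per column.
        concatenated = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated[list(examples.keys())[0]])
        # Drop the tail that does not fill a whole block. With the ~1e30 default,
        # total_length // block_size == 0, so *everything* gets dropped here.
        total_length = (total_length // block_size) * block_size
        # Split into chunks of block_size; with total_length == 0 this is empty.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result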

ref: [Tokenizers] What this max_length number?