The issue happens because the llama2 tokenizer does not ship with `model_max_length` set, so it falls back to Hugging Face's `VERY_LARGE_INTEGER` default, 1000000000000000019884624838656 (i.e., int(1e30)). You can set the value manually if you look up the max sequence length for your model, e.g., 4096 for llama2-7b:
# - Get tokenized train data set
# Note: `batched=True` in `dataset.map` (Hugging Face datasets library) processes the data
# in batches rather than one example at a time, significantly speeding up tokenization.
tokenized_train_datasets = raw_train_datasets.map(tokenize_function, batched=True, remove_columns=remove_columns)
# block_size: int = tokenizer.model_max_length  # don't: defaults to ~1e30 for llama2
block_size: int = 4096  # llama2-7b's context length
That removes the issue because
total_length = (total_length // block_size) * block_size
will no longer be zero when concatenating all 1000 texts in the batch (assuming block_size is the model's max sequence length, or some smaller value you chose, rather than the ~1e30 default): the concatenated length now exceeds block_size, so the integer division keeps at least one full block instead of flooring the total to zero and silently dropping the whole batch.
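To see the failure mode concretely, here is a minimal, self-contained sketch of the standard `group_texts` step from the Hugging Face language-modeling examples, run on hypothetical toy data (the 1000-text batch and the token lengths are made up for illustration):

```python
def group_texts(examples: dict, block_size: int) -> dict:
    """Concatenate all tokenized texts in the batch, then split into
    chunks of block_size, dropping the remainder."""
    # Concatenate every column (e.g., input_ids) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(next(iter(concatenated.values())))
    # Floor to a multiple of block_size. If block_size > total_length
    # (as with the ~1e30 default), this is 0 and everything is dropped.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Hypothetical batch: 1000 texts of 10 tokens each -> 10,000 tokens total.
batch = {"input_ids": [[1] * 10 for _ in range(1000)]}

# With block_size = 4096 (llama2-7b), two full blocks survive.
ok = group_texts(batch, 4096)
print(len(ok["input_ids"]))  # -> 2, since 10_000 // 4096 == 2

# With the unset default, total_length floors to 0: the batch vanishes.
bad = group_texts(batch, 1000000000000000019884624838656)
print(len(bad["input_ids"]))  # -> 0
```

This is why setting block_size to the model's real context length (or any value below the concatenated batch length) makes the error go away.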