Repetitive Token Generation During Evaluation in Fine-Tuned LLaMA Model

I’m fine-tuning a LLaMA-based model (Llama-3.3-70B-Instruct) to generate Overpass Turbo queries (a query language for extracting specific geographic data from OpenStreetMap) from natural language prompts. For experimental reasons, I call .generate() inside trainer.evaluate() to track the model’s predictions during evaluation and compare them with its direct logits output. However, I notice that the model’s raw predictions (the argmax over the evaluation logits) are highly repetitive, while .generate() produces much more coherent output.

I wanted to check if there is an issue somewhere, whether in how evaluation is handled, my setup, or something else. Any insights would be appreciated!
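For context, my current understanding of why the two can legitimately differ (a minimal sketch, not my actual eval code; model and input_ids stand in for any causal LM and a tokenized prompt): the logits that reach compute_metrics are teacher-forced, i.e. the logit at position i is a one-step prediction conditioned on the ground-truth prefix, while .generate() conditions every new token on the model’s own previous outputs.

import torch

@torch.no_grad()
def teacher_forced_argmax(model, input_ids):
    # One forward pass over the full (prompt + target) sequence:
    # logits[:, i] is the prediction for position i + 1 given the ground-truth prefix.
    logits = model(input_ids=input_ids).logits
    return logits[:, :-1].argmax(dim=-1)  # compare against input_ids[:, 1:]

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=64):
    # Autoregressive decoding: every new token is conditioned on the model's
    # own previously generated tokens, which is roughly what greedy .generate() does.
    for _ in range(max_new_tokens):
        next_id = model(input_ids=input_ids).logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids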

My evaluation function:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    
    # Convert logits to token IDs
    predictions = np.argmax(logits, axis=-1)
    
    # Debug: Check raw logits and token IDs
    print("Raw Logits:", logits.shape)
    print("Predictions Token IDs:", predictions)
    
    # Remove ignored index (-128004) from labels
    labels = [[token for token in label if token != -128004] for label in labels]

    # Convert token IDs back to text
    predictions_text = [tokenizer.decode(pred, skip_special_tokens=True) for pred in predictions]
    labels_text = [tokenizer.decode(label, skip_special_tokens=True) for label in labels]
    
    # Debug: print the model's predictions and compare them with the model.generate output
    DEVICE, _, _ = get_backend() 
    test_inputs = tokenizer(["Generate an Overpass Turbo query to find all basketball courts in Montreal."], return_tensors="pt").to(DEVICE)
    test_output_ids = model.generate(**test_inputs)
    test_output_text = [tokenizer.decode(output, skip_special_tokens=True) for output in test_output_ids]
    for i in range(len(test_output_text)):
        print(f"âś… test prediction {i}: {test_output_text[i]}")
        print("*" * 50)

    for i in range(len(predictions_text)):
        print(f"âś… Prediction {i}: {predictions_text[i]}")
        print(f"âś… Label {i}: {labels_text[i]}")
        print("-" * 50)

    # Compute and return metrics 
    ...

Example output of model.generate (top) vs. the predictions decoded from the eval logits (bottom):
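As a side note (a sketch, not something in my current script): I believe Trainer accepts a preprocess_logits_for_metrics hook that can reduce the logits to token IDs per eval step, so compute_metrics doesn’t have to hold the full-vocab logits; the body below is just what I assume should be kept.

def preprocess_logits_for_metrics(logits, labels):
    # Runs on-device for each eval step; whatever is returned here is what
    # compute_metrics later receives in place of the raw logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

(With this hook passed via Trainer(..., preprocess_logits_for_metrics=preprocess_logits_for_metrics), compute_metrics would receive token IDs directly and could skip its own np.argmax.)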

Other key parts of my fine-tuning script:

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, 
    use_safetensors=True,
    torch_dtype=torch.bfloat16)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none"
)

model.enable_input_require_grads()
model = get_peft_model(model, config)
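
A sanity check I find useful right after wrapping the model (assuming the installed peft version exposes it), just to confirm that only the adapter weights are trainable:

model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...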

def preprocess(batch):
    inputs = [
        f"Using this data {batch['system'][i]}, generate overpass turbo query: {batch['prompt'][i]}"
        if batch['system'][i] else f"Generate overpass turbo query: {batch['prompt'][i]}"
        for i in range(len(batch['prompt']))
    ]
    
    model_inputs = tokenizer(
        inputs,
        text_target=batch["completion"],
        padding="max_length",
        max_length=256, 
    )
    
    return model_inputs

tokenized_op_train = op_train.map(
    preprocess, batched=True, remove_columns=op_train.column_names)
tokenized_op_test = op_test.map(
    preprocess, batched=True, remove_columns=op_test.column_names)
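
To check the preprocessing, I sometimes decode one processed example and eyeball it (my own debugging snippet, not part of the training flow; index 0 is arbitrary):

sample = tokenized_op_train[0]
print("INPUT :", tokenizer.decode(sample["input_ids"], skip_special_tokens=True))
# Filter out any negative ignore-index values defensively before decoding the labels.
print("LABELS:", tokenizer.decode([t for t in sample["labels"] if t >= 0], skip_special_tokens=True))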

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model
)

training_args = TrainingArguments(
    num_train_epochs = epochs,
    output_dir=str(VOL_MOUNT_PATH / "model"),
    logging_dir=str(VOL_MOUNT_PATH / "logs"),
    metric_for_best_model="exact_match",
    logging_strategy="steps",
    logging_steps=10,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    bf16=True,
    learning_rate=3e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    label_names=["labels"]
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_op_train,
    eval_dataset=tokenized_op_test,
    compute_metrics=compute_metrics
)
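
I also looked at Seq2SeqTrainer with predict_with_generate=True as an alternative way of getting generated sequences into compute_metrics instead of calling model.generate() by hand, though I’m not sure how cleanly that works with a decoder-only model. A rough sketch of what I mean (the argument names exist in Seq2SeqTrainingArguments; the values are placeholders):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

gen_args = Seq2SeqTrainingArguments(
    output_dir=str(VOL_MOUNT_PATH / "model"),
    per_device_eval_batch_size=1,
    bf16=True,
    predict_with_generate=True,    # compute_metrics then receives generated token IDs
    generation_max_length=256,
)
gen_trainer = Seq2SeqTrainer(
    model=model,
    args=gen_args,
    data_collator=data_collator,
    eval_dataset=tokenized_op_test,
    compute_metrics=compute_metrics,
)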

try:
    resume = restarts > 1
    if resume:
        print("resuming from checkpoint")
    trainer.train(resume_from_checkpoint=resume)
except KeyboardInterrupt:  # handle possible preemption
    print("received interrupt; saving state and model")
    trainer.save_state()
    trainer.save_model()
    raise

model.save_pretrained(str(VOL_MOUNT_PATH / MODEL_NAME), safe_serialization=True)
tokenizer.save_pretrained(str(VOL_MOUNT_PATH / FINETUNED_MODEL_NAME))
output_vol.commit()

... 

This is a problem we often hear about when fine-tuning Llama 3, but this time it doesn’t seem to stem from the base model, so I wonder whether that’s really the cause here.