Fine-Tuned unsloth/Qwen2.5-1.5B Model Generating Unexpected Exclamation Marks

I’m encountering an issue with the fine-tuned unsloth/Qwen2.5-1.5B model, where the output includes unexpected exclamation marks (!) during text generation.

Process Followed:

  1. Fine-Tuning:
    After fine-tuning, I received the following files:
  • merges, tokenizer, training_args.bin, vocab, adapter_config, adapter_model.safetensors, added_tokens, README, special_tokens_map, tokenizer_config.
  2. Error on Generation:
    When I attempted to generate text, I got an error because config.json and model.safetensors were missing. I worked around this by renaming adapter_model.safetensors to model.safetensors and adding a config.json copied from a model repository on Hugging Face.
    Content of config.json:
    {
      "_name_or_path": "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/",
      "architectures": ["Qwen2ForCausalLM"],
      "attention_dropout": 0.0,
      "bos_token_id": 151643,
      "eos_token_id": 151643,
      "hidden_act": "silu",
      "hidden_size": 1536,
      "initializer_range": 0.02,
      "intermediate_size": 8960,
      "max_position_embeddings": 32768,
      "max_window_layers": 28,
      "model_type": "qwen2",
      "num_attention_heads": 12,
      "num_hidden_layers": 28,
      "num_key_value_heads": 2,
      "quantization_config": {
        "_load_in_4bit": false,
        "_load_in_8bit": false,
        "quant_method": "bitnet"
      },
      "rms_norm_eps": 1e-06,
      "rope_theta": 1000000.0,
      "vocab_size": 151936
    }

Code:

Here’s my current code for generating responses:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Define the input structure
class CodeReviewInput(BaseModel):
    diff: str

# Load the locally saved fine-tuned model and tokenizer
model_path = "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float32, device_map="cpu"
)

torch.set_num_threads(4)

@app.post("/code_review/")
async def code_review(input_data: CodeReviewInput):
    diff = input_data.diff.strip()
    if not diff:
        raise HTTPException(status_code=400, detail="Input 'diff' cannot be empty.")

    alpaca_prompt = (
        "### Instruction:\n{0}\n\n"
        "### Input:\n{1}\n\n"
        "### Response:\n"
    )

    prompt = alpaca_prompt.format(
        "Review the code changes and provide feedback.",
        diff,
    )

    inputs = tokenizer([prompt], return_tensors="pt", truncation=True, max_length=512)

    try:
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            no_repeat_ngram_size=2,
            temperature=0.7,
        )
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Keep only the text after the "### Response:" marker
        response_start = "### Response:"
        if response_start in generated_text:
            response = generated_text.split(response_start, 1)[-1].strip()
        else:
            response = "Error: Model output incomplete or malformed."

        return {"response": response}

    except torch.cuda.OutOfMemoryError:
        raise HTTPException(
            status_code=500, detail="Model ran out of memory. Reduce input size."
        )
    except Exception as e:
        raise HTTPException(
            status_code=500, detail=f"Unexpected error: {str(e)}"
        )


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
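For reference, this is how I call the endpoint (a minimal client sketch; the diff payload here is just a placeholder):

import requests

# Placeholder diff; in practice this is a real git diff string.
payload = {"diff": "- old line\n+ new line"}
resp = requests.post("http://localhost:8000/code_review/", json=payload)
print(resp.json())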

  • How can I stop the model from unexpectedly generating these exclamation marks?

I would appreciate any insights on how to fix this issue.

Thank you

Do not rename it. adapter_model.safetensors contains only the fine-tuned adapter weights, not the full model.
Please load the base model and apply your adapter to it with PEFT.
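Something like this, as a rough sketch (assuming the adapter was trained on unsloth/Qwen2.5-1.5B and that the directory from your script contains adapter_config.json, adapter_model.safetensors, and the tokenizer files; adjust names and paths to your setup):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "unsloth/Qwen2.5-1.5B"  # base model the adapter was trained on
adapter_path = "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/"

# Load the untouched base model first...
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float32, device_map="cpu"
)
# ...then apply the fine-tuned LoRA adapter on top of it.
model = PeftModel.from_pretrained(base_model, adapter_path)

# Optionally merge the adapter into the base weights and save a standalone
# directory that AutoModelForCausalLM can load directly afterwards:
# model = model.merge_and_unload()
# model.save_pretrained("Qwen2.5-finetuned-merged")

# The tokenizer files were saved next to the adapter, so load them from there.
tokenizer = AutoTokenizer.from_pretrained(adapter_path)

With only the adapter file renamed to model.safetensors, most of the model's weights were never actually loaded, which is the likely cause of the meaningless "!" output.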

Thank you very much, it's working! Also, is there any way to reduce the response latency? A response currently takes about a minute. I'm using a virtual machine with 4 CPUs and 16 GB of RAM.

It's quite difficult to speed up an LLM in a CPU-only environment. A GPU can make it dozens of times faster, so relatively few people work on CPU-only optimization...
The general PyTorch speed-up techniques below are applicable, though I only use a few of them in practice myself.
There are also optimizations specific to Intel CPUs, but I've never tested how well they work in a VM. :sweat_smile:
https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html
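From those guides, the pieces that usually matter most on a CPU-only VM are the thread settings and running generation without autograd; the IPEX part applies only to Intel CPUs. A rough sketch against the FastAPI script above (model and inputs come from that script; the ipex lines are an assumption and require intel_extension_for_pytorch to be installed):

import torch

# Match the number of threads to the physical cores (the VM has 4 vCPUs).
torch.set_num_threads(4)

# Skip autograd bookkeeping during generation.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256)

# Optional, Intel CPUs only:
# import intel_extension_for_pytorch as ipex
# model = ipex.optimize(model, dtype=torch.bfloat16)

Keeping max_new_tokens as small as the task allows also helps, since generation time grows roughly linearly with the number of generated tokens.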
