Fine-Tuned unsloth/Qwen2.5-1.5B Model Generating Unexpected Exclamation Marks

I’m encountering an issue with the fine-tuned unsloth/Qwen2.5-1.5B model, where the output includes unexpected exclamation marks (!) during text generation.

Process Followed:

  1. Fine-Tuning:
    After fine-tuning, I received the following files:
  • merges, tokenizer, training_args.bin, vocab, adapter_config, adapter_model.safetensors, added_tokens, README, special_tokens_map, tokenizer_config.
  2. Error on Generation:
    When I attempted to generate text, I got an error because config.json and model.safetensors were missing. I worked around this by renaming adapter_model.safetensors to model.safetensors and adding a config.json copied from a model repository on Hugging Face.
    Content of config.json:
    {
      "_name_or_path": "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/",
      "architectures": ["Qwen2ForCausalLM"],
      "attention_dropout": 0.0,
      "bos_token_id": 151643,
      "eos_token_id": 151643,
      "hidden_act": "silu",
      "hidden_size": 1536,
      "initializer_range": 0.02,
      "intermediate_size": 8960,
      "max_position_embeddings": 32768,
      "max_window_layers": 28,
      "model_type": "qwen2",
      "num_attention_heads": 12,
      "num_hidden_layers": 28,
      "num_key_value_heads": 2,
      "quantization_config": {
        "_load_in_4bit": false,
        "_load_in_8bit": false,
        "quant_method": "bitnet"
      },
      "rms_norm_eps": 1e-06,
      "rope_theta": 1000000.0,
      "vocab_size": 151936
    }

Code:

Here’s my current code for generating responses:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Define the input structure
class CodeReviewInput(BaseModel):
    diff: str

# Load the locally saved fine-tuned model and tokenizer
model_path = "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float32, device_map="cpu"
)

torch.set_num_threads(4)

@app.post("/code_review/")
async def code_review(input_data: CodeReviewInput):
    diff = input_data.diff.strip()
    if not diff:
        raise HTTPException(status_code=400, detail="Input 'diff' cannot be empty.")

    alpaca_prompt = (
        "### Instruction:\n{0}\n\n"
        "### Input:\n{1}\n\n"
        "### Response:\n"
    )

    prompt = alpaca_prompt.format(
        "Review the code changes and provide feedback.",
        diff,
    )

    inputs = tokenizer([prompt], return_tensors="pt", truncation=True, max_length=512)

    try:
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            no_repeat_ngram_size=2,
            temperature=0.7,
        )
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Keep only the text after the "### Response:" marker
        response_start = "### Response:"
        if response_start in generated_text:
            response = generated_text.split(response_start, 1)[-1].strip()
        else:
            response = "Error: Model output incomplete or malformed."

        return {"response": response}

    except torch.cuda.OutOfMemoryError:
        raise HTTPException(
            status_code=500, detail="Model ran out of memory. Reduce input size."
        )
    except Exception as e:
        raise HTTPException(
            status_code=500, detail=f"Unexpected error: {str(e)}"
        )


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
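For reference, this is how I call the endpoint (a minimal client sketch; the diff payload here is just a placeholder):

import requests

# Placeholder diff; in practice this is a real git diff string.
payload = {"diff": "- old line\n+ new line"}
resp = requests.post("http://localhost:8000/code_review/", json=payload)
print(resp.json())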

  • How can I stop the model from unexpectedly generating these exclamation marks?

I would appreciate any insights on how to fix this issue.

Thank you

Do not rename it. adapter_model.safetensors contains only the fine-tuned adapter weights, not the full model.
Please load the base model and apply your adapter to it with PEFT.
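Something like this, as a rough sketch (assuming the adapter was trained on unsloth/Qwen2.5-1.5B and that the directory from your script contains adapter_config.json, adapter_model.safetensors, and the tokenizer files; adjust names and paths to your setup):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_id = "unsloth/Qwen2.5-1.5B"  # base model the adapter was trained on
adapter_path = "/home/azureuser/CodeReview/Qwen2.5-finetuned_without_BSB/Qwen2.5-finetuned_without_BSB/"

# Load the untouched base model first...
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float32, device_map="cpu"
)
# ...then apply the fine-tuned LoRA adapter on top of it.
model = PeftModel.from_pretrained(base_model, adapter_path)

# Optionally merge the adapter into the base weights and save a standalone
# directory that AutoModelForCausalLM can load directly afterwards:
# model = model.merge_and_unload()
# model.save_pretrained("Qwen2.5-finetuned-merged")

# The tokenizer files were saved next to the adapter, so load them from there.
tokenizer = AutoTokenizer.from_pretrained(adapter_path)

With only the adapter file renamed to model.safetensors, most of the model's weights were never actually loaded, which is the likely cause of the meaningless "!" output.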

Thank you very much, it's working! Also, is there any way to reduce the response latency? A response currently takes about a minute. I'm using a virtual machine with 4 CPUs and 16 GB of RAM.

It's quite difficult to speed up an LLM in a CPU-only environment. A GPU can make it dozens of times faster, so relatively few people work on CPU-only optimization...
The general PyTorch speed-up techniques below are applicable, though I only use a few of them in practice myself.
There are also optimizations specific to Intel CPUs, but I've never tested how well they work in a VM. :sweat_smile:
https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html

https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html
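From those guides, the pieces that usually matter most on a CPU-only VM are the thread settings and running generation without autograd; the IPEX part applies only to Intel CPUs. A rough sketch against the FastAPI script above (model and inputs come from that script; the ipex lines are an assumption and require intel_extension_for_pytorch to be installed):

import torch

# Match the number of threads to the physical cores (the VM has 4 vCPUs).
torch.set_num_threads(4)

# Skip autograd bookkeeping during generation.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256)

# Optional, Intel CPUs only:
# import intel_extension_for_pytorch as ipex
# model = ipex.optimize(model, dtype=torch.bfloat16)

Keeping max_new_tokens as small as the task allows also helps, since generation time grows roughly linearly with the number of generated tokens.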
