Fine-tuning Meta-Llama-3.1-8B-Instruct for function calling

Hi,

I am trying to fine-tune Meta-Llama-3.1-8B-Instruct to make its function-call predictions better. To do that, I created a dataset and followed the steps in the "Fine-Tuning Llama-3.1-8B for Function Calling using LoRA | by Gautam Chutani | Medium" blog. As a result, the function name and parameters are predicted perfectly, but now the model generates weird answers like [get_weather(city="IL")] for prompts such as "how are you?".

Please find the training code snippets below:

import torch
from unsloth import FastLanguageModel

max_seq_length = 2048     # Unsloth auto supports RoPE Scaling internally!
dtype = None              # None for auto detection
load_in_4bit = False      # 4-bit quantization reduces memory usage; disabled here

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,   # LoRA rank - suggested values: 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,   # Supports any, but = 0 is optimized
    bias="none",      # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # Ideal for long context tuning
    random_state=3407,
    use_rslora=False,   # Disable rank-sensitive LoRA for simpler tasks
    loftq_config=None   # No LoftQ, for standard fine-tuning
)
from unsloth.chat_templates import get_chat_template

# Initialize the tokenizer with the chat template and mapping
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True,        # Maps <|im_end|> to <|eot_id|> instead
)

def formatting_prompts_func(examples):
    convos = []

    # Iterate through each item in the batch (examples are structured as lists of values)
    for query, tools, answers in zip(examples['query'], examples['tool'], examples['answer']):
        tool_user = {
            "content": f"You are a helpful assistant with access to the following tools or function calls. Your task is to produce a sequence of tools or function calls necessary to generate response to the user utterance. Use the following tools or function calls as required:\n{tools}",
            "role": "system"
        }
        ques_user = {
            "content": f"{query}",
            "role": "user"
        }
        assistant = {
            "content": f"{answers}",
            "role": "assistant"
        }
        convos.append([tool_user, ques_user, assistant])

    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

# Apply the formatting on dataset
dataset = dataset.map(formatting_prompts_func, batched = True,)
from transformers import TrainingArguments

args = TrainingArguments(
        per_device_train_batch_size = 8,  # Controls the batch size per device
        gradient_accumulation_steps = 2,  # Accumulates gradients to simulate a larger batch
        warmup_steps = 5,
        learning_rate = 2e-4,             # Sets the learning rate for optimization
        num_train_epochs = 2,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,              # Regularization term for preventing overfitting
        lr_scheduler_type = "linear",     # Chooses a linear learning rate decay
        seed = 3407,
        output_dir = "outputs",
        logging_steps = 1,                # Sets frequency of logging to W&B
        logging_strategy = "steps",       # Logs metrics at each specified step
        save_strategy = "no",
        load_best_model_at_end = True,    # Loads the best model at the end
        report_to = "none",
        save_only_model = False           # Saves entire model, not only weights
    )
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,        # Packing can make training ~5x faster for short sequences; disabled here
    args = args
)
from unsloth import unsloth_train

trainer_stats = unsloth_train(trainer)
print(trainer_stats)

What am I missing?

Thank you for your help :slight_smile:

1 Like

Assuming that the model was trained using that prompt structure, I think it may have forgotten other conversation patterns. It has become overly specialized. How about mixing in negative examples such as the following?

{"query": "how are you?", 
 "tools": [], 
 "answer": "I’m doing well—thank you for asking!"}

Hi,

I tried fine-tuning with a dataset of only two rows, and the same thing happened.

What I found out is that the fine-tuned model is able to generate answers to simple questions, but the problem occurs with large RAG prompts.

Do you have any further ideas about this?

Thank you for your help.

1 Like

I think this phenomenon is what is known as “catastrophic forgetting,” but I don’t think there is anything particularly wrong with your method…

Perhaps the learning rate is too high, or something like that?
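
For example, you could rerun with a much lower learning rate in the same TrainingArguments; the value below is just a starting point to experiment with, not a recommendation:

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 2,
    warmup_steps = 5,
    learning_rate = 1e-5,          # much lower than 2e-4; try values down toward 1e-6
    num_train_epochs = 1,          # fewer passes can also reduce over-specialization
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    output_dir = "outputs",
    logging_steps = 1,
    save_strategy = "no",
    report_to = "none",
)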

1 Like

Thank you my friend! I decreased the learning rate to 1e-6 and it is better now. I learned a lot from your suggestions. Thank you again :slight_smile:
Cheers

Orkun

1 Like

The primary issue you’re encountering stems from your training dataset and system prompt setup, which are biasing the model toward always generating function calls, even when they’re unnecessary.

During fine-tuning, the model never learned scenarios where no function call is needed. It overfits to the pattern of always outputting a tool call, leading to hallucinations like inventing irrelevant calls for casual prompts such as “how are you?”

  • Reload your dataset, add 1,000+ non-tool examples, and retrain.
  • If you’re still seeing weird outputs, share a sample of your dataset rows or inference code for more specific debugging (a quick inference check is sketched below).
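
As a quick sanity check after retraining, something along these lines (a sketch using Unsloth's inference mode, assuming the fine-tuned model and tokenizer from the training code above are still in memory) shows whether the model still emits spurious tool calls for casual prompts:

from unsloth import FastLanguageModel

# Switch the fine-tuned model into Unsloth's faster inference mode
FastLanguageModel.for_inference(model)

# A casual prompt that should NOT trigger a tool call
messages = [{"role": "user", "content": "how are you?"}]

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,   # appends the assistant header so the model replies
    return_tensors = "pt",
).to(model.device)

outputs = model.generate(input_ids = input_ids, max_new_tokens = 64, use_cache = True)

# Print only the newly generated tokens; a healthy model answers in plain text,
# while an over-specialized one emits something like [get_weather(city="IL")]
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens = True))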
2 Likes
