TRL - Fine tuned small model (facebook350m) yields many empty inferences

drbearcub · June 19, 2025, 7:35am

Hello,

I am using the trl library to fine-tune a small model (facebook/opt-350m). I have ~4500 rows of data, split into training and eval sets at a 9:1 ratio. After training stopped (eval loss stopped improving), I ran inference on the eval set. Almost 30% of the eval prompts yield no output during inference.

After processing, each row looks something like this:

{"prompt": "[Course]KBAI\n[PastChat]None\n[UserQuery]policy for late submission", "completion": "policy for late submission"}

The goal is to train a small model to rephrase “[UserQuery]” into “completion” based on past chat and other context. In many examples, “completion” is an exact copy, a simple rephrase, or a sanitized version of “[UserQuery]”. In a few cases, “completion” resolves pronouns in [UserQuery].

When I run the training script with [PastChat] removed from each row of the training data, the model improves and fewer than 10% of inferences are empty.

Is it a common occurrence that a fine-tuned model would often produce empty output during inference?

Many rows of my training data are extremely long because the [PastChat] section is very long. For example:

{"prompt": "[Course]KBAI\n[PastChat] roles and achieving a more accurate understanding.'}, {'user': 'can prepositions have ambiguous meanings in sentences, but nouns and verbs have single match each ?'}, {'assistant': 'Yes, it is possible for prepositions to have ambiguous meanings in sentences while nouns and verbs typically have single matches. Prepositions are words that establish relationships between other words in a sentence, such as indicating location, direction, time, or manner. The meaning of a preposition can sometimes be context-dependent and can lead to ambiguity.\\n\\nOn the other hand, nouns and verbs generally have more specific meanings and tend to have single matches in terms of their refer'}, {'user': 'do nouns and verbs always have a single meaning each?'}, {'assistant': 'It seems that your query may not be directly related to the course material. I suggest that you try rephrasing your question or asking another question that is more relevant to the course. This will help us provide you with the best possible assistance.'}, {'user': 'do nouns and verbs always have a single meaning each in sentences from the course'}, {'assistant': 'In the context of the course material, nouns and verbs typically have a single meaning each. However, it is important to note that there can be instances where nouns and verbs have multiple meanings or can be interpreted differently based on the context.\\n\\nThe course material focuses on various aspects of artificial intelligence, including knowledge representation, reasoning, and problem-solving. While discussing these topics, the course may provide examples where nouns and verbs have specific meanings within the given context.\\n\\nHowever, in natural language, words can often have multiple meanings or can be used in different ways depending on the context. 
This ambiguity is a challenge in natural language processing and understanding.\\n\\nTo ensure accurate interpretation, it is important to consider the context, surrounding words, and the overall meaning of the sentence when determining the specific meaning of nouns and verbs.'}, {'user': 'are action frames defined first and foremost by the agent executing the action'}, {'assistant': 'Yes, action frames are primarily defined by the agent executing the action. An action frame represents the knowledge and information associated with a specific action or event. It typically includes slots for the agent, the action itself, and other relevant details such as the object or target of the action, location, time, and any other relevant attributes.\\n\\nThe agent, or the entity performing the action, is a crucial component of an action frame. It specifies who or what is responsible for carrying out the action. The agent\\'s characteristics, capabilities, and goals can influence the specific details and constraints of the action frame.\\n\\nFor example, in the context of a \"running\" action, the agent could be a human, an animal, or even a machine. The specific abilities, motivations, and constraints of the agent will shape the details of the action frame associated with running.\\n\\nTherefore, the agent executing the action plays a central role in defining the action frame and determining its specific attributes and characteristics.'}, {'user': 'is there a primitive classification of types of state changes similar to classification of primitive actions?'}, {'assistant': 'Yes, similar to the classification of primitive actions, there can also be a primitive classification of types of state changes. 
In the context of knowledge representation and problem-solving, understanding the cause-and-effect relationship between actions and states is essential for an AI agent to make informed decisions and solve problems effectively.\\n\\nPrimitive actions are used to capture the meaning of sentences or stories in a simple and logical manner. They involve moving objects, changing their properties, or performing specific operations. Similarly, primitive state changes can represent the fundamental types of changes that can occur in a system or environment.\\n\\nThese primitive state changes can include:\\n\\n1. Creation: The state change where a new object or entity is brought into existence.\\n\\n2. Deletion: The state change where an object or entity is removed or ceases to exist.\\n\\n3. Modification: The state change where the properties or attributes of an object or entity are altered.\\n\\n4. Movement: The state change where an object or entity changes its location or position.\\n\\n5. Combination/Separation: The state change where multiple objects or entities are combined or separated.\\n\\n6. Transition: The state change where an object or entity moves from one state to another.\\n\\nThese primitive state changes can serve as a foundation for organizing knowledge and understanding the dynamics of a system. By recognizing and classifying these types of state changes, an AI agent can better analyze and reason about the effects of actions and make appropriate decisions in a given context.'}]\n[UserQuery]so is there a defined, prescribed, and finite number of actions, objects, or agents to use in action frames. Tell me for each object, actions, and agents", "completion": "so is there a defined, prescribed, and finite number of actions, objects, or agents to use in action frames. Tell me for each object, actions, and agents"}
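Since removing [PastChat] reduces the empty outputs, one thing worth checking is whether long rows get cut off before the completion during training. The following is a minimal sketch, not part of the original script: `count_rows_over_budget` and `max_tokens` are hypothetical names, and `encode` stands for any function that maps a string to a list of token ids (for example `tokenizer.encode` from the tokenizer loaded in the script). It assumes over-long sequences are truncated from the right; whether and at what length that happens depends on the TRL version and configuration.

```python
from typing import Callable, Iterable, List


def count_rows_over_budget(
    rows: Iterable[dict],
    encode: Callable[[str], List[int]],
    max_tokens: int,
) -> dict:
    """Count rows whose prompt, or prompt plus completion, exceeds max_tokens.

    Prompt and completion are tokenized separately here, so the counts are an
    approximation of what a trainer sees after joining them.
    """
    total = 0
    prompt_too_long = 0     # right truncation would remove the whole completion
    completion_clipped = 0  # prompt fits, but part of the completion is cut
    for row in rows:
        total += 1
        n_prompt = len(encode(row["prompt"]))
        n_full = n_prompt + len(encode(row["completion"]))
        if n_prompt >= max_tokens:
            prompt_too_long += 1
        elif n_full > max_tokens:
            completion_clipped += 1
    return {
        "total": total,
        "prompt_too_long": prompt_too_long,
        "completion_clipped": completion_clipped,
    }


def whitespace_encode(text: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one id per whitespace-separated word."""
    return list(range(len(text.split())))


if __name__ == "__main__":
    toy_rows = [
        {"prompt": "a b c", "completion": "d"},
        {"prompt": "a b c d e f", "completion": "g"},
    ]
    print(count_rows_over_budget(toy_rows, whitespace_encode, max_tokens=5))
```

With the real data, passing `tokenizer.encode` and the sequence limit actually in effect would show how many rows lose their completion. A large `prompt_too_long` count would be consistent with the model learning to emit nothing, though this is only one possible explanation.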

Here is my training loss:

Step	Training Loss	Validation Loss
20	0.297400	0.263025
40	0.214900	0.260796
60	0.120700	0.223059
80	0.102600	0.218879
100	0.090100	0.207737
120	0.070300	0.266736
140	0.030900	0.239069
160	0.053700	0.244473
180	0.049400	0.260747
200	0.049800	0.266690
220	0.039100	0.240245
240	0.033100	0.249509
260	0.049500	0.246138
280	0.024600	0.264563
300	0.027900	0.251799
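The table above can be scanned for the point where validation loss bottoms out. A small sketch using the logged values (`best_step` and `LOSS_LOG` are hypothetical names, not part of the training script):

```python
# (step, training loss, validation loss) copied from the table above
LOSS_LOG = [
    (20, 0.297400, 0.263025),
    (40, 0.214900, 0.260796),
    (60, 0.120700, 0.223059),
    (80, 0.102600, 0.218879),
    (100, 0.090100, 0.207737),
    (120, 0.070300, 0.266736),
    (140, 0.030900, 0.239069),
    (160, 0.053700, 0.244473),
    (180, 0.049400, 0.260747),
    (200, 0.049800, 0.266690),
    (220, 0.039100, 0.240245),
    (240, 0.033100, 0.249509),
    (260, 0.049500, 0.246138),
    (280, 0.024600, 0.264563),
    (300, 0.027900, 0.251799),
]


def best_step(log):
    """Return the (step, train_loss, val_loss) row with the lowest validation loss."""
    return min(log, key=lambda row: row[2])


if __name__ == "__main__":
    step, train_loss, val_loss = best_step(LOSS_LOG)
    print(f"lowest validation loss {val_loss} at step {step}")
```

Validation loss is lowest at step 100 and does not improve afterwards, while training loss keeps falling, which is the usual sign of overfitting. With evaluation every 20 steps and an early-stopping patience of 10, ten evaluations without improvement after step 100 end the run at step 300, matching the last row of the table.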

Here is my code:

!pip install trl
!pip install datasets
# json is part of the Python standard library and does not need to be installed

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, EarlyStoppingCallback
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
import json
import random
import os
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

from google.colab import drive
drive.mount('/content/drive')


file_path = '/content/drive/MyDrive/Colab Notebooks/restructured.jsonl'
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Load JSONL into a list of JSON objects
data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

MAX_CHARS_FOR_PAST_CHAT = 5000 # ~ 1500 tokens

def data_transform_func(example):
    course_str = example['course']
    user_query_str = example['UserQuery']
    resolved_query_str = example['ResolvedQuery']
    past_chat_str = example['pastChat']

    # Past chat can be really, really long
    # Low budget way to make sure training data is less than 2048 tokens
    if len(past_chat_str) > MAX_CHARS_FOR_PAST_CHAT:
        past_chat_str = past_chat_str[-MAX_CHARS_FOR_PAST_CHAT:]

    prompt =  f"[Course]{course_str}\n[PastChat]{past_chat_str}\n[UserQuery]{user_query_str}"

    return {"prompt": prompt, "completion": resolved_query_str}


random.shuffle(data)



# process and save data to Google Drive
train_dataSet_processed = list(map(data_transform_func, data))
output_file_path = '/content/drive/MyDrive/Colab Notebooks/restructured_processed.jsonl'
with open(output_file_path, 'w', encoding='utf-8') as f:
    for item in train_dataSet_processed:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')



# load data back from google drive
loaded_dataset = load_dataset("json", data_files="/content/drive/MyDrive/Colab Notebooks/restructured_processed.jsonl", split="train")
train_test_split = loaded_dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

output_file_path = '/content/drive/MyDrive/Colab Notebooks/restructured_processed_train.jsonl'
with open(output_file_path, 'w', encoding='utf-8') as f:
    for item in train_dataset:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

output_file_path = '/content/drive/MyDrive/Colab Notebooks/restructured_processed_test.jsonl'
with open(output_file_path, 'w', encoding='utf-8') as f:
    for item in eval_dataset:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

# load jsonl dataset back from drive
train_dataset = load_dataset("json", data_files="/content/drive/MyDrive/Colab Notebooks/restructured_processed_train.jsonl", split="train")
eval_dataset = load_dataset("json", data_files="/content/drive/MyDrive/Colab Notebooks/restructured_processed_test.jsonl", split="train")


# Actual training script

#model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=14,
    gradient_accumulation_steps=6,

    max_steps=100000,
    logging_steps=1,
    report_to=[],
    #bf16=True,

    # --- Early Stopping and Evaluation Arguments ---
    eval_strategy="steps",
    eval_steps=20,
    save_strategy="steps",
    save_steps=240,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    #optim="paged_adamw_32bit",
    # learning_rate=2e-4,
    # lr_scheduler_type="cosine",
    # warmup_ratio=0.03,
    # fp16=True,
)

# Build the trainer with the train and eval datasets
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.0)],
)

trainer.train()

#Save trained model to drive
output_directory = "/content/drive/MyDrive/Colab Notebooks/anotheroverfitted_model"
trainer.save_model(output_directory)

#load model from drive
from transformers import pipeline
model = AutoModelForCausalLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/anotheroverfitted_model")
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/Colab Notebooks/anotheroverfitted_model")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Use trained model to run inference on eval data
import transformers
from tqdm import tqdm

transformers.logging.set_verbosity_error()

good_sample = 0
stuff = []
for eval_sample in tqdm(eval_dataset):
  eval_prompt = eval_sample['prompt']
  # Greedy decoding, up to 200 new tokens
  eval_result = generator(eval_prompt, do_sample=False, max_new_tokens=200)
  generated_text = eval_result[0]['generated_text']
  # The pipeline returns prompt + continuation, so slice off the prompt
  prediction = generated_text[len(eval_prompt):]
  stuff.append({"prompt": eval_prompt, "completion": eval_sample['completion'], "prediction": prediction})
  # Count exact matches against the reference completion
  if prediction == eval_sample['completion']:
    good_sample += 1
  print(prediction)


print("num good samples: ", good_sample)
print(stuff)
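To separate truly empty generations from near misses, the `stuff` list collected above could be summarized along these lines. This is a sketch, not part of the original script: `summarize_predictions` is a hypothetical helper, and treating whitespace-only output as empty is an assumption about what "empty" should mean here.

```python
def summarize_predictions(records):
    """Summarize dicts that carry 'completion' and 'prediction' strings."""
    total = len(records)
    # Whitespace-only predictions are counted as empty (an assumption)
    empty = sum(1 for r in records if not r["prediction"].strip())
    # Strict match, the same comparison the inference loop uses
    exact = sum(1 for r in records if r["prediction"] == r["completion"])
    # Lenient match that ignores leading and trailing whitespace
    exact_after_strip = sum(
        1 for r in records if r["prediction"].strip() == r["completion"].strip()
    )
    return {
        "total": total,
        "empty": empty,
        "exact": exact,
        "exact_after_strip": exact_after_strip,
        "empty_rate": empty / total if total else 0.0,
    }


if __name__ == "__main__":
    demo = [
        {"completion": "policy for late submission", "prediction": ""},
        {"completion": "policy for late submission", "prediction": " policy for late submission"},
    ]
    print(summarize_predictions(demo))
```

Calling `summarize_predictions(stuff)` after the loop would report the empty rate directly. A gap between `exact` and `exact_after_strip` would suggest that some apparent failures are only whitespace differences.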

John6666 · June 19, 2025, 7:57am

Perhaps the options that need to be passed to the trainer are quite different for this model than for other models?
https://stackoverflow.com/questions/76857722/huggingface-sft-for-completion-only-not-working
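The linked question is about computing the loss on the completion only. As a general illustration of that idea (not the TRL implementation, and with hypothetical function names), prompt positions can be excluded from the loss by setting their labels to -100, the index that PyTorch's cross-entropy loss ignores by default:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_completion_only_labels(prompt_ids, completion_ids, max_length=None):
    """Join prompt and completion ids and mask the prompt part in the labels.

    If max_length is given, both sequences are truncated from the right,
    which is an assumption about how over-long rows are handled.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    if max_length is not None:
        input_ids = input_ids[:max_length]
        labels = labels[:max_length]
    return input_ids, labels


def num_supervised_tokens(labels):
    """Count label positions that actually contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


if __name__ == "__main__":
    prompt = [11, 12, 13, 14]   # toy token ids
    completion = [21, 22]
    _, full = build_completion_only_labels(prompt, completion)
    _, cut = build_completion_only_labels(prompt, completion, max_length=4)
    print(num_supervised_tokens(full), num_supervised_tokens(cut))
```

In the toy example the untruncated row supervises both completion tokens, while the row truncated to the prompt length supervises none. If something similar happens to the rows with a very long [PastChat], those rows would contribute no completion signal during training.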

github.com/huggingface/trl

`RewardTrainer` hits NaN output with quantized pretrained model

opened 04:45AM - 21 Jan 24 UTC

closed 03:05PM - 08 Mar 24 UTC

chenmoneygithub

Hi team, I am trying to tune my reward model (`opt-350m`) via `RewardTrainer`…, while it works fine without applying quantization, using int4 + lora hit the NaN problem after <10 steps. Here is the reproducible code: [github gist](https://colab.research.google.com/gist/chenmoneygithub/a2e0895b0dba6e49b440686cda5ed01b/reproduce-reward-model-nan-issue.ipynb?authuser=1#scrollTo=FFTBZa0iTkwf), tested on A100 environment. Could anyone provide any insight? I am also wondering how people debug quantization-related issue when using HuggingFace trainer, is it possible to print out the gradients, and outputs of certain layers inside the model? Thanks a lot!
