Llama 3.2 3B instruct model giving wrong answer

Hi,
I am trying the Llama 3.2 3B Instruct model on a simple radiology question, but most of the time it gives a wrong answer to the same question. Am I doing something wrong?
In summary, the correct answer is that no organs require dose adjustment, since the dose criteria are already fulfilled.
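
For reference, the expected reasoning is just two threshold checks. A minimal sketch of the comparison the model should be making (the values and limits are the ones from the prompt below; the variable names are mine, purely for illustration):

# Plan values and clinical limits taken from the question below.
ctv_d95 = 30.0                               # Gy; must be >= 25 Gy
oar_d2cc = {"bladder": 2.0, "rectum": 1.0}   # Gy; each must be <= 4 Gy

ctv_ok = ctv_d95 >= 25.0
organs_needing_adjustment = [organ for organ, dose in oar_d2cc.items() if dose > 4.0]

print("CTV coverage OK:", ctv_ok)                                     # True
print("Organs needing dose adjustment:", organs_needing_adjustment)   # [] -> none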

Code is here:
from transformers import pipeline
import torch

# Raw string avoids backslash escapes (a trailing \" would otherwise escape the closing quote)
modelPath = r"C:\codes\llama1B"

pipe = pipeline(
    "text-generation",
    model=modelPath,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    temperature=0.1,
)

prompt = """
In the case of our current patient, her D95 for the CTV is 30 Gy. The D2cc for the bladder is 2 Gy, and the D2cc for the rectum is 1 Gy. Which organ(s) require dose adjustments, if needed?
To assess the above patient plan, please note that

  1. Clinically, the D95 for the Clinical Target Volume (CTV) should be greater than or equal to 25 Gy.
  2. Further, the D2cc for all Organs at Risk (bladder and rectum) must be less than or equal to 4 Gy.
"""

messageStructure1 = [
    {"role": "system", "content": "You are a medical physicist, and you should answer as a medical physicist would."},
    {"role": "user", "content": prompt},
]

response = pipe(
    messageStructure1,
    max_new_tokens=512,
)

# The pipeline returns the full chat history; the last message is the assistant's reply
assistant_response = response[0]['generated_text'][-1]['content']
print("Assistant's Response:\n", assistant_response)


First of all, please be aware that the problem may be due to a lack of knowledge in the model itself, or to a bug in the model or library.
I also think that temperature = 0.1 is not a good setting for this case; at that value the model can end up playing a very free-association game.
Let's try setting it to around 0.6 or 0.7.
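
If it helps, the sampling parameters can also be passed per call instead of in the pipeline constructor. A minimal sketch using the value suggested above (note that do_sample=True is what actually enables temperature; top_p=0.9 is just a common default, not something from the original script):

response = pipe(
    messageStructure1,
    max_new_tokens=512,
    do_sample=True,   # sampling must be on for temperature to have any effect
    temperature=0.7,
    top_p=0.9,
)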

Thanks for your reply. I tried temperature settings of 0.6 to 0.7 and the problem remained the same. I downloaded this model from Hugging Face.

Over repeated runs on the same question, it sometimes gives a correct answer and sometimes a wrong one.

PS: The question I asked the model only requires comparing two quantities (current vs. target values). However, the model is failing at even that simple comparison.

For example, one of the model outputs is “2 Gy is greater than the clinically desired 4 Gy, so dose adjustments are needed for the bladder”, which is not true: 2 is less than 4.
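
One way to separate randomness from capability might be to switch off sampling entirely, so every run produces the same output. A sketch, assuming the same pipe and messageStructure1 as in the original script; if the comparison is still wrong under greedy decoding, the temperature setting is not the issue:

# Greedy (deterministic) decoding: removes run-to-run randomness.
response = pipe(
    messageStructure1,
    max_new_tokens=512,
    do_sample=False,
)
print(response[0]['generated_text'][-1]['content'])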


From the pattern of errors, it looks like the model itself just isn't capable enough.
3B parameters is very small for an LLM, so that's not surprising. (I think ChatGPT was over 1000B at the beginning.)
Even a small model can still be used for use cases that don't require much knowledge.
There are generally three ways to deal with this kind of situation: simply use a larger model; find and use a model that has already been trained on the relevant specialized knowledge; or fine-tune this 3B model yourself, on your own GPU or with a paid online service, to specialize it. A sketch of the first option is below.
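
For the first option, the only change to the original script is the model argument. A minimal sketch, assuming the larger checkpoint is pulled from the Hub (meta-llama/Llama-3.1-8B-Instruct is just one example of a bigger instruct model, and it is gated, so an access token is required):

from transformers import pipeline
import torch

# Same pipeline as before, but pointing at a larger instruct model on the Hub.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)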

It sounds like you’re dealing with some challenges in getting consistent, high-quality outputs for radiology-related questions. Adjusting temperature settings can help reduce randomness, but without domain-specific training, even a 3B model like Llama 3.2 might struggle to deliver the accuracy you need in such a specialized area.

An alternative to consider would be exploring larger, specialized models, especially if precision is essential. Platforms like PepperMill Beta could be a game-changer for you. They offer access to over 200 different LLMs, allowing you to evaluate and deploy the best fit for your specific needs. This kind of tailored approach can make a real difference in achieving reliable, domain-focused results in fields like radiology.