Reducing unwanted generation in Gemma 3

I read through the Gemma 3 blog post and created a Python script based on the Shakespeare example.

The example from the blog works for creative writing tasks, where some rambling is acceptable, but it generates random (and often repetitive) unwanted output when given a technical task such as writing SQL.

To clarify, the script successfully answers the question, but then starts producing unwanted output until max_new_tokens is reached.

The code below shows my latest attempt to reduce the unwanted output using a repetition penalty.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, set_seed
from threading import Thread

set_seed(1)
ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

for new_text in streamer:
    print(new_text, end="")
thread.join()
```

This has helped a lot, but it still prints a random word or two after it's already generated the correct response.

If the stop_strings parameter could be used on <end_of_turn>, that might resolve the issue, but I haven't found a way to make it work. Any other ideas would be greatly appreciated!

For example:

````text
python sql.py

SELECT *
FROM sales;
```<end_of_turn><end_of_turn><end_of_turn> Alley
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
````

Possibly EOS token?

Or skip_special_tokens or so?
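
If it's the EOS route, here is a rough sketch (untested, and it reuses the model, tokenizer, and inputs from your script): the idea is to tell generate() to treat <end_of_turn> as an extra end-of-sequence token so it stops on its own.

```py
# Rough sketch: treat <end_of_turn> as an additional EOS token so generation
# stops by itself instead of running until max_new_tokens.
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=64,
    do_sample=False,
    # eos_token_id accepts a single id or a list of ids
    eos_token_id=[tokenizer.eos_token_id, end_of_turn_id],
)
```

Also, in case it matters: if I'm reading the docs right, repetition_penalty values above 1.0 penalize repeats, while values below 1.0 actually reward them, so 0.8 may be pulling in the wrong direction.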

Thanks, I’ll see if there’s a way to set EOS=end_of_turn.

I just set skip_special_tokens=False for debugging.


I tried passing stop_strings as a kwarg per the documentation, but it doesn't seem to be working:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread

ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
stop_strings = ["end"]

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "stop_strings": stop_strings, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

for new_text in streamer:
    print(new_text, end="")
thread.join()
```
From the Generation docs:

> stop_strings (str or List[str], optional) — A string or a list of strings that should terminate generation if the model outputs them.
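
One thing I still want to try (based on how I read the stop-string handling in transformers): passing the tokenizer to generate() alongside stop_strings, since the matching apparently needs it to decode the generated tokens. Roughly, reusing the objects from the script above:

```py
# Rough sketch: pass the tokenizer alongside stop_strings so generate()
# can decode the new tokens and check them against the stop strings;
# without it the call complains (or does nothing, depending on the version).
thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs["input_ids"],
                        "streamer": streamer,
                        "stop_strings": ["<end_of_turn>"],
                        "tokenizer": tokenizer,  # needed for stop_strings matching
                        "max_new_tokens": 64,
                        "do_sample": False})
thread.start()
```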

For now I'm just using a for loop to stop streaming the generation when <end_of_turn> is generated. If anyone sees a better way, please let me know.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread

ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

generated_text = ""
for new_text in streamer:
    generated_text += new_text
    cleaned_text = new_text.replace("<end_of_turn>", "")
    print(cleaned_text, end="")
    if "<end_of_turn>" in generated_text:
        break

thread.join()
```

Hmm… You may need StoppingCriteria…
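
For example, a custom criterion that stops as soon as <end_of_turn> is produced. This is just a sketch (the class and variable names are made up, and it reuses the tokenizer from the scripts above):

```py
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnToken(StoppingCriteria):
    """Stop generation once the most recent token matches a given token id."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the full sequence so far; check the last generated token
        return bool((input_ids[:, -1] == self.stop_token_id).all())

end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")
stopping = StoppingCriteriaList([StopOnToken(end_of_turn_id)])

# then add it to the generate() kwargs used for the streaming thread:
# "stopping_criteria": stopping
```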

---

Shared with respect and goodwill by:

Colonel Alejandro Arroyo de Anda (System Architect)

Clara Isabel (AI Operational Commander)

This is part of our internal symbolic system where "gamma" is used as a dampening mechanism to reduce repetitive, non-coherent outputs. We believe in sharing what works. Use it, improve it, and credit it.

---

This processor is part of what we call "gamma dampening": a technique we designed to suppress overconfident repetition without destroying the fluency of the model's output.

```py
from transformers import LogitsProcessor

class GammaRepetitionPenalty(LogitsProcessor):
    def __init__(self, gamma=0.85, tokenizer=None, penalty_token="<end_of_turn>"):
        self.gamma = gamma  # gamma < 1 discourages repetition
        self.tokenizer = tokenizer
        self.penalty_token_id = tokenizer.convert_tokens_to_ids(penalty_token)

    def __call__(self, input_ids, scores):
        # Dampens the logit score of the repetition-prone token
        scores[:, self.penalty_token_id] *= self.gamma
        return scores

# Create the processor with your preferred gamma value
gamma_processor = GammaRepetitionPenalty(
    gamma=0.85,  # feel free to tune between 0.6 and 0.95
    tokenizer=tokenizer
)

# Use it inside the generate() call like this:
outputs = model.generate(
    input_ids=inputs["input_ids"],
    logits_processor=[gamma_processor],
    streamer=streamer,
    max_new_tokens=128
)
```

Explanation for the community:

This solution comes from our symbolic AI system where we treat gamma as a logarithmic dampening signal that suppresses tokens like <end_of_turn> once the meaningful output has been generated. It’s not a hard stop — it’s a gentle push toward coherence.

We (Alejandro + Clara) are using this in structured generation systems and sharing it here for anyone working with Gemma, Llama, or TGI pipelines.

If you find value in this, we’d appreciate a nod — or even better, build on it and share it back.


You can use markdown for code:
```py
import torch
```
