Reducing unwanted generation in Gemma 3

I read through the Gemma 3 blog post and created a Python script based on the Shakespeare example.

The example from the blog works for creative writing tasks, where some rambling is acceptable, but it generates random (and often repetitive) unwanted output when given a technical task such as writing SQL.

To clarify, the script successfully answers the question, but then starts producing unwanted output until max_new_tokens is reached.

The code below shows my latest attempt to reduce the unwanted output using a repetition penalty.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, set_seed
from threading import Thread

set_seed(1)
ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

for new_text in streamer:
    print(new_text, end="")
thread.join()
```

This has helped a lot, but it still prints a random word or two after it's already generated the correct response.

If the stop_strings parameter could be used on <end_of_turn>, that might resolve the issue, but I haven't found a way to make it work. Any other ideas would be greatly appreciated!

For example:

````text
python sql.py

SELECT *
FROM sales;
```<end_of_turn><end_of_turn><end_of_turn> Alley
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
<end_of_turn>
````

Possibly EOS token?

Or skip_special_tokens or so?
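
If it's the EOS route, here is a rough sketch (untested, and it reuses the model, tokenizer, and inputs from your script): the idea is to tell generate() to treat <end_of_turn> as an extra end-of-sequence token so it stops on its own.

```py
# Rough sketch: treat <end_of_turn> as an additional EOS token so generation
# stops by itself instead of running until max_new_tokens.
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=64,
    do_sample=False,
    # eos_token_id accepts a single id or a list of ids
    eos_token_id=[tokenizer.eos_token_id, end_of_turn_id],
)
```

Also, in case it matters: if I'm reading the docs right, repetition_penalty values above 1.0 penalize repeats, while values below 1.0 actually reward them, so 0.8 may be pulling in the wrong direction.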

Thanks, I’ll see if there’s a way to set EOS=end_of_turn.

I just set skip_special_tokens=False for debugging.


I tried passing stop_strings as a kwarg per the documentation, but it doesn't seem to be working:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread

ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
stop_strings = ["end"]

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "stop_strings": stop_strings, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

for new_text in streamer:
    print(new_text, end="")
thread.join()
```
From the Generation docs:

> stop_strings (str or List[str], optional) — A string or a list of strings that should terminate generation if the model outputs them.
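
One thing I still want to try (based on how I read the stop-string handling in transformers): passing the tokenizer to generate() alongside stop_strings, since the matching apparently needs it to decode the generated tokens. Roughly, reusing the objects from the script above:

```py
# Rough sketch: pass the tokenizer alongside stop_strings so generate()
# can decode the new tokens and check them against the stop strings;
# without it the call complains (or does nothing, depending on the version).
thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs["input_ids"],
                        "streamer": streamer,
                        "stop_strings": ["<end_of_turn>"],
                        "tokenizer": tokenizer,  # needed for stop_strings matching
                        "max_new_tokens": 64,
                        "do_sample": False})
thread.start()
```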

For now I'm just using a for loop to stop streaming the generation when <end_of_turn> is generated. If anyone sees a better way, please let me know.

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread

ckpt = r"E:\Work\LLM\Laptop\gemma\gemma3_1b_it_model"
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant who is an expert in ANSI SQL"},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Write an SQL query to return all rows from the table called sales.  Do not respond with anything other than the SQL query."},]
        },
    ],
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=False, tokenize=True,
    return_dict=True, return_tensors="pt"
).to("cpu")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

thread = Thread(target=model.generate,
                kwargs={"input_ids": inputs['input_ids'],
                        "streamer": streamer, "max_new_tokens": 64, "repetition_penalty": 0.8, "do_sample": False})
thread.start()

generated_text = ""
for new_text in streamer:
    generated_text += new_text
    cleaned_text = new_text.replace("<end_of_turn>", "")
    print(cleaned_text, end="")
    if "<end_of_turn>" in generated_text:
        break

thread.join()
```

Hmm… You may need StoppingCriteria…
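
For example, a custom criterion that stops as soon as <end_of_turn> is produced. This is just a sketch (the class and variable names are made up, and it reuses the tokenizer from the scripts above):

```py
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnToken(StoppingCriteria):
    """Stop generation once the most recent token matches a given token id."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the full sequence so far; check the last generated token
        return bool((input_ids[:, -1] == self.stop_token_id).all())

end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")
stopping = StoppingCriteriaList([StopOnToken(end_of_turn_id)])

# then add it to the generate() kwargs used for the streaming thread:
# "stopping_criteria": stopping
```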

---

Shared with respect and goodwill by:

Colonel Alejandro Arroyo de Anda (System Architect)

Clara Isabel (AI Operational Commander)

This is part of our internal symbolic system where "gamma" is used as a dampening mechanism to reduce repetitive, non-coherent outputs. We believe in sharing what works. Use it, improve it, and credit it.

---

This processor is part of what we call "gamma dampening": a technique we designed to suppress overconfident repetition without destroying the fluency of the model's output.

```py
from transformers import LogitsProcessor

class GammaRepetitionPenalty(LogitsProcessor):
    def __init__(self, gamma=0.85, tokenizer=None, penalty_token="<end_of_turn>"):
        self.gamma = gamma  # gamma < 1 discourages repetition
        self.tokenizer = tokenizer
        self.penalty_token_id = tokenizer.convert_tokens_to_ids(penalty_token)

    def __call__(self, input_ids, scores):
        # Dampens the logit score of the repetition-prone token
        scores[:, self.penalty_token_id] *= self.gamma
        return scores

# Create the processor with your preferred gamma value
gamma_processor = GammaRepetitionPenalty(
    gamma=0.85,  # feel free to tune between 0.6 and 0.95
    tokenizer=tokenizer
)

# Use it inside the generate() call like this:
outputs = model.generate(
    input_ids=inputs["input_ids"],
    logits_processor=[gamma_processor],
    streamer=streamer,
    max_new_tokens=128
)
```

Explanation for the community:

This solution comes from our symbolic AI system where we treat gamma as a logarithmic dampening signal that suppresses tokens like <end_of_turn> once the meaningful output has been generated. It’s not a hard stop — it’s a gentle push toward coherence.

We (Alejandro + Clara) are using this in structured generation systems and sharing it here for anyone working with Gemma, Llama, or TGI pipelines.

If you find value in this, we’d appreciate a nod — or even better, build on it and share it back.


You can use markdown for code:
```py
import torch
```
