The simplest likely cause: you forgot to call apply_chat_template.
You're printing many blanks because most early "tokens" decode to nothing after your filters. You're feeding raw chat markup and tags. The streamer removes the prompt and special tokens during decode, so several iterations yield "". Your loop still prints the markers for each "", so you see ###$$$ many times.
Why this happens
- Chat markup and special tokens get stripped
Your text contains ChatML-style control tokens and tags:
<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant
<think>
</think>
With skip_special_tokens=True, the tokenizer drops control tokens on decode. skip_prompt=True makes the streamer ignore the prompt part. Early decode steps often become empty strings. You still print them. That renders as repeated ###$$$. This is expected when feeding chat models raw markup while also skipping specials; see the short decode sketch after this list. (Hugging Face)
- The streamer buffers until it has "displayable words"
TextIteratorStreamer accumulates token pieces and only emits text when decoding forms complete, displayable spans. This can delay or suppress output for subword fragments. Combined with your skip filters, several iterations produce "". (Hugging Face)
- Subword and whitespace tokens don't always produce visible characters
BPE/SentencePiece commonly produce leading spaces or fragments. Until a boundary is closed, decode can be empty or only whitespace. Your print makes empties visible as ###$$$. The streamer's documented behavior is to emit when words materialize, not at every raw token. (Hugging Face)
- Multiple EOS and chat tags at the boundary
Modern chat models often use more than one stop token (for example, <|end_of_text|> and <|im_end|> or <|eot_id|>). If you don't stop at all relevant EOS tokens, the model may output extra headers or newlines that get stripped to "". Transformers supports a list for eos_token_id. Use it so generation ends cleanly at the first relevant stop. (Hugging Face)
- You hand-wrote the conversation instead of using the chat template
Most chat models expect a template. Hand-rolled markup can make the model emit scaffolding tokens first, which your decode then strips. apply_chat_template(..., add_generation_prompt=True) produces the exact format expected by that model's tokenizer and cleanly marks where assistant output should begin. That reduces spurious blanks. (Hugging Face)
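To see the first point concretely (control tokens decoding to nothing), you can decode one of them by hand. A minimal sketch, assuming your tokenizer registers <|im_end|> as a special/added token, as ChatML-style tokenizers usually do:
# Hypothetical check: how a control token decodes with and without the skip flag.
tok_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print(tokenizer.decode([tok_id], skip_special_tokens=False))  # typically prints "<|im_end|>"
print(tokenizer.decode([tok_id], skip_special_tokens=True))   # typically prints "": the empty chunk you were seeing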
What to change
A. Use chat templates instead of manual <|im_start|> strings
# Docs:
# - https://huggingface.co/docs/transformers/en/chat_templating
# - https://huggingface.co/docs/transformers/en/main_classes/text_generation
from threading import Thread
from transformers import TextIteratorStreamer

messages = [{"role": "user", "content": "hello"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # assistant output starts here
    return_tensors="pt",
).to("cuda")

streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

# Provide multiple EOS IDs so generation ends cleanly on any of them.
stop_tokens = ["<|im_end|>", "<|eot_id|>", "<|end_of_text|>"]
eos_list = []
for tok in stop_tokens:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # Skip tokens the tokenizer does not know (None or unk) and the pad token.
    if tok_id is not None and tok_id not in (tokenizer.unk_token_id, tokenizer.pad_token_id):
        eos_list.append(tok_id)

gen_kwargs = dict(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.8, top_k=20,
    eos_token_id=eos_list or tokenizer.eos_token_id,
    streamer=streamer,
)

# Run generate in a background thread; consume the streamer in the foreground.
Thread(target=model.generate, kwargs=gen_kwargs).start()

for chunk in streamer:
    if not chunk:  # ignore empty deltas
        continue
    print("###" + chunk + "$$$", end="")  # no extra newline
Rationale: the template matches training-time formatting and sets the assistant start boundary. That avoids leading control tokens and reduces empty emissions. EOS as a list handles multi-stop models. (Hugging Face)
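If you want to verify what the template actually produces before generating, apply_chat_template can return the rendered string instead of tensors (tokenize=False). A quick sketch, reusing the messages list above:
# Inspect the rendered prompt as plain text before tokenizing.
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_text)  # shows the exact markup the model expects, including control tokens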
B. Donât print empty or whitespace-only chunks
Minimal and effective:
for chunk in streamer:
    if not chunk or chunk.isspace():
        continue
    print("###" + chunk + "$$$", end="")
This keeps your markers readable and removes ###$$$ lines created by "" and pure "\n" deltas. Behavior aligns with the streamer's "emit words" design. (Hugging Face)
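If your UI also needs the complete reply at the end, a common variant is to accumulate every raw chunk while filtering only what you display. A small sketch of that pattern, reusing the streamer from section A:
pieces = []
for chunk in streamer:
    pieces.append(chunk)  # keep everything for the final transcript
    if not chunk or chunk.isspace():
        continue  # but don't display empty or whitespace-only deltas
    print("###" + chunk + "$$$", end="")
reply = "".join(pieces)  # full assistant text, whitespace intact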
C. Debug once with specials enabled
To confirm the root cause in your environment, run a single test with:
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

for chunk in streamer:
    print(repr(chunk), end="")  # show \n and specials explicitly
You'll see leading control tokens and newlines that were previously erased by skip_special_tokens=True. This verifies why "" showed up. (Hugging Face)
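You can also list which tokens the decode step will strip, to confirm the tags above are registered as specials for your tokenizer (a quick check, not model-specific advice):
# Which tokens does skip_special_tokens remove for this tokenizer?
print(tokenizer.all_special_tokens)         # e.g. eos/pad plus chat control tokens
print(tokenizer.additional_special_tokens)  # extra control tokens added for chat markup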
D. If your model emits tags like <think>...</think>
Add stop strings or stop IDs for those tags. Many users combine eos_token_id=[...] with substring stoppers to prevent internal tags reaching the UI:
# Example: custom stopping criteria that watch for substrings
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    def __init__(self, stop_strings):
        self.stop_strings = stop_strings

    def __call__(self, input_ids, scores, **kwargs):
        # Decode what has been generated so far and stop once any tag appears.
        text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
        return any(s in text for s in self.stop_strings)

stops = StoppingCriteriaList([StopOnStrings(["<think>", "</think>"])])
model.generate(..., stopping_criteria=stops, streamer=streamer)
Reason: some chat variants print scaffolding before content. Stopping criteria ensure your loop never receives them. (Patterns vary by model; templates help most.) (Hugging Face)
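Depending on your transformers version, you may not need a custom class at all: newer releases accept a stop_strings argument on generate, provided you also pass the tokenizer so the strings can be matched against decoded text. Check the generation docs for your installed version before relying on it. A sketch under that assumption, reusing gen_kwargs from section A:
# Assumption: a transformers release where generate supports stop_strings.
gen_kwargs.update(
    stop_strings=["<think>", "</think>"],
    tokenizer=tokenizer,  # required so generate can decode and match the strings
)
Thread(target=model.generate, kwargs=gen_kwargs).start()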
Mental model
- The streamer receives token IDs.
- It decodes incrementally and emits only when there is displayable text.
- You asked it to skip the prompt and special tokens, which turns many early iterations into "".
- Your print shows each "" as an empty payload between markers.
- Fix by providing proper chat formatting, giving all stop IDs, and skipping "" in your UI loop. (Hugging Face)
Quick checklist
- Use apply_chat_template(..., add_generation_prompt=True). (Hugging Face)
- Keep skip_special_tokens=True, but ignore "" or whitespace chunks. (Hugging Face)
- Provide all relevant EOS IDs. Many chat models need more than one. (Hugging Face)
- Optional: clean_up_tokenization_spaces=False for exact spacing during streaming. (Hugging Face)
- If needed, add substring stoppers for tags like <think>. Use minimally.
Curated references and similar cases
Official docs
- Transformers streaming utilities. Describes word-boundary emission and streamer parameters like skip_prompt and skip_special_tokens. Useful to understand why empties happen. (Hugging Face)
- Generation API. Confirms eos_token_id accepts a list for multiple stops. Key for chat models. (Hugging Face)
- Chat templates guide. Shows apply_chat_template(..., add_generation_prompt=True) and why templates prevent formatting mismatches. (Hugging Face)
- HF blog on chat templates. Explains training-time formats and why hand-rolled prompts degrade behavior. (Hugging Face)
- Tokenizer docs on skip_special_tokens. Confirms that specials are omitted from decode. (Hugging Face)
Issues and threads with analogous symptoms
- Beginners thread: delays and chunking behavior when streaming. Reinforces word-level emission and the need to run generate in a background thread with a streamer. (Hugging Face Forums)
- Discussion on multi-EOS not stopping cleanly for some chat models. Motivation for passing all stop IDs. (GitHub)
- FastAPI discussion showing iterator streaming patterns and why to skip empty chunks in the loop. Useful for server implementations. (GitHub)
Model-specific chat template notes
- Qwen chat docs show apply_chat_template usage and streaming patterns consistent with the above. Good cross-check if you test Qwen-style templates. (Qwen)