Why does using `TextIteratorStreamer` result in so many empty outputs?

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
parameters = dict(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=256,  # Increase for longer outputs!
    temperature=0.7, top_p=0.8, top_k=20,  # for non-thinking mode
    streamer=streamer
)
background_thread = Thread(target=model.generate, kwargs=parameters)
background_thread.start()

for new_token in streamer:
    print("###" + new_token + "$$$")

I used the code above and got this result:

###$$$
###$$$
###Hello! $$$
###How $$$
###can $$$
###I $$$
###assist $$$
###you $$$
###$$$
###today? $$$
###$$$
###$$$
###😊$$$

The text is '<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'


The simplest possibility is that you forgot to call apply_chat_template?


You’re printing many blanks because most early “tokens” decode to nothing after your filters. You’re feeding raw chat markup and <think> tags, and the streamer removes the prompt and special tokens during decode, so several iterations yield "". Your loop still prints the markers for each "", so you see ###$$$ many times.

Why this happens

  1. Chat markup and special tokens get stripped
    Your text contains ChatML-style control tokens and tags:
<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant
<think>

</think>

With skip_special_tokens=True, the tokenizer drops control tokens on decode, and skip_prompt=True makes the streamer ignore the prompt part. Early decode steps therefore often produce empty strings, your loop still prints them, and that renders as repeated ###$$$. This is expected when feeding chat models raw markup while also skipping specials; see the decode sketch after this list. (Hugging Face)

  2. The streamer buffers until it has “displayable words”
    TextIteratorStreamer accumulates token pieces and only emits text when decoding forms complete, displayable spans. This can delay or suppress output for subword fragments. Combined with your skip filters, several iterations produce "". (Hugging Face)

  3. Subword and whitespace tokens don’t always produce visible characters
    BPE/SentencePiece commonly produce leading spaces or fragments. Until a boundary is closed, decode can be empty or only whitespace. Your print makes empties visible as ###$$$. The streamer’s documented behavior is to emit when words materialize, not at every raw token. (Hugging Face)

  4. Multiple EOS and chat tags at the boundary
    Modern chat models often use more than one stop token (for example, <|end_of_text|> and <|im_end|> or <|eot_id|>). If you don’t stop at all relevant EOS tokens, the model may output extra headers or newlines that get stripped to "". Transformers supports a list for eos_token_id. Use it so generation ends cleanly at the first relevant stop. (Hugging Face)

  5. You hand-wrote the conversation instead of using the chat template
    Most chat models expect a template. Hand-rolled markup can make the model emit scaffolding tokens first, which your decode then strips. apply_chat_template(..., add_generation_prompt=True) produces the exact format expected by that model’s tokenizer and cleanly marks where assistant output should begin. That reduces spurious blanks. (Hugging Face)
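
To see point 1 in isolation, here is a minimal sketch, assuming a ChatML-style tokenizer where <|im_start|> and <|im_end|> are registered as special tokens (adjust the token names to your model):

# Illustrative check: decoding only control tokens with skip_special_tokens=True
# returns "", which is exactly the kind of empty chunk the streamer hands your loop.
ids = tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"])
print(repr(tokenizer.decode(ids, skip_special_tokens=True)))   # -> ''
print(repr(tokenizer.decode(ids, skip_special_tokens=False)))  # -> '<|im_start|><|im_end|>'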

What to change

A. Use chat templates instead of manual <|im_start|> strings

# Docs:
# - https://huggingface.co/docs/transformers/en/chat_templating
# - https://huggingface.co/docs/transformers/en/main_classes/text_generation

messages = [{"role": "user", "content": "hello"}]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,     # assistant starts here
    return_tensors="pt"
).to("cuda")

from transformers import TextIteratorStreamer
streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Provide multiple EOS to end cleanly on any of them.
stop_tokens = ["<|im_end|>", "<|eot_id|>", "<|end_of_text|>"]
eos_list = []
for tok in stop_tokens:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # keep only IDs that actually exist in this tokenizer's vocab
    if tok_id is not None and tok_id != tokenizer.unk_token_id:
        eos_list.append(tok_id)

gen_kwargs = dict(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.8, top_k=20,
    eos_token_id=eos_list or tokenizer.eos_token_id,
    streamer=streamer,
)

from threading import Thread
Thread(target=model.generate, kwargs=gen_kwargs).start()

for chunk in streamer:
    if not chunk:                   # ignore empty deltas
        continue
    print("###" + chunk + "$$$", end="")  # no extra newline

Rationale: the template matches training-time formatting and sets the assistant start boundary. That avoids leading control tokens and reduces empty emissions. EOS as a list handles multi-stop models. (Hugging Face)

B. Don’t print empty or whitespace-only chunks

Minimal and effective:

for chunk in streamer:
    if not chunk or chunk.isspace():
        continue
    print("###" + chunk + "$$$", end="")

This keeps your markers readable and removes ###$$$ lines created by "" and pure "\n" deltas. Behavior aligns with the streamer’s “emit words” design. (Hugging Face)

C. Debug once with specials enabled

To confirm the root cause in your environment, run a single test with:

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
for chunk in streamer:
    print(repr(chunk), end="")  # show \n and specials explicitly

You’ll see leading control tokens and newlines that were previously erased by skip_special_tokens=True. This verifies why "" showed up. (Hugging Face)

D. If your model emits tags like <think>...</think>

Add stop strings or stop IDs for those tags. Many users combine eos_token_id=[...] with substring stoppers to prevent internal tags reaching the UI:

# Example: custom stopper for substrings
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    def __init__(self, stop_strings):
        self.stop_strings = stop_strings

    def __call__(self, input_ids, scores, **kwargs):
        # Decode without skipping specials so tags registered as special tokens stay visible.
        text = tokenizer.decode(input_ids[0], skip_special_tokens=False)
        return any(s in text for s in self.stop_strings)

stops = StoppingCriteriaList([StopOnStrings(["<think>", "</think>"])])
model.generate(..., stopping_criteria=stops, streamer=streamer)

Reason: some chat variants print scaffolding before content. Stopping criteria ensure your loop never receives them. (Patterns vary by model; templates help most.) (Hugging Face)

Mental model

  • Streamer receives token IDs.
  • It decodes incrementally and emits only when there is displayable text.
  • You asked it to skip prompt and specials, which turn many early iterations into "".
  • Your print shows each "" as an empty payload between markers.
  • Fix by providing proper chat formatting, giving all stop IDs, and skipping "" in your UI loop. (Hugging Face)
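
If it helps to make that concrete, here is a rough sketch of the buffering idea. It is not the actual TextIteratorStreamer implementation (the real one also waits for word boundaries and handles prompt skipping), just a simplified decode-and-diff model:

from queue import Queue

class ToyStreamer:
    # Simplified model: keep all generated IDs, re-decode, emit only the new text.
    # The delta is often "" for special tokens or unfinished subword pieces.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.ids = []           # every generated token ID so far
        self.emitted = ""       # text already handed to the consumer
        self.queue = Queue()    # what the for-loop iterates over

    def put(self, token_id):
        self.ids.append(int(token_id))
        text = self.tokenizer.decode(self.ids, skip_special_tokens=True)
        delta = text[len(self.emitted):]
        self.emitted = text
        self.queue.put(delta)   # empty deltas still reach the consumer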

Quick checklist

  • Use apply_chat_template(..., add_generation_prompt=True). (Hugging Face)
  • Keep skip_special_tokens=True, but ignore "" or whitespace chunks. (Hugging Face)
  • Provide all relevant EOS IDs. Many chat models need more than one. (Hugging Face)
  • Optional: clean_up_tokenization_spaces=False for exact spacing during streaming. (Hugging Face)
  • If needed, add substring stoppers for tags like <think>. Use minimally.

Curated references and similar cases

Official docs

  • Transformers streaming utilities. Describes word-boundary emission and streamer parameters like skip_prompt and skip_special_tokens. Useful to understand why empties happen. (Hugging Face)
  • Generation API. Confirms eos_token_id accepts a list for multiple stops. Key for chat models. (Hugging Face)
  • Chat templates guide. Shows apply_chat_template(..., add_generation_prompt=True) and why templates prevent formatting mismatches. (Hugging Face)
  • HF blog on chat templates. Explains training-time formats and why hand-rolled prompts degrade behavior. (Hugging Face)
  • Tokenizer docs on skip_special_tokens. Confirms that specials are omitted from decode. (Hugging Face)

Issues and threads with analogous symptoms

  • Beginners thread: delays and chunking behavior when streaming. Reinforces word-level emission and the need to run generate in a background thread with a streamer. (Hugging Face Forums)
  • Discussion on multi-EOS not stopping cleanly for some chat models. Motivation for passing all stop IDs. (GitHub)
  • FastAPI discussion showing iterator streaming patterns and why to skip empty chunks in the loop. Useful for server implementations. (GitHub)

Model-specific chat template notes

  • Qwen chat docs show apply_chat_template usage and streaming patterns consistent with the above. Good cross-check if you test Qwen-style templates. (Qwen)

I used the chat template instead of hand-writing the prompt; I just didn’t mention it.


I still get so many empty outputs even with skip_prompt=False and skip_special_tokens=False.


In this case, regardless of the specific cause, it’s likely an avoidable issue.

But generally speaking, whether you use Transformers’ streamer or not, raw LLM output often retains leftover template elements and other garbage, and manual post-processing is frequently necessary. I recommend building your pipeline with the expectation that you will do some post-processing yourself.
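
For example, a minimal post-processing sketch; the tag list and patterns are illustrative, so adjust them to whatever your model actually leaks:

import re

# Illustrative cleanup of raw LLM output: drop empty <think> blocks,
# strip leftover ChatML-style control tags, then trim whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
CONTROL_TAGS = re.compile(r"<\|im_start\|>|<\|im_end\|>|<\|eot_id\|>|<\|end_of_text\|>")

def postprocess(raw: str) -> str:
    text = THINK_BLOCK.sub("", raw)
    text = CONTROL_TAGS.sub("", text)
    return text.strip()

print(postprocess("<think>\n\n</think>\n\nHello! How can I assist you today?<|im_end|>"))
# -> 'Hello! How can I assist you today?'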


Root cause: you are printing newline-only and zero-delta chunks. TextIteratorStreamer decodes tokens incrementally and only yields user-displayable text when it decides a “word” or span is formed. Early generation from chat models usually includes headers, control tokens, and several \n. Even with skip_prompt=False and skip_special_tokens=False, many yielded chunks are just "\n" or spaces, which your print("###" + new + "$$$") renders as apparently empty lines. This is expected behavior for the streamer and for chat-templated inputs. (Hugging Face)

What is happening under the hood

  • Streamer buffering policy. The streamer processes incoming token IDs and emits text “as soon as [entire] words are formed.” That implies internal buffering and non-character granularity. Subword tokens and whitespace often accumulate without adding visible glyphs, so intermediate deltas are "" or "\n". (Hugging Face)

  • Chat templates add structure and newlines. apply_chat_template(..., add_generation_prompt=True) builds a prompt with role headers and separators. Many built-in templates put a newline before the assistant turn, and models often start their reply with one or more newlines. Those tokens stream out first and show up as blank prints. (Hugging Face)

  • Decode cleanup can hide deltas. Decoding may “clean up tokenization spaces,” collapsing artifacts and yielding the same visible string after a step. The streamer still iterates, but the delta you print is "". Disable cleanup to reduce zero-delta iterations while debugging. (Hugging Face)

  • Multiple stop tokens. Modern chat models use several EOS markers (e.g., <|eot_id|>, <|im_end|>, <|end_of_text|>). If you only stop on one, the model often appends headers or trailing newlines that become extra whitespace-only chunks at the end. Transformers allows a list for eos_token_id. (Hugging Face)

  • Your print adds more line breaks. print("###" + new + "$$$") writes a newline after the chunk. If new already equals "\n", the terminal sees two line breaks. Use end="" and filter whitespace-only spans. Community patterns and Q&A about “word-by-word vs sentence-by-sentence” streaming describe the same effect. (Stack Overflow)
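
A tiny illustration of that last point (the chunk value is made up):

chunk = "\n"                           # a newline-only delta from the streamer
print("###" + chunk + "$$$")           # the chunk's own "\n" splits the markers, and print() adds another newline
print("###" + chunk + "$$$", end="")   # end="" avoids the extra line break
print(repr(chunk))                     # '\n' -- repr() makes whitespace deltas visible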

Concrete fixes

  1. Filter whitespace-only and zero-delta chunks.
    Minimal change that removes the visual noise.
for part in streamer:
    if not part or part.isspace():
        continue
    print(part, end="", flush=True)  # no extra newline each step

This matches recommended streaming loops that skip empty strings from the iterator. (Stack Overflow)

  2. Disable decode cleanup during streaming.
streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=False,
    skip_special_tokens=False,
    clean_up_tokenization_spaces=False,  # key for fewer zero-delta steps
)

The decode API documents these kwargs; turning cleanup off makes emitted text match raw decoding and reduces “no-op” emits. (Hugging Face)

  3. Stop cleanly on all relevant EOS tokens.
stop_tokens = ["<|eot_id|>", "<|im_end|>", "<|end_of_text|>"]
eos_list = [tid for tok in stop_tokens
            if (tid := tokenizer.convert_tokens_to_ids(tok))
            not in (None, tokenizer.unk_token_id)]

gen_kwargs = dict(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.8, top_k=20,
    eos_token_id=eos_list or tokenizer.eos_token_id,
    streamer=streamer,
)

Passing a list for eos_token_id prevents trailing header/newline runs that decode to whitespace-only chunks. (Hugging Face)

  4. Optional: add substring stoppers for tags like <think> or double newlines.
    For models that emit scaffolding, add a StoppingCriteria that checks decoded text for "</think>" or "\n\n\n", etc. Documentation and examples show how to implement custom stoppers. (Hugging Face)

  5. Inspect the template and the actual stream once.

  • Print repr(part) for one run to confirm that most early chunks are "\n" or spaces.
  • Verify your template’s separators; some add blank lines by design. (Hugging Face)
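
A one-off diagnostic run along those lines might look like the following sketch; it reuses the model and tokenizer from above, and nothing here changes how you generate in production:

from threading import Thread
from transformers import TextIteratorStreamer

# 1) Render the template as a string to spot separators and built-in newlines.
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(repr(rendered))

# 2) Log every streamed delta verbatim for one short generation.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
inputs = tokenizer(rendered, return_tensors="pt", add_special_tokens=False).to(model.device)
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=64, streamer=streamer)).start()

for part in streamer:
    print(repr(part))   # "\n", "", and partial words become visible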

Why you still saw empties even with skip_prompt=False and skip_special_tokens=False

  • Those flags only control whether the prompt and specials are removed on decode. They don’t alter the model’s tendency to start with \n or the streamer’s buffering. You’ll still receive many "\n" chunks and partials. The streamer itself is documented to emit text at word boundaries, not at every token, so you will observe iterations that carry no new visible glyphs. (Hugging Face)

Quick diagnostic checklist

  • Turn cleanup off: clean_up_tokenization_spaces=False. (Hugging Face)
  • Log repr(part) for a short run to see \n explicitly.
  • Filter if not part or part.isspace(): continue in the loop. (Stack Overflow)
  • Provide all EOS IDs. (Hugging Face)
  • Keep apply_chat_template(..., add_generation_prompt=True) so the assistant boundary is set. (Hugging Face)

Curated references

Core docs

  • Hugging Face chat templates guide: how apply_chat_template structures turns and why add_generation_prompt matters. Useful to reason about leading/trailing newlines. (Hugging Face)
  • Generation API: eos_token_id can be a list. Prevents trailing whitespace bursts. (Hugging Face)
  • Tokenizer decode options: skip_special_tokens and clean_up_tokenization_spaces. (Hugging Face)
  • Streamers: emit chunks when words are formed; iterator semantics. (Hugging Face)

Threads and examples

  • FastAPI “word-by-word vs sentence” streaming with TextIteratorStreamer. Confirms buffering and the need to adjust your loop. (Stack Overflow)

I added clean_up_tokenization_spaces=False and eos_token_id=eos_list or tokenizer.eos_token_id, but the empty outputs don’t change at all for the same test.
emmm…
If the model generates the token IDs for Hello, !, and Ġ one by one (I can find all three in vocab.json), each of them can easily be decoded into visible characters, so why does it put two empty strings into the streamer?