SFT Conversation llama3-8b-Instruct fails with assistant_only_loss=True

Hello,

I am trying to run supervised fine-tuning on my own dataset, but I am stuck on this error:
RuntimeError: You’re using assistant_only_loss=True, but at least one example has no assistant tokens. This usually means the tokenizer’s chat template doesn’t generate assistant masks — it may be missing the {% generation %} keyword. Please check the template and ensure it’s correctly configured to support assistant masking.

Here is my tokenized conversation sample:

{‘messages’: [{‘role’: ‘user’, ‘content’: ‘Hi everyone! Please help with this API. Context: adding new URLs to an existing custom category. Could anybody explain the difference between "url"and "dbcategorizedurls"in this context, please?’}, {‘role’: ‘assistant’, ‘content’: ‘It’s the difference between URL’s retaining categories and those over-riding the category.’}], ‘formatted_chat’: {‘assistant_masks’: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], ‘input_ids’: [128000, 128006, 882, 128007, 271, 13347, 5127, 0, 1988, 9002, 1901, 5987, 5446, 13, 9805, 25, 7999, 502, 36106, 311, 459, 6484, 2587, 5699, 13, 16910, 21739, 10552, 279, 6811, 1990, 330, 1103, 1, 438, 330, 15245, 58711, 21141, 55501, 420, 2317, 11, 4587, 30, 578, 2930, 2027, 9904, 320, 16564, 279, 5446, 5905, 8, 3250, 956, 3493, 904, 16540, 922, 420, 13, 11361, 304, 12178, 0, 128009, 128006, 78191, 128007, 271, 2181, 596, 279, 6811, 1990, 5665, 596, 51110, 11306, 323, 1884, 927, 12, 50780, 279, 5699, 13, 11361, 13678, 1131, 779, 36106, 3984, 304, 279, 330, 1103, 1, 798, 19625, 6857, 690, 14389, 904, 1023, 5699, 814, 617, 320, 258, 5369, 311, 279, 2587, 5699, 8, 24797, 369, 36106, 3984, 304, 279, 330, 15245, 58711, 21141, 1, 906, 2351, 927, 12, 50780, 904, 1023, 22824, 2065, 3984, 555, 1901, 5987, 323, 690, 617, 279, 2587, 5699, 439, 872, 353, 3323, 9, 5699, 7366, 4741, 1131, 2209, 420, 4495, 30, 350, 5987, 0, 7566, 11, 499, 2351, 30230, 902, 315, 1521, 43212, 701, 7999, 2288, 25, 11361, 0, 5112, 1701, 279, 330, 1103, 1, 798, 19625, 6857, 304, 279, 5446, 374, 13890, 311, 7999, 279, 5665, 311, 279, 2587, 5699, 1701, 279, 330, 10480, 36106, 1, 2054, 304, 279, 16840, 11, 1314, 30, 5659, 279, 1520, 3189, 83, 279, 330, 9342, 36106, 1, 2054, 25, 721, 6403, 279, 5665, 1161, 8, 499, 1390, 311, 923, 311, 420, 4194, 9342, 5699, 4194, 438, 4299, 4194, 9, 2261, 19974, 20517, 8572, 36106, 527, 1193, 9960, 555, 10396, 430, 5905, 420, 3230, 2587, 5699, 13, 1789, 3187, 11, 422, 499, 3810, 366, 1277, 1129, 2185, 18771, 916, 91, 2185, 18771, 916, 29, 1618, 11, 
433, 690, 912, 5129, 387, 9960, 555, 10396, 430, 5905, 8267, 30064, 11, 1202, 2748, 5665, 5699, 1462, 358, 4510, 779, 11, 499, 2643, 1205, 311, 1296, 433, 11, 743, 709, 1772, 1543, 11, 2713, 264, 5743, 36106, 1518, 1405, 814, 4363, 13, 10619, 11, 1690, 9523, 0, 128009]}}

What am I missing?

Stack trace from python:

  File "/home/ubuntu/train_llama3.py", line 417, in doing
    trainer = SFTTrainer(
              ^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 855, in __init__
    train_dataset = self._prepare_dataset(
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1090, in _prepare_dataset
    dataset = dataset.map(
              ^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 562, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3343, in map
    for rank, done, content in Dataset._map_single(**unprocessed_kwargs):
  File "/home/ubuntu/.local/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3675, in _map_single
    for i, example in iter_outputs(shard_iterable):
  File "/home/ubuntu/.local/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3649, in iter_outputs
    yield i, apply_function(example, i, offset=offset)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3572, in apply_function
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 1082, in tokenize_fn
    raise RuntimeError(
RuntimeError: You're using `assistant_only_loss=True`, but at least one example has no assistant tokens. This usually means the tokenizer's chat template doesn't generate assistant masks — it may be missing the `{% generation %}` keyword. Please check the template and ensure it's correctly configured to support assistant masking.

There seem to be a few possible causes:


Why you get this error with your sample

Your tokenized example already contains the smoking gun:

  • assistant_masks is all zeros → TRL sees no assistant tokens → it raises the RuntimeError when assistant_only_loss=True. (GitHub)

Background (what TRL is relying on):

  • TRL’s assistant_only_loss=True depends on the tokenizer returning an assistant-token mask (assistant_masks), which is produced by tokenizer.apply_chat_template(..., return_assistant_tokens_mask=True). (Hugging Face)
  • Transformers will only produce a non-zero assistant mask if the chat template explicitly marks assistant spans using the Jinja {% generation %} block. If the template doesn’t support it, the mask comes back as all zeros. (Hugging Face)

So you are not missing an “assistant message” in your dataset—you are missing assistant span markers in the template (or a working masking path).


The two most common causes (both match real reports)

Cause 1 — Your tokenizer’s chat_template does not include {% generation %}

This is the most common reason, and it’s explicitly called out in the docs and multiple model/template PRs. (Hugging Face)

Cause 2 — Llama-3 family masking bug / mismatch in some setups

Even when you try to use return_assistant_tokens_mask=True, some Llama-3.x configurations have been reported to return incorrect masks (all zeros) compared to Llama-2. (GitHub)


Step 1: Confirm which cause you have (minimal repro)

Run this in the same environment you train in:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

msgs = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

enc = tok.apply_chat_template(
    msgs,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
    add_generation_prompt=False,
)

print("has_generation_tag:", "{% generation %}" in (tok.chat_template or ""))
print("assistant_ones:", sum(enc["assistant_tokens_mask"]) if "assistant_tokens_mask" in enc else sum(enc["assistant_masks"]))

Interpretation:

  • has_generation_tag=False and assistant_ones=0 → template cannot produce masks. (Hugging Face)
  • has_generation_tag=True but assistant_ones=0 → you are likely hitting the masking-path issue seen in reports for Llama-3.x. (GitHub)
  • assistant_ones>0 here, but training still fails → some dataset rows lose assistant tokens (typically truncation). (GitHub)
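The three cases above can be folded into a small helper so the check is easy to script. This is only a sketch: diagnose is a hypothetical name, and its two inputs are the values printed by the minimal repro.

```python
def diagnose(has_generation_tag: bool, assistant_ones: int) -> str:
    """Map the two values printed by the minimal repro to a likely cause."""
    if assistant_ones > 0:
        # Masks work on a short sample, so a failure during training points
        # at individual dataset rows, most often truncation.
        return "masks ok: check dataset rows for truncation"
    if not has_generation_tag:
        return "template lacks {% generation %}: fix the chat template"
    return "template has {% generation %} but mask is empty: masking-path issue"


print(diagnose(has_generation_tag=False, assistant_ones=0))
```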

Step 2: Fix the template (recommended fix if Cause 1)

You want a Llama-3 prompt format plus {% generation %} around assistant content.

Meta’s Llama-3 Instruct prompt format uses <|start_header_id|>…<|end_header_id|> and end-of-turn markers. (llama.com)

Example template that adds generation tags:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Whitespace control ({%- ... -%}) keeps the line breaks used for readability
# out of the rendered prompt; without it, stray newlines end up in the text.
tok.chat_template = r"""
{%- for message in messages -%}
{%- if loop.index0 == 0 -%}{{ bos_token }}{%- endif -%}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{%- if message['role'] == 'assistant' -%}
{% generation %}{{ message['content'] | trim }}{% endgeneration %}{{ '<|eot_id|>' }}
{%- else -%}
{{ message['content'] | trim }}{{ '<|eot_id|>' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif -%}
"""

Then rerun the minimal repro. If assistant_ones becomes > 0, TRL’s assistant_only_loss=True will stop erroring for that reason. (Hugging Face)
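A non-zero count alone does not show that the right tokens are marked. As a rough check, assuming the encoding holds parallel input_ids and assistant_masks lists like the sample in the question, you can keep only the masked token IDs and decode them. masked_ids is a hypothetical helper name, and the values below are toy numbers rather than real token IDs.

```python
def masked_ids(input_ids, assistant_masks):
    """Return the token IDs whose mask entry is 1 (the assistant tokens)."""
    if len(input_ids) != len(assistant_masks):
        raise ValueError("input_ids and assistant_masks must have equal length")
    return [tok_id for tok_id, m in zip(input_ids, assistant_masks) if m == 1]


# Toy values for illustration only; real IDs come from the tokenizer.
ids = [10, 11, 12, 13, 14]
mask = [0, 0, 1, 1, 0]
print(masked_ids(ids, mask))
```

With a real encoding, tok.decode(masked_ids(enc["input_ids"], enc["assistant_masks"])) should print only the assistant reply text.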

Important Llama-3-specific pitfall: wrong default template

Some Llama-3 repos/configs have been observed to ship a “generic HF template” (e.g., using <|im_start|>…) rather than the expected Llama-3 header format; overriding the template avoids silent mismatches. (Hugging Face)


Step 3: Guard against truncation creating “no assistant tokens” (common hidden culprit)

Even if masks work, TRL can still fail if the assistant part is truncated away by the sequence-length limit (max_length in current SFTConfig; older TRL releases called it max_seq_length):

  • If assistant tokens start after max_length, assistant_masks becomes all zeros for that row. (GitHub)

Practical check (find bad rows before training):

def has_assistant_tokens(example, tok, max_len=4096):
    enc = tok.apply_chat_template(
        example["messages"],
        tokenize=True,
        return_dict=True,
        return_assistant_tokens_mask=True,
        truncation=True,
        max_length=max_len,
        add_generation_prompt=False,
    )
    m = enc.get("assistant_tokens_mask", enc.get("assistant_masks"))
    return 1 in m

bad = [i for i, ex in enumerate(train_dataset) if not has_assistant_tokens(ex, tok)]
print("bad rows:", len(bad), "first few:", bad[:10])

If bad is non-empty:

  • increase the sequence-length limit (max_length, or max_seq_length in older TRL releases), or
  • truncate earlier turns (system/user) more aggressively than assistant, or
  • drop/repair those rows.
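To choose between these options it helps to know where the assistant reply starts in each bad row. A minimal sketch, assuming you have the untruncated assistant mask for a row and that truncation keeps the first max_len tokens; first_assistant_index and survives_truncation are hypothetical helper names.

```python
def first_assistant_index(mask):
    """Index of the first assistant token, or None if the mask has no 1s."""
    for i, m in enumerate(mask):
        if m == 1:
            return i
    return None


def survives_truncation(mask, max_len):
    """True if at least one assistant token lies within the first max_len tokens."""
    first = first_assistant_index(mask)
    return first is not None and first < max_len


mask = [0] * 6 + [1] * 4  # toy row: the assistant reply starts at position 6
print(first_assistant_index(mask))
print(survives_truncation(mask, max_len=5))
print(survives_truncation(mask, max_len=8))
```

A row whose first assistant index is only slightly above the limit is a candidate for a larger limit; a row that is far above it is usually easier to drop or shorten.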

Step 4: If masks stay broken on Llama-3.x (Cause 2), use a robust workaround

There are documented cases where return_assistant_tokens_mask=True behaves unexpectedly for Llama-3.x. (GitHub)
If your minimal repro still yields all-zero masks after adding {% generation %}, the most reliable options are:

Workaround A — Prompt/Completion dataset (completion-only loss) + disable assistant masks

TRL supports prompt–completion training; by default it computes loss only on completion tokens. (Hugging Face)
You generate:

  • prompt: formatted conversation up to the assistant header
  • completion: assistant answer (and end token if desired)

This completely sidesteps assistant-token masking.
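One way to build such rows is sketched below. It is a variant of the idea above: instead of pre-formatting strings, it keeps the messages as lists in the conversational prompt/completion layout described in TRL's dataset-format docs, as I understand them, and lets the trainer apply the chat template. to_prompt_completion is a hypothetical helper, and it only trains on the final assistant turn of each conversation.

```python
def to_prompt_completion(example):
    """Split a {"messages": [...]} row at its final assistant turn.

    Everything before the final assistant message becomes the prompt;
    the final assistant message becomes the completion.
    """
    messages = example["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant turn")
    return {"prompt": messages[:-1], "completion": messages[-1:]}


row = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there"},
    ]
}
print(to_prompt_completion(row))
```

On a datasets.Dataset this would typically be applied with train_dataset.map(to_prompt_completion, remove_columns=["messages"]), with assistant_only_loss left at its default of False.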

Workaround B — Manually build labels (assistant-only) in preprocessing

This recreates “assistant-only loss” without depending on tokenizer masks:

  • labels = input_ids.copy()
  • set non-assistant spans to -100

This is exactly the original motivation behind adding assistant-mask support in Transformers. (GitHub)
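A rough sketch of that preprocessing for the Llama-3 format follows. The special-token IDs are the ones visible in the tokenized sample in the question (128006, 78191, 128007, 271 for the assistant header and 128009 for the end of turn); hard-coding them is an assumption for illustration, and real code should look them up through the tokenizer. assistant_only_labels is a hypothetical name. Unlike the template in Step 2, this version also keeps the closing <|eot_id|> in the loss so the model learns where to stop, which is a design choice rather than a requirement.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

# IDs as they appear in the tokenized sample in this thread (assumption:
# look them up via the tokenizer instead of hard-coding them).
START_HEADER, END_HEADER, EOT = 128006, 128007, 128009
ASSISTANT, DOUBLE_NEWLINE = 78191, 271
ASSISTANT_HEADER = [START_HEADER, ASSISTANT, END_HEADER, DOUBLE_NEWLINE]


def assistant_only_labels(input_ids):
    """Copy input_ids, replacing everything outside assistant replies with -100."""
    labels = [IGNORE_INDEX] * len(input_ids)
    n = len(ASSISTANT_HEADER)
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + n] == ASSISTANT_HEADER:
            j = i + n
            # Unmask the reply up to and including its closing <|eot_id|>.
            while j < len(input_ids):
                labels[j] = input_ids[j]
                if input_ids[j] == EOT:
                    break
                j += 1
            i = j + 1
        else:
            i += 1
    return labels


# Toy sequence: BOS, user turn, assistant turn. The content IDs 9906, 13347
# and 1070 are arbitrary placeholders.
ids = [128000, 128006, 882, 128007, 271, 9906, 128009,
       128006, 78191, 128007, 271, 13347, 1070, 128009]
print(assistant_only_labels(ids))
```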


Configuration pitfalls to avoid (so your fix actually applies)

  • Packing and masks: there are reports of masking behavior changing/breaking with packing-related flows and similar settings. If you’re debugging, start with packing=False. (GitHub)
  • Kernel/config interactions: there are configurations where assistant_masks can be discarded/ignored, making loss computation differ from what you expect. (GitHub)

What you are “missing” in one sentence

You have assistant messages, but your pipeline is not producing a usable assistant-token mask (because the chat template lacks {% generation %} or the masking path is broken), and TRL refuses to run assistant_only_loss=True when assistant_masks contains no 1s. (Hugging Face)

Wow, thank you very much for this VERY complete response. You got me going in the right direction. The issue turned out to be that my tokenizer's chat_template was not being used during the training run. It was used during data prep, but the trainer tokenizes the data again and runs its own check, which failed because the chat_template it used did not contain this line:

{% generation %}{{ message['content'] | trim }}{% endgeneration %}{{ '<|eot_id|>' }}

I resolved it by saving my Jinja template to a file and loading it in the SFTConfig with the parameter chat_template_path='./my_chat_template.jinja'.
I will update the thread once I have determined what the correct Jinja template should be.
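For anyone landing here later, a minimal sketch of that setup is below. The template is a placeholder based on the one suggested earlier in the thread, not a confirmed final version; write_template and build_config are hypothetical helper names, and the output directory is an arbitrary choice.

```python
from pathlib import Path

# Placeholder template (assumption): swap in whatever template you settle on.
TEMPLATE = r"""
{%- for message in messages -%}
{%- if loop.index0 == 0 -%}{{ bos_token }}{%- endif -%}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{%- if message['role'] == 'assistant' -%}
{% generation %}{{ message['content'] | trim }}{% endgeneration %}{{ '<|eot_id|>' }}
{%- else -%}
{{ message['content'] | trim }}{{ '<|eot_id|>' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif -%}
"""


def write_template(path):
    """Write the Jinja template to disk and return its path."""
    path = Path(path)
    path.write_text(TEMPLATE.strip(), encoding="utf-8")
    return path


def build_config(template_path, output_dir="./sft-out"):
    """Build an SFTConfig that loads the chat template from a file."""
    # Imported lazily so the file-writing part runs without TRL installed.
    from trl import SFTConfig

    return SFTConfig(
        output_dir=output_dir,
        assistant_only_loss=True,
        chat_template_path=str(template_path),
    )


template_path = write_template("my_chat_template.jinja")
print(template_path)
```

The returned config from build_config(template_path) is then passed to SFTTrainer as args, so the trainer tokenizes with the same template that was used during data prep.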
