Evaluation of expert router logits simultaneously with generation

So I’m having an issue and cannot understand why I am getting a certain error. In my example, I want to call .generate and have it return:

  • The “normal” logits
  • The expert routing logits
  • The generated text

However, I can’t seem to do that simultaneously; it seems I need to generate the text twice. Am I missing something here?

Setup code:

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from transformers import BitsAndBytesConfig
import torch

set_seed(1234)
local_dir = "./models/Phi-tiny-MoE-instruct"

torch.cuda.empty_cache()

model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    torch_dtype="auto",
    device_map=generate_device_map(30, (1,2)),
    output_router_logits=False,  # Temp set for testing
)

tokenizer = AutoTokenizer.from_pretrained(local_dir, model_max_length=4096, padding=True, truncation=True, max_length=4096)

inputs = tokenizer("Hello world!", return_tensors="pt").to(model.device)

I’d like to do the following:

generation_output = model.generate(**inputs, return_dict_in_generate=True, output_logits=True, output_router_logits=True) # This should contain the router logits

generation_sequences = generation_output.sequences # This is the text, both prompt and generated
generation_logits = generation_output.logits # This is the logits for generation
router_logits = generation_output.router_logits # This is the logits for the layer that decides which experts things go to

But instead, it seems I need to do this:

generation_output = model.generate(**inputs, return_dict_in_generate=True, output_logits=True)

generation_sequences = generation_output.sequences # This is the text, both prompt and generated
generation_logits = generation_output.logits # This is the logits for generation

model_output = model(input_ids=inputs['input_ids'], output_router_logits=True)

router_logits = model_output.router_logits

Otherwise, the first approach errors out like so:

Loading weights: 485/? [00:01<00:00, 790.24it/s, Materializing param=lm_head.bias]

PhimoeForCausalLM LOAD REPORT from: ./models/Phi-tiny-MoE-instruct
Key                                                 | Status     | 
----------------------------------------------------+------------+-
model.layers.{0...31}.mlp.gate.weight               | UNEXPECTED | 
model.layers.{0...31}.input_layernorm.bias          | UNEXPECTED | 
model.layers.{0...31}.post_attention_layernorm.bias | UNEXPECTED | 
model.layers.{0...31}.mlp.router.weight             | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing form the checkpoint. Consider training on your downstream task.
Some parameters are on the meta device because they were offloaded to the cpu.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 49
     44 inputs = tokenizer("Hello world!", return_tensors="pt").to(model.device)
     46 ##################################
     47 ### This doesn't work! ###
     48 ##################################
---> 49 generation_output = model.generate(**inputs, return_dict_in_generate=True, output_logits=True, output_router_logits=True) # This should contain the router logits
     51 generation_sequences = generation_output.sequences # This is the text, both prompt and generated
     52 generation_logits = generation_output.logits # This is the logits for generation

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/torch/utils/_contextlib.py:120, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    117 @functools.wraps(func)
    118 def decorate_context(*args, **kwargs):
    119     with ctx_factory():
--> 120         return func(*args, **kwargs)

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/transformers/generation/utils.py:2678, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, use_model_defaults, custom_generate, **kwargs)
   2675 model_kwargs["use_cache"] = generation_config.use_cache
   2677 # 9. Call generation mode
-> 2678 result = decoding_method(
   2679     self,
   2680     input_ids,
   2681     logits_processor=prepared_logits_processor,
   2682     stopping_criteria=prepared_stopping_criteria,
   2683     generation_config=generation_config,
   2684     **generation_mode_kwargs,
   2685     **model_kwargs,
   2686 )
   2688 return result

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/transformers/generation/utils.py:2876, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   2874 if prefill_consumed:
   2875     model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
-> 2876     outputs = model_forward(**model_inputs, return_dict=True)
   2877 prefill_consumed = True
   2878 model_kwargs = self._update_model_kwargs_for_generation(
   2879     outputs,
   2880     model_kwargs,
   2881     is_encoder_decoder=self.config.is_encoder_decoder,
   2882 )

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/torch/nn/modules/module.py:1773, in Module._wrapped_call_impl(self, *args, **kwargs)
   1771     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1772 else:
-> 1773     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/torch/nn/modules/module.py:1784, in Module._call_impl(self, *args, **kwargs)
   1779 # If we don't have any hooks, we want to skip the rest of the logic in
   1780 # this function, and just call forward.
   1781 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1782         or _global_backward_pre_hooks or _global_backward_hooks
   1783         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1784     return forward_call(*args, **kwargs)
   1786 result = None
   1787 called_always_called_hooks = set()

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/accelerate/hooks.py:175, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    173         output = module._old_forward(*args, **kwargs)
    174 else:
--> 175     output = module._old_forward(*args, **kwargs)
    176 return module._hf_hook.post_forward(module, output)

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/transformers/utils/generic.py:768, in can_return_tuple.<locals>.wrapper(self, *args, **kwargs)
    766 if return_dict_passed is not None:
    767     return_dict = return_dict_passed
--> 768 output = func(self, *args, **kwargs)
    769 if not return_dict and not isinstance(output, tuple):
    770     output = output.to_tuple()

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/transformers/models/phimoe/modeling_phimoe.py:885, in PhimoeForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_router_logits, cache_position, logits_to_keep, **kwargs)
    883 aux_loss = None
    884 if output_router_logits:
--> 885     aux_loss = load_balancing_loss_func(
    886         outputs.router_logits,
    887         self.num_experts,
    888         self.num_experts_per_tok,
    889         attention_mask,
    890     )
    891     if labels is not None:
    892         loss += self.router_aux_loss_coef * aux_loss.to(loss.device)  # make sure to reside in the same device

File ~/miniconda3/envs/<env_name>/lib/python3.12/site-packages/transformers/models/phimoe/modeling_phimoe.py:779, in load_balancing_loss_func(gate_logits, num_experts, top_k, attention_mask)
    771 expert_attention_mask = (
    772     attention_mask[None, :, :, None, None]
    773     .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
    774     .reshape(-1, top_k, num_experts)
    775     .to(compute_device)
    776 )
    778 # Compute the percentage of tokens routed to each experts
--> 779 tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
    780     expert_attention_mask, dim=0
    781 )
    783 # Compute the mask that masks all padding tokens as 0 with the same shape of tokens_per_expert
    784 router_per_expert_attention_mask = (
    785     attention_mask[None, :, :, None]
    786     .expand((num_hidden_layers, batch_size, sequence_length, num_experts))
    787     .reshape(-1, num_experts)
    788     .to(compute_device)
    789 )

RuntimeError: The size of tensor a (32) must match the size of tensor b (30) at non-singleton dimension 0

Why does it generate this error?


device_map="auto"


Thanks for the suggestion, but that does not solve my issue and I did manual mapping for a specific reason (that is, auto mode causes a cuda OOM error on my machine) :+1:


In short, bug…?


You are not missing anything. The error comes from a known limitation/bug in how the PhiMoE auxiliary load-balancing loss is implemented when combined with .generate(...) and output_router_logits=True. During generation, the model only routes the new tokens, but the aux-loss code assumes router logits for the entire sequence and uses the full attention_mask length. That mismatch produces the 32 vs 30 size error.

Below is the detailed breakdown.


1. What you are asking the model to do

You are trying to get, in one .generate(...) call:

  1. The normal language-model logits (for token prediction).
  2. The router logits (the MoE gating scores).
  3. The generated text.

You tried:

generation_output = model.generate(
    **inputs,
    return_dict_in_generate=True,
    output_logits=True,
    output_router_logits=True,
)

Conceptually, this is reasonable:

  • output_logits=True → return token logits at each generation step.
  • output_router_logits=True → return router logits and enable MoE auxiliary loss. (Hugging Face)

But in the PhiMoE implementation, output_router_logits=True does more than just “return router logits”: it also forces computation of a load-balancing auxiliary loss that assumes training-like full-sequence shapes. That assumption is violated inside .generate(...).


2. Where the error actually happens

The relevant part of your stack trace:

File ... modeling_phimoe.py:885, in PhimoeForCausalLM.forward(...)
    aux_loss = load_balancing_loss_func(
        outputs.router_logits,
        self.num_experts,
        self.num_experts_per_tok,
        attention_mask,
    )

File ... modeling_phimoe.py:779, in load_balancing_loss_func(...)
    tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
        expert_attention_mask, dim=0
    )

RuntimeError: The size of tensor a (32) must match the size of tensor b (30) at non-singleton dimension 0

So:

  • You call .generate(...) with output_router_logits=True.

  • PhimoeForCausalLM.forward(...) calls load_balancing_loss_func(...) with:

    • gate_logits = outputs.router_logits
    • attention_mask = <whatever generate built> (Hugging Face)
  • Inside load_balancing_loss_func, it tries to multiply expert_mask and expert_attention_mask along the first dimension.

  • That first dimension is 32 for one tensor and 30 for the other → crash.

The numbers 32 and 30 are not random; they reflect:

  • 32 = number of Transformer layers in PhiMoE (num_hidden_layers). (aidoczh.com)
  • 30 = a sequence length inferred from the attention_mask during that particular step.

So the core bug is: the number of (layer × token) rows in the router logits doesn’t match the number of (layer × token) rows implied by the attention_mask.


3. How load_balancing_loss_func is written

The function is defined roughly like this (simplified, from the Phi-3.5 MoE implementation, which Phi-tiny-MoE inherits architecturally): (Hugging Face)

def load_balancing_loss_func(gate_logits, num_experts=None, top_k=2, attention_mask=None):
    # gate_logits: tuple of length num_hidden_layers
    # each element: [batch_size * seq_length, num_experts]

    if gate_logits is None or not isinstance(gate_logits, tuple):
        return 0

    # 1. Concatenate router logits from all layers
    compute_device = gate_logits[0].device
    concatenated = torch.cat([layer_gate.to(compute_device) for layer_gate in gate_logits], dim=0)
    routing_weights = torch.softmax(concatenated, dim=-1)

    # 2. Pick top-k experts and build a one-hot expert mask
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)  # shape: [L*B*S, top_k, num_experts]

    if attention_mask is None:
        # no masking -> simple averages
        tokens_per_expert = expert_mask.float().mean(dim=0)
        router_prob_per_expert = routing_weights.mean(dim=0)
    else:
        batch_size, sequence_length = attention_mask.shape

        # *** Infer number of layers from concatenated size ***
        num_hidden_layers = concatenated.shape[0] // (batch_size * sequence_length)

        # 3. Build an expanded attention mask aligned with expert_mask
        expert_attention_mask = (
            attention_mask[None, :, :, None, None]
            .expand(num_hidden_layers, batch_size, sequence_length, top_k, num_experts)
            .reshape(-1, top_k, num_experts)
        )

        tokens_per_expert = (expert_mask.float() * expert_attention_mask).sum(dim=0) / expert_attention_mask.sum(dim=0)

        router_per_expert_attention_mask = (
            attention_mask[None, :, :, None]
            .expand(num_hidden_layers, batch_size, sequence_length, num_experts)
            .reshape(-1, num_experts)
        )
        router_prob_per_expert = (routing_weights * router_per_expert_attention_mask).sum(dim=0) / router_per_expert_attention_mask.sum(dim=0)

    overall_loss = (tokens_per_expert * router_prob_per_expert.unsqueeze(0)).sum()
    return overall_loss * num_experts

Key assumptions:

  1. gate_logits are a tuple of length num_hidden_layers.

  2. Each element has shape [batch_size * sequence_length, num_experts].

  3. Therefore, after concatenation:

    concatenated.shape[0] == num_hidden_layers * batch_size * sequence_length
    
  4. Given an attention_mask of shape [batch_size, sequence_length], they reconstruct num_hidden_layers by:

    num_hidden_layers = concatenated.shape[0] // (batch_size * sequence_length)
    
  5. Then they build expert_attention_mask by repeating the attention mask num_hidden_layers times and flattening, so its first dimension is:

    num_hidden_layers * batch_size * sequence_length
    
  6. They expect that to match the first dimension of expert_mask, which comes directly from concatenated.

This is perfectly consistent for training-style full-sequence forwards, where:

  • Every layer sees the full sequence [batch_size, sequence_length].
  • Router logits are computed for every token at every layer.
  • The attention mask matches the sequence length used to compute the router logits.
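
To see why the training-style case is consistent, here is a tiny standalone sketch of that shape arithmetic. All numbers (32 layers, batch 1, 30 tokens, 16 experts) are made up for illustration, not read from the model:

import torch

# Synthetic training-style shapes; every value here is arbitrary.
num_hidden_layers, batch_size, seq_len = 32, 1, 30
num_experts = 16

# One router-logit tensor per layer, covering every token in the sequence.
gate_logits = tuple(
    torch.randn(batch_size * seq_len, num_experts) for _ in range(num_hidden_layers)
)
concatenated = torch.cat(gate_logits, dim=0)  # [32 * 30, 16]

attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
inferred_layers = concatenated.shape[0] // (batch_size * seq_len)
print(inferred_layers)  # 32 -> matches the real layer count, so the masks line up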

4. Why your manual forward works

When you do:

model_output = model(input_ids=inputs["input_ids"], output_router_logits=True)
router_logits = model_output.router_logits

two important things happen:

  1. You are doing a single full forward over the prompt (no cache, no generation loop).
  2. In your snippet you did not pass an attention_mask.

That means:

  • Each layer’s router runs over all prompt tokens.

  • gate_logits really do cover [batch * sequence_length] tokens per layer.

  • attention_mask is None in load_balancing_loss_func, so it uses the simple branch:

    • tokens_per_expert = expert_mask.mean(dim=0)
    • router_prob_per_expert = routing_weights.mean(dim=0)
  • No masking, no sequence-length inference, no mismatch.

So a direct model(...) call is fine because it does not enter the masked path that causes trouble.
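
If you want to confirm this on your side, a quick check along these lines (reusing the model and inputs from the original post) should show one router-logit tensor per layer covering the whole prompt. The shapes in the comments are what I’d expect from a Mixtral-style implementation, so verify them against your own run:

import torch

with torch.no_grad():
    model_output = model(input_ids=inputs["input_ids"], output_router_logits=True)

router_logits = model_output.router_logits
print(type(router_logits), len(router_logits))
# expected: a tuple with one entry per decoder layer (should be 32 for this model)

print(router_logits[0].shape)
# expected: [batch_size * prompt_length, num_experts]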


5. What changes when you call .generate(...)

Now compare to:

generation_output = model.generate(
    **inputs,
    return_dict_in_generate=True,
    output_logits=True,
    output_router_logits=True,
)

What .generate does for a decoder-only model like PhiMoE: (Gitee)

  1. Prefill step (first iteration):

    • The model is called with the full prompt.
    • input_ids length = prompt_length.
    • past_key_values = None.
    • Router logits are computed for all prompt tokens in all layers.
    • attention_mask shape matches prompt_length.
    • load_balancing_loss_func is happy.
  2. Autoregressive decoding steps (subsequent tokens):

    • prepare_inputs_for_generation is called.

    • It constructs inputs so that:

      • input_ids now usually contain only the new, unprocessed tokens (often a single token).
      • past_key_values carries the cached key/value states for previous tokens.
      • attention_mask typically still has the full length for all tokens seen so far (prompt + generated).
    • The model’s MoE layers now compute router logits only for the new tokens.

    • So for each decoding step:

      • gate_logits cover only the new tokens (e.g., 1 token per layer).
      • attention_mask covers the entire history (e.g., 30 tokens).
  3. In PhimoeForCausalLM.forward, because output_router_logits=True, the code always calls load_balancing_loss_func(outputs.router_logits, ..., attention_mask). (Hugging Face)

    • gate_logits = tuple of length 32 (layers), each shaped [batch * 1, num_experts].
    • Concatenated: [32 * batch * 1, num_experts] → first dim = 32.
    • attention_mask.shape = [batch, sequence_length_total], say [1, 30] for this step.
    • batch_size, sequence_length = attention_mask.shape → batch_size=1, sequence_length=30.
  4. load_balancing_loss_func computes:

    num_hidden_layers = concatenated.shape[0] // (batch_size * sequence_length)
                      = 32 // (1 * 30)
                      = 1   # integer division
    

    This is already wrong: you actually have 32 layers × 1 token, not 1 layer × 30 tokens. But the function can’t know that; it’s inferring num_hidden_layers from inconsistent shapes.

  5. Then it creates expert_attention_mask:

    expert_attention_mask = (
        attention_mask[None, :, :, None, None]
        .expand(num_hidden_layers, batch_size, sequence_length, top_k, num_experts)
        .reshape(-1, top_k, num_experts)
    )
    

    With num_hidden_layers=1, batch_size=1, sequence_length=30:

    • Expanded shape first dim = 1 * 1 * 30 = 30.
  6. But expert_mask was created from routing_weights which came from concatenated:

    • First dim = concatenated.shape[0] = 32 (32 layers × 1 token × 1 batch).
  7. Now the line that crashes:

    tokens_per_expert = torch.sum(expert_mask.float() * expert_attention_mask, dim=0) / ...
    

    is trying to multiply tensors whose first dimension is:

    • expert_mask: 32
    • expert_attention_mask: 30

    That is exactly your error:

    RuntimeError: The size of tensor a (32) must match the size of tensor b (30)
    

So the root cause is:

  • During cached generation, router logits correspond to only the new tokens.
  • The attention_mask still describes all tokens so far.
  • load_balancing_loss_func assumes they describe the same grid (layers × batch × sequence_length).
  • The resulting mis-inference of num_hidden_layers makes the mask tensor size incompatible with the expert mask tensor.
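
The same synthetic sketch as above, but for a cached decoding step (again with made-up layer/expert counts), reproduces the mismatch without loading any model:

import torch

# A cached decoding step: routers see only the 1 new token,
# but the attention mask covers all 30 tokens seen so far.
num_hidden_layers, batch_size, new_tokens, seq_len_total = 32, 1, 1, 30
top_k, num_experts = 2, 16

gate_logits = tuple(
    torch.randn(batch_size * new_tokens, num_experts) for _ in range(num_hidden_layers)
)
concatenated = torch.cat(gate_logits, dim=0)            # first dim = 32
attention_mask = torch.ones(batch_size, seq_len_total)  # [1, 30]

inferred_layers = concatenated.shape[0] // (batch_size * seq_len_total)
print(inferred_layers)  # 32 // 30 = 1 -> wrong

expert_attention_mask = (
    attention_mask[None, :, :, None, None]
    .expand(inferred_layers, batch_size, seq_len_total, top_k, num_experts)
    .reshape(-1, top_k, num_experts)
)
print(concatenated.shape[0], expert_attention_mask.shape[0])  # 32 vs 30 -> the crash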

6. This is a known pattern: Mixtral & other MoE models

This exact pattern has already shown up with other MoE models such as Mixtral:

  • GitHub issue “Mixtral inference breaks when output_router_logits=True.” (GitHub)

    • Users enable output_router_logits=True and call .generate(...).
    • They hit a nearly identical error inside load_balancing_loss_func: “size of tensor a must match size of tensor b…”.
    • The maintainers confirm that output_router_logits was intended for training (where you have full sequences and labels), not for inference/generation with caching.

Hugging Face model docs for several MoE architectures (Switch Transformers, NLLB-MoE, Qwen2-MoE, OLMoE, etc.) all describe output_router_logits as: (Hugging Face)

  • “Whether or not to return the logits of all the routers.”
  • “They are useful for computing the router loss.”
  • “They should not be returned during inference.”

PhiMoE’s config follows the same pattern:

  • The config has flags output_router_logits and router_aux_loss_coef (default 0.001). (aidoczh.com)
  • PhimoeForCausalLM.forward uses output_router_logits to decide whether to call load_balancing_loss_func(...) and include the MoE auxiliary loss in the total loss. (Hugging Face)

Your call to .generate(..., output_router_logits=True) is effectively forcing training-style routing loss computation into the inference path, which is exactly the scenario known to be fragile.
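
If you want to double-check which flags your loaded config actually carries, something like this is enough. The attribute names follow the stock PhiMoE config described above; a trust_remote_code checkpoint may name things differently, hence the getattr fallbacks:

# Inspect the routing-related config flags on the loaded model.
print(getattr(model.config, "output_router_logits", None))
print(getattr(model.config, "router_aux_loss_coef", None))
print(getattr(model.config, "num_local_experts", None))
print(getattr(model.config, "num_experts_per_tok", None))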


7. The “UNEXPECTED” / “MISSING” weights in your log

You also saw:

model.layers.{0...31}.mlp.gate.weight               | UNEXPECTED
model.layers.{0...31}.input_layernorm.bias          | UNEXPECTED
model.layers.{0...31}.post_attention_layernorm.bias | UNEXPECTED
model.layers.{0...31}.mlp.router.weight             | MISSING

This tells you:

  • The checkpoint you’re loading (SlimMoE / Phi-tiny-MoE) has parameter names like mlp.gate.weight.
  • The architecture class you’re using (PhimoeForCausalLM from transformers) expects mlp.router.weight etc. (Hugging Face)
  • So some parameters are “unexpected” (present in the checkpoint but not in the model) and “missing” (expected by the model but randomly initialized instead).

The Phi-tiny-MoE model card shows the recommended loading pattern: (Hugging Face)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-tiny-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,  # important for SlimMoE
)

That trust_remote_code=True ensures you use Microsoft’s SlimMoE implementation, which matches the checkpoint exactly.

However:

  • These load warnings are not what causes the 32 vs 30 shape error.
  • The shape error is purely from the load_balancing_loss_func + attention_mask + caching mismatch described above.
  • Fixing the architecture (via trust_remote_code=True) is still strongly recommended for correctness, but the MoE aux-loss incompatibility with .generate + output_router_logits=True remains unless you change the logic.
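
A quick, hypothetical sanity check on the loading side is to list which gating/router parameters the instantiated model actually has, and compare them with the names in the load report:

# List the gating/router parameter names the loaded model ended up with.
routing_params = [name for name in model.state_dict() if "gate" in name or "router" in name]
print(routing_params[:5])
# If the model expects `mlp.router.weight` while the checkpoint ships
# `mlp.gate.weight`, those router weights were freshly initialized, not loaded.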

8. Answering your original questions directly

Why does it generate this error?

Because during generation:

  • The MoE router logits are computed only for the newest tokens (e.g., 1 token per step).

  • The attention mask still reflects the entire sequence (prompt + previous generations).

  • load_balancing_loss_func tries to reconstruct how many layers and tokens there are by assuming that:

    • gate_logits length = num_hidden_layers * batch_size * sequence_length, where sequence_length comes from the attention mask.
  • With cached decoding, that assumption is false: gate_logits cover fewer tokens than the attention mask says.

  • It therefore infers the wrong num_hidden_layers and builds expanded masks of incompatible size, leading to the 32 vs 30 mismatch when multiplying expert_mask and expert_attention_mask.

Am I missing something, or do I really need to generate twice?

You are not conceptually wrong; the implementation just wasn’t designed for exactly what you’re doing. Out of the box, with the current Hugging Face PhiMoE implementation:

  • A single .generate(..., output_router_logits=True) call is not safe with caching.
  • The “generate once, then separate forward for router logits” pattern you ended up with is the robust, supported way if you stay on stock code.
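
One refinement of that two-pass pattern, sketched below: run the second forward over the full generated sequence rather than just the prompt, so the router logits cover the same tokens as the generated text. This assumes a single, unpadded sequence; with batching or padding you would also need to supply a matching attention mask.

import torch

# Pass 1: generate text and token logits as usual (no router logits here).
generation_output = model.generate(
    **inputs,
    max_new_tokens=64,
    return_dict_in_generate=True,
    output_logits=True,
)

# Pass 2: one full forward over prompt + generated tokens to get router logits
# for every token, without triggering the masked aux-loss path.
with torch.no_grad():
    full_forward = model(
        input_ids=generation_output.sequences,
        output_router_logits=True,
    )

router_logits = full_forward.router_logits  # tuple: one tensor per layer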

If you want everything in one pass, you need one of these workarounds:

  1. Disable caching during generation for analysis runs:

    generation_output = model.generate(
        **inputs,
        max_new_tokens=64,
        return_dict_in_generate=True,
        output_logits=True,
        output_router_logits=True,
        use_cache=False,  # important
    )
    

    Then every step recomputes the full sequence, so router logits and attention mask stay aligned. This is slower but often fine for short generations.

  2. Patch the model so that the MoE aux loss is only computed when labels are present (i.e., during training). For example, inside PhimoeForCausalLM.forward you can change:

    aux_loss = None
    if output_router_logits:
        aux_loss = load_balancing_loss_func(..., attention_mask)
        if labels is not None:
            loss += self.router_aux_loss_coef * aux_loss.to(loss.device)
    

    to something like:

    aux_loss = None
    if output_router_logits and labels is not None:
        aux_loss = load_balancing_loss_func(..., attention_mask)
        loss += self.router_aux_loss_coef * aux_loss.to(loss.device)
    

    Then output_router_logits=True in inference just returns the router logits without calling load_balancing_loss_func, avoiding the shape mismatch.

  3. Write a small custom generation loop (see the sketch after this list) that:

    • Calls the MoE base model directly with output_router_logits=True.
    • Applies model.lm_head manually to get token logits.
    • Never calls load_balancing_loss_func at all (you only log or store the raw router logits).

All of these let you get “normal logits + router logits + generated text” without re-generating the text twice, but they require stepping a bit outside the default .generate path.
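
Here is a minimal sketch of workaround 3: a greedy decoding loop that never touches load_balancing_loss_func. The attribute names (model.model, model.lm_head) match the stock PhimoeForCausalLM; a trust_remote_code implementation may differ. For simplicity it re-runs the full sequence every step instead of using the KV cache, which is slow but keeps every shape unambiguous:

import torch

input_ids = inputs["input_ids"]
all_router_logits = []  # one tuple of per-layer router logits per step
all_token_logits = []   # one [batch, vocab] tensor per generated token

with torch.no_grad():
    for _ in range(32):  # cap on new tokens for this sketch
        # Call the base transformer directly; no aux loss is ever computed.
        base_out = model.model(input_ids=input_ids, output_router_logits=True)
        hidden = base_out.last_hidden_state          # [batch, seq_len, hidden]
        logits = model.lm_head(hidden[:, -1, :])     # next-token logits

        all_router_logits.append(base_out.router_logits)  # tuple: one tensor per layer
        all_token_logits.append(logits)

        next_token = logits.argmax(dim=-1, keepdim=True)  # greedy pick, batch size 1
        input_ids = torch.cat([input_ids, next_token], dim=-1)

        if next_token.item() == tokenizer.eos_token_id:
            break

generated_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)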


9. Suggested further reading and references

A short, curated set of useful resources:

A. MoE routing and router logits

  • Switch Transformers docs (Hugging Face) – Clearly explain output_router_logits and emphasize that router logits are for computing the router loss and “should not be returned during inference.” (Hugging Face)
  • NLLB-MoE / Qwen2-MoE / OLMoE docs – Use almost identical wording for output_router_logits, reinforcing the “training-only” intent for router logits + aux loss. (Hugging Face)

Useful for seeing the design pattern across multiple MoE families.

B. PhiMoE / SlimMoE-specific

  • PhiMoE model docs in Transformers – Show the config fields (num_hidden_layers, num_experts_per_tok, output_router_logits, router_aux_loss_coef) and describe the PhiMoE architecture that Phi-tiny-MoE is based on. (Hugging Face)
  • modeling_phimoe.py on Hugging Face – Contains the actual load_balancing_loss_func and the PhimoeForCausalLM.forward logic that calls it; helpful to inspect the exact shape operations. (Hugging Face)
  • Phi-tiny-MoE model card – Shows the recommended way to load the model (trust_remote_code=True) and provides background on SlimMoE and the compressed MoE architecture. (Hugging Face)

Good for aligning your code with the intended loading / architecture.

C. Similar issues and MoE routing debugging

  • GitHub issue: “Mixtral inference breaks when output_router_logits=True” – Documents almost the same crash inside load_balancing_loss_func and discusses how router logits are supposed to be used only in training. (GitHub)
  • MoE architecture overviews / tutorials – For intuition on how experts, routers, and auxiliary losses work (e.g. Switch Transformers and general MoE explainers). (Hugging Face)

These will give you more context for interpreting and, if you like, customizing the router behaviour.


So in summary:

  • The error is caused by a shape mismatch inside the MoE load-balancing loss when you enable output_router_logits=True during cached generation.
  • The mismatch arises because router logits are computed for only the new tokens, while attention_mask still covers the full sequence, violating the loss function’s assumption.
  • Your current “generate once + separate forward for router logits” pattern is the safe way with stock code; alternative single-pass solutions require disabling cache or patching the model’s aux-loss logic.
