Llama4 routing scores

Here, in the `Llama4TextMoe` forward pass, the router scores are applied before the experts layer (that is, before the projections and the non-linearity), whereas usually the scores are applied to the output of the whole expert block. As a result, the scores contribute to the result more than once. Is this an intentional architectural change or just a bug in the HF implementation? I didn't find any mention of it.
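To make the difference concrete, here is a minimal pure-Python sketch. It is not the HF code: the gated (SwiGLU-style) expert, the toy weights, and the names `scale_before` and `scale_after` are all assumptions made up for illustration. It only shows that scaling the input by the score is not equivalent to scaling the expert output once a non-linearity is involved.

```python
import math

def silu(z):
    # SiLU / swish activation: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    # Plain matrix-vector product on nested lists
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def expert(x, w_gate, w_up, w_down):
    # Toy gated MLP (assumed expert shape): down(silu(gate(x)) * up(x))
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Hypothetical toy weights, chosen arbitrarily
W_GATE = [[0.5, -1.0], [1.5, 0.25]]
W_UP = [[1.0, 0.5], [-0.5, 2.0]]
W_DOWN = [[1.0, -1.0], [0.5, 0.5]]

def scale_before(x, score):
    # Score applied to the expert input (the ordering described in the question)
    return expert([score * xi for xi in x], W_GATE, W_UP, W_DOWN)

def scale_after(x, score):
    # Score applied to the expert output (the usual MoE ordering)
    return [score * yi for yi in expert(x, W_GATE, W_UP, W_DOWN)]

x = [1.0, 2.0]
score = 0.3
pre = scale_before(x, score)
post = scale_after(x, score)
max_diff = max(abs(a - b) for a, b in zip(pre, post))
print("scaled before expert:", pre)
print("scaled after expert: ", post)
print("max abs difference:  ", max_diff)
```

In this toy setup the score enters both the gate branch (through the non-linearity) and the up branch (linearly) when it is applied first, so the two orderings only agree when the score is 1.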


It’s just a hunch, but I think it might be a bug…

It doesn’t seem to have been reported as an issue yet.