Here, in the Llama4TextMoe forward pass, the router scores are applied before the experts layer (that is, before the projections and the non-linearity), whereas scores are usually applied after the whole expert block has been processed. As a result, the scores contribute to the output more than once. Is this an intentional architectural change, or just a bug in the HF implementation? I couldn't find any mention of it.
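To make the difference concrete, here is a toy comparison (a minimal sketch, not the actual HF Llama 4 code; the single expert and the dimensions are made up) of applying the router score to the expert input versus the expert output. Because the expert is a gated MLP, a score applied to the input passes through both the gate and the up projection, so its effect on the output is no longer a single linear factor:

```python
import torch
import torch.nn.functional as F

# Toy single-expert sketch (NOT the actual HF Llama 4 code); dimensions are made up.
torch.manual_seed(0)
d_model, d_ff = 8, 16
gate_proj = torch.nn.Linear(d_model, d_ff, bias=False)
up_proj = torch.nn.Linear(d_model, d_ff, bias=False)
down_proj = torch.nn.Linear(d_ff, d_model, bias=False)

def expert(x):
    # Standard gated (SwiGLU-style) MLP: down(silu(gate(x)) * up(x))
    return down_proj(F.silu(gate_proj(x)) * up_proj(x))

x = torch.randn(1, d_model)
score = 0.3  # hypothetical router score for this token

out_pre = expert(score * x)   # score applied to the expert *input* (as in the question)
out_post = score * expert(x)  # conventional MoE: score applied to the expert *output*

# Post-scaling rescales the output by exactly `score`; pre-scaling pushes the
# score through both gate_proj and up_proj, so it enters the gated product
# twice (modulated by the SiLU non-linearity) and the results differ:
print((out_post.norm() / expert(x).norm()).item())  # exactly 0.3
print((out_pre.norm() / expert(x).norm()).item())   # not 0.3
```

In the toy case the post-scaled output is rescaled by exactly the score, while the pre-scaled one is not, which is the "contributes more than once" effect described above.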
It’s just a hunch, but I think it might be a bug…
It doesn’t seem to have been reported as an issue yet.
Opened 02:04 PM · 15 Apr 2025 (UTC) · label: bug
### System Info
- `transformers` version: 4.52.0.dev0
- Platform: Linux-5.15.0-1030-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes, using `accelerate launch`
- Using GPU in script?: Yes
- GPU type: NVIDIA H100 80GB HBM3
### Who can help?
@ArthurZucker
### Information
- [x] The official example scripts
- [x] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
### Reproduction
Run `accelerate launch try_llama4.py`, where `try_llama4.py` is:
```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",
    torch_dtype="auto",
)
model.eval()
print("LOADED")

outputs = model.generate(**inputs.to(model.device), max_new_tokens=1)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```
### Expected behavior
Expected: the model's response is printed.
What actually happens when I try to run this on an 8x (NVIDIA H100 80GB HBM3) node:
The model has no problem loading with around 50GB per GPU, which leaves plenty of space for a short single generation.
However, I'm encountering CUDA OOM during generation.
This seems to be related to `CompressedLinear` permanently converting the FP8 weights to BF16, which of course causes the OOM.
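For a rough sense of the numbers (back-of-the-envelope only, assuming roughly 400B total parameters for Maverick and ignoring activations and the KV cache), converting the weights from FP8 to BF16 doubles the per-GPU footprint:

```python
# Back-of-the-envelope only; assumes ~400B total parameters for Maverick and
# ignores activations, KV cache, and non-weight buffers.
total_params = 400e9
num_gpus = 8

fp8_gb_per_gpu = total_params * 1 / 1e9 / num_gpus   # ~50 GB/GPU: matches what loading shows
bf16_gb_per_gpu = total_params * 2 / 1e9 / num_gpus  # ~100 GB/GPU: over the 80 GB of an H100

print(f"FP8 weights per GPU:  ~{fp8_gb_per_gpu:.0f} GB")
print(f"BF16 weights per GPU: ~{bf16_gb_per_gpu:.0f} GB")
```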
BTW, I also tried to install the "kernels" package, but then I get a different error:
`AttributeError: 'SequentialLlama4TextExperts' object has no attribute 'gate_up_proj'`
which seems to be related to the fact that in the FP8 version of Maverick the expert weights are stored separately for each expert, and the `Llama4TextMoe` kernel from the Hub doesn't support that layout.
Thanks!