Here, in the Llama4TextMoe forward pass, the router scores are applied before the experts layer (that is, before the projections and the non-linearity), whereas scores are usually applied after the whole expert block has been processed. As a result, the scores contribute to the output more than once. Is this an intentional architectural change, or just a bug in the HF implementation? I couldn't find any mention of it.
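To make the difference concrete, here is a toy comparison (a minimal sketch, not the actual HF Llama 4 code; the single expert and the dimensions are made up) of applying the router score to the expert input versus the expert output. Because the expert is a gated MLP, a score applied to the input passes through both the gate and the up projection, so its effect on the output is no longer a single linear factor:

```python
import torch
import torch.nn.functional as F

# Toy single-expert sketch (NOT the actual HF Llama 4 code); dimensions are made up.
torch.manual_seed(0)
d_model, d_ff = 8, 16
gate_proj = torch.nn.Linear(d_model, d_ff, bias=False)
up_proj = torch.nn.Linear(d_model, d_ff, bias=False)
down_proj = torch.nn.Linear(d_ff, d_model, bias=False)

def expert(x):
    # Standard gated (SwiGLU-style) MLP: down(silu(gate(x)) * up(x))
    return down_proj(F.silu(gate_proj(x)) * up_proj(x))

x = torch.randn(1, d_model)
score = 0.3  # hypothetical router score for this token

out_pre = expert(score * x)   # score applied to the expert *input* (as in the question)
out_post = score * expert(x)  # conventional MoE: score applied to the expert *output*

# Post-scaling rescales the output by exactly `score`; pre-scaling pushes the
# score through both gate_proj and up_proj, so it enters the gated product
# twice (modulated by the SiLU non-linearity) and the results differ:
print((out_post.norm() / expert(x).norm()).item())  # exactly 0.3
print((out_pre.norm() / expert(x).norm()).item())   # not 0.3
```

In the toy case the post-scaled output is rescaled by exactly the score, while the pre-scaled one is not, which is the "contributes more than once" effect described above.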
It’s just a hunch, but I think it might be a bug…
It doesn’t seem to have been reported as an issue yet.
Opened 02:04 PM · 15 Apr 2025 (UTC) · label: bug
### System Info
- `transformers` version: 4.52.0.dev0
- Platform: Linux-5.15.0-1030-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes, using `accelerate launch`
- Using GPU in script?: Yes
- GPU type: NVIDIA H100 80GB HBM3
### Who can help?
@ArthurZucker
### Information
- [x] The official example scripts
- [x] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
### Reproduction
Run `accelerate launch try_llama4.py`, where `try_llama4.py` is:
```python
from transformers import AutoTokenizer, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",
    torch_dtype="auto",
)
model.eval()
print("LOADED")

outputs = model.generate(**inputs.to(model.device), max_new_tokens=1)
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```
### Expected behavior
Expected: the model's response is printed.
What actually happens when I try to run this on an 8x (NVIDIA H100 80GB HBM3) node:
The model has no problem loading with around 50GB per GPU, which leaves plenty of space for a short single generation.
However, I'm encountering CUDA OOM during generation.
This seems to be related to `CompressedLinear` permanently converting the FP8 weights to BF16, which of course causes the OOM.
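For a rough sense of the numbers (back-of-the-envelope only, assuming roughly 400B total parameters for Maverick and ignoring activations and the KV cache), converting the weights from FP8 to BF16 doubles the per-GPU footprint:

```python
# Back-of-the-envelope only; assumes ~400B total parameters for Maverick and
# ignores activations, KV cache, and non-weight buffers.
total_params = 400e9
num_gpus = 8

fp8_gb_per_gpu = total_params * 1 / 1e9 / num_gpus   # ~50 GB/GPU: matches what loading shows
bf16_gb_per_gpu = total_params * 2 / 1e9 / num_gpus  # ~100 GB/GPU: over the 80 GB of an H100

print(f"FP8 weights per GPU:  ~{fp8_gb_per_gpu:.0f} GB")
print(f"BF16 weights per GPU: ~{bf16_gb_per_gpu:.0f} GB")
```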
BTW, I also tried to install the "kernels" package, but then I get a different error:
`AttributeError: 'SequentialLlama4TextExperts' object has no attribute 'gate_up_proj'`
which seems to be related to the fact that in the FP8 version of Maverick the expert weights are stored separately for each expert, and the `Llama4TextMoe` kernel from the Hub doesn't support that layout.
Thanks!