Mixtral batch inference, or fast inference in general

Hi,

The pipeline is not ideal for batched generation; it’s better to use the AutoModelForCausalLM class yourself, as explained here: How to generate texts in huggingface in a batch way? · Issue #10704 · huggingface/transformers · GitHub.
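For reference, here is a minimal sketch of what batched generation with AutoModelForCausalLM can look like (the model ID, prompts, and generation settings below are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # any causal LM works here

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Decoder-only models need left padding so generation continues from the prompt.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = ["The capital of France is", "Write a haiku about GPUs:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# Batched generation; the attention mask tells the model to ignore padding tokens.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

The important bits are left padding and passing the attention mask, otherwise the shorter prompts in the batch will produce garbage.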

We also recently added some new documentation around generation with LLMs: Generation with LLMs. It includes a section on batched generation.

We also just updated the Mixtral docs, since using Flash Attention gives a big performance boost :slight_smile: Mixtral. However, note that all of this still runs in plain Python.
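In recent Transformers versions, Flash Attention 2 can be enabled at load time roughly like this (a sketch, assuming the flash-attn package is installed and you have a supported GPU; the dtype has to be fp16 or bf16):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # falls back with an error if flash-attn isn't available
    device_map="auto",
)
```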

If you want to put LLMs in production, you typically don’t use plain Transformers, but rather frameworks such as: