Hi,
The pipeline is not ideal for batched generation; it's better to use the AutoModelForCausalLM
class yourself, as explained here: How to generate texts in huggingface in a batch way? · Issue #10704 · huggingface/transformers · GitHub.
We also recently added new documentation on this topic: Generation with LLMs, which includes a section on batched generation.
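Here's a minimal sketch of what batched generation with AutoModelForCausalLM looks like (the checkpoint and prompts below are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint

# Decoder-only models need left padding for batched generation,
# and usually have no pad token, so we reuse the EOS token.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = ["The capital of France is", "The largest planet in our solar system is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```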
We also just updated the Mixtral docs, since using Flash Attention gives you a big boost in performance: Mixtral. Note, however, that all of this is still done in plain Python.
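As a rough sketch, enabling Flash Attention 2 when loading Mixtral looks something like this (it assumes the flash-attn package is installed and you're on a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # enable Flash Attention 2
    device_map="auto",
)

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```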
If you want to put LLMs in production, you typically don't use plain Transformers, but rather dedicated serving frameworks such as:
- TGI: Text Generation Inference
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs (vllm-project/vllm on GitHub); see the sketch after this list.
- TensorRT-LLM: NVIDIA's library with an easy-to-use Python API to define LLMs and build TensorRT engines with state-of-the-art optimizations for efficient inference on NVIDIA GPUs, plus Python and C++ runtimes to execute those engines (NVIDIA/TensorRT-LLM on GitHub).
- SGLang: a structured generation language for LLMs that makes your interaction with models faster and more controllable (sgl-project/sglang on GitHub).
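As an example, offline batched inference with vLLM looks roughly like this (placeholder checkpoint, and it assumes vLLM is installed with a compatible GPU):

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "The largest planet in our solar system is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# vLLM handles continuous batching and paged attention under the hood.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder checkpoint
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```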