Hi,
The pipeline is not ideal for batched generation; it's better to use the AutoModelForCausalLM
class yourself, as explained here: How to generate texts in huggingface in a batch way? · Issue #10704 · huggingface/transformers · GitHub.
We also recently added new documentation on this topic: Generation with LLMs, which includes a section on batched generation.
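Here's a minimal sketch of what batched generation with AutoModelForCausalLM looks like (the checkpoint and prompts below are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint

# Decoder-only models need left padding for batched generation,
# and usually have no pad token, so we reuse the EOS token.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompts = ["The capital of France is", "The largest planet in our solar system is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```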
We also just updated the Mixtral docs, since using Flash Attention gives you a big boost in performance: Mixtral. Note, however, that all of this is still done in plain Python.
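As a rough sketch, enabling Flash Attention 2 when loading Mixtral looks something like this (it assumes the flash-attn package is installed and you're on a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # enable Flash Attention 2
    device_map="auto",
)

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```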
If you want to put LLMs in production, you typically don't use plain Transformers, but rather dedicated serving frameworks such as:
- TGI: Text Generation Inference
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs (vllm-project/vllm on GitHub); see the sketch after this list.
- TensorRT-LLM: NVIDIA's library with an easy-to-use Python API to define LLMs and build TensorRT engines with state-of-the-art optimizations for efficient inference on NVIDIA GPUs, plus Python and C++ runtimes to execute those engines (NVIDIA/TensorRT-LLM on GitHub).
- SGLang: a structured generation language for LLMs that makes your interaction with models faster and more controllable (sgl-project/sglang on GitHub).
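As an example, offline batched inference with vLLM looks roughly like this (placeholder checkpoint, and it assumes vLLM is installed with a compatible GPU):

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "The largest planet in our solar system is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# vLLM handles continuous batching and paged attention under the hood.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder checkpoint
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```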