Mixtral batch inference (or fast inference in general)

Thank you for the useful example!

It is still not working as expected. For example, the `generate` method doesn't have an option to output only the newly generated tokens.
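
For reference, the workaround I've seen is to slice the returned sequences by the prompt length, since `generate` returns prompt plus completion for decoder-only models. A minimal sketch, assuming a Mixtral checkpoint loaded via `transformers` (the model name, padding setup, and generation settings here are placeholders, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder checkpoint

# Left padding so all prompts end at the same position before generation starts.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Mixtral tokenizer has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Explain mixture-of-experts in one sentence.",
    "What is batch inference?",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns prompt + completion, so drop the first
# input_ids.shape[1] tokens of every row to keep only the new ones.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
```

It would be nicer if the library exposed this directly, which is what I was hoping for.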

My usage is mostly research-based: running LLMs locally on a cluster with multiple GPUs, mostly for inference but also for fine-tuning. Are there any recommendations for that case?

Are there any supporting packages or useful repos for research environments?
