Mixtral batch inference (or fast inference in general)

Thank you for the useful example!

It is still not working as expected. For example, the `generate` method doesn't have an option to output only the newly generated tokens.
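
For reference, the workaround I've seen is to slice the returned sequences by the prompt length, since `generate` returns prompt plus completion for decoder-only models. A minimal sketch, assuming a Mixtral checkpoint loaded via `transformers` (the model name, padding setup, and generation settings here are placeholders, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder checkpoint

# Left padding so all prompts end at the same position before generation starts.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Mixtral tokenizer has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Explain mixture-of-experts in one sentence.",
    "What is batch inference?",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns prompt + completion, so drop the first
# input_ids.shape[1] tokens of every row to keep only the new ones.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
```

It would be nicer if the library exposed this directly, which is what I was hoping for.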

My usage is mostly research-based: running LLMs locally on a cluster with multiple GPUs, mostly for inference but also for fine-tuning. Are there any recommendations for that case?

Are there any supporting packages or useful repos for research environments?
