Mixtral batch inference, or fast inference in general

I am struggling to figure out how to run batch inference with a Mixtral model in a typical high-performance GPU setup.

Here is my current implementation:

import transformers
from torch.utils.data import DataLoader

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, padding=True)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id)  # other loading arguments omitted here
pipeline = transformers.pipeline(
    task="text-generation",  # required when passing a model object instead of a model id
    model=model, tokenizer=tokenizer,
    return_full_text=False,  # if using langchain set True
    # we pass model parameters here too
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    top_p=0.15,  # select from top tokens whose probabilities add up to 15%
    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
    max_new_tokens=4096,  # max number of tokens to generate in the output
    repetition_penalty=1.1,  # if output begins repeating increase
)

def process_dataset(data_loader, pipe, output_parser):
    results = []
    for i, batch in enumerate(data_loader):
        output = pipe(batch['prompt'])
        for o in output:
            try:
                text = o[0]['generated_text']  # the first (or only) generated text
                parsed_outputs = output_parser.parse(text).pairs
            except Exception:
                # questions_list is defined elsewhere
                parsed_outputs = {question: float('nan') for question in questions_list}
            results.append(parsed_outputs)  # one entry per prompt
    return results

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=False)
train_results = process_dataset(train_loader, pipeline, output_parser)

Assume train_dataset is a PyTorch Dataset object that builds the prompts on initialization and keeps them in memory; its __getitem__ method returns the requested prompts.

This code works, but I don't think it performs batch inference the way it should.

Any suggestions on what I'm doing wrong?


The pipeline is not ideal for batched generation; it's better to use the AutoModelForCausalLM class directly, as explained in huggingface/transformers issue #10704, "How to generate texts in huggingface in a batch way?".
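
As a rough illustration of that approach, here is a minimal sketch of batched generation that calls the model and tokenizer directly. The helper names batched and generate_batch are hypothetical, and the sketch assumes model and tokenizer were loaded with AutoModelForCausalLM and AutoTokenizer. It left-pads the prompts (the usual choice for decoder-only models) and reuses the EOS token for padding when the tokenizer has no pad token, which is a common workaround rather than the only option.

```python
def batched(items, batch_size):
    """Yield consecutive chunks of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_batch(model, tokenizer, prompts, max_new_tokens=256):
    """Run generate() once on a padded batch and return the decoded sequences."""
    # Left padding keeps every prompt adjacent to the tokens generated after it.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        # Assumption: reusing EOS as the pad token is acceptable for inference.
        tokenizer.pad_token = tokenizer.eos_token

    # Tokenize all prompts together so they share one padded tensor.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

    # One generate() call processes the whole batch.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Note: for decoder-only models the decoded text includes the prompt.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With a loaded model and tokenizer, a loop such as `for chunk in batched(prompts, 8): results.extend(generate_batch(model, tokenizer, chunk))` could stand in for the pipeline call in the question. The batch size of 8 is only carried over from the DataLoader above; the right value depends on available GPU memory.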

We also recently added new documentation on generation with LLMs ("Generation with LLMs"), which includes a section on batched generation.

We also just updated the Mixtral docs, since using Flash Attention gives a big boost in performance. However, note that all of this is still done in Python.

If you want to put LLMs in production, then one typically doesn’t use plain Transformers, but rather frameworks such as:

Thank you for the useful example!

It is still not working as expected. For example, the generate method doesn't have a way to output only the newly generated tokens.
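
One common workaround, sketched here under the assumption of a decoder-only model such as Mixtral: generate returns each prompt followed by its new tokens, and every prompt in a padded batch has the same length, so slicing off the first prompt_length positions leaves only the completion. The helper name keep_new_tokens is hypothetical; prompt_length would be inputs["input_ids"].shape[1] from the tokenized batch.

```python
def keep_new_tokens(output_ids, prompt_length):
    """Drop the first `prompt_length` positions of every generated sequence.

    Works on nested lists and on a 2-D tensor of token ids alike, since it
    only iterates over rows and slices each one. For a tensor, the
    equivalent one-liner is output_ids[:, prompt_length:].
    """
    return [row[prompt_length:] for row in output_ids]


# Toy ids: the first three positions of each row stand in for the padded prompt.
example = [[0, 0, 5, 6, 7], [1, 2, 3, 8, 9]]
print(keep_new_tokens(example, 3))  # [[6, 7], [8, 9]]
```

The result can then be passed to tokenizer.batch_decode(..., skip_special_tokens=True) to get text that contains only the generated part.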

My usage is mostly research-based. Are there any recommendations for that case, that is, running LLMs locally, mostly for inference but also fine-tuning, on a cluster with multiple GPUs?

Are there any supporting packages or useful repos for research environments?
