I'm struggling to figure out how to run batch inference with a Mixtral model in a typical high-performance GPU setup.
Here is my current implementation:
import torch
import transformers

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # the tokenizer has no pad token by default
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    pad_token_id=tokenizer.eos_token_id,
)
pipeline = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    return_full_text=False,  # if using langchain set True
    # generation parameters are passed here too
    temperature=0.1,         # 'randomness' of outputs; lower is more deterministic
    top_p=0.15,              # sample from top tokens whose cumulative probability adds up to 15%
    top_k=0,                 # 0 disables top-k, so sampling relies on top_p
    max_new_tokens=4096,     # max number of tokens to generate in the output
    repetition_penalty=1.1,  # increase if the output begins repeating
)
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=False)
def process_dataset(data_loader, pipe, output_parser):
    results = []
    for batch in data_loader:
        # the pipeline receives the whole DataLoader batch as a list of prompts
        outputs = pipe(batch['prompt'])
        for o in outputs:
            try:
                text = o[0]['generated_text']  # first (or only) generated sequence
                parsed = output_parser.parse(text).pairs
            except Exception:
                parsed = {question: float('nan') for question in questions_list}
            results.append(parsed)
    return results
train_results = process_dataset(train_loader, pipeline, output_parser)
Assume train_dataset is a PyTorch Dataset object that builds all the prompts in memory on initialization; its __getitem__ then returns the prompt for a given index.
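For context, the dataset looks roughly like this (a simplified sketch; PromptDataset, records, and build_prompt are placeholders for my actual code):

from torch.utils.data import Dataset

class PromptDataset(Dataset):
    # Simplified stand-in: all prompts are built up front and kept in memory.
    def __init__(self, records):
        self.prompts = [self.build_prompt(r) for r in records]

    def build_prompt(self, record):
        # placeholder for the real prompt construction
        return f"[INST] {record} [/INST]"

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # each item is a dict, so a DataLoader batch exposes a 'prompt' key
        return {'prompt': self.prompts[idx]}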
The code runs, but I don't think it actually does batched inference the way it should.
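For reference, my understanding from the pipeline docs is that batching requires a pad token on the tokenizer and a batch_size on the pipeline call, roughly like the sketch below (left padding and the call-time batch_size are my assumptions from the docs, not something I've verified):

# Rough sketch of what I think batched pipeline inference should look like
# (the pad token is already set on the tokenizer above):
tokenizer.padding_side = "left"  # decoder-only models are usually padded on the left

prompts = [train_dataset[i]['prompt'] for i in range(len(train_dataset))]
outputs = pipeline(prompts, batch_size=8)  # the pipeline pads and batches internally
texts = [o[0]['generated_text'] for o in outputs]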
Any suggestions on what I'm doing wrong?