Batch inference using open source LLMs

Hey, I am trying to perform batch inference using oasst-sft-7-llama-30b (the Open Assistant model, though I don't think the issue is specific to this model), and I cannot get it to work with a batch size greater than 1: if I set the batch size to more than 1, it just outputs low-quality text (compared to batch=1). Here is the code that I use:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm import tqdm

load_8bit: bool = True
base_model: str = "/storage/oasst-sft-7-llama-30b"
prompt_template: str = "oasst"

# Load the model in 8-bit (requires bitsandbytes) and shard it across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=load_8bit,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# device=0 is dropped here because the model is already placed via device_map="auto"
generator = pipeline('text-generation', max_length=512, model=model, tokenizer=tokenizer,
                     batch_size=4, num_beams=1, top_k=40, top_p=0.1, temperature=0.0)

# dataset is a datasets.Dataset with a "text" column
for outputs in tqdm(generator(KeyDataset(dataset, "text"))):
    print([out["generated_text"] for out in outputs])
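For reference, batched generation with decoder-only models generally assumes left padding and an explicit pad token, and the LLaMA tokenizer does not define a pad token by default. Below is a minimal sketch of that tokenizer setup, reusing the EOS token as the pad token; this is an assumption for illustration, not a confirmed fix for the quality drop:

# Sketch: configure padding before building the pipeline (reusing EOS as pad token is an assumption)
tokenizer.padding_side = "left"  # decoder-only models are padded on the left for generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id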

Hi @galprz!

Did you find a solution for this?