Hi!
I am trying to run many prompts and generate just one new token for each.
I am doing it like this:
from tqdm import tqdm

samples = 100
batch_size = 20
for start in tqdm(range(0, samples, batch_size)):
    # df.loc slicing is end-inclusive, so this selects exactly batch_size rows
    batch_prompts = df.loc[start:start + batch_size - 1, 'prompt'].tolist()
    inputs = tokenizer(batch_prompts, padding=True, truncation=True, return_tensors="pt").to(device)
    # generate a single new token per prompt
    outputs = model.generate(**inputs, max_new_tokens=1)
    # batch_decode already returns a list, one string per prompt
    df.loc[start:start + batch_size - 1, 'answer_batch'] = tokenizer.batch_decode(outputs)
This is slow: I only get ~0.5 tokens/sec.
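Since I only need one new token per prompt, I also considered skipping `generate` entirely and doing a plain forward pass plus argmax over the last position's logits. Here is a minimal sketch of that idea; the random `logits` tensor is a stand-in for what a real causal LM would return via `model(**inputs).logits`, and the shapes (20 prompts, 12 tokens, vocab 50257) are just illustrative:

```python
import torch

# Stand-in for logits = model(**inputs).logits from a causal LM.
# Shape convention: (batch, seq_len, vocab_size).
batch, seq_len, vocab = 20, 12, 50257
logits = torch.randn(batch, seq_len, vocab)

# Greedy next token: take each sequence's logits at the last position
# and argmax over the vocabulary. With left padding, the last position
# is the true end of every prompt in the batch.
next_token_ids = logits[:, -1, :].argmax(dim=-1)  # shape: (batch,)

# Each id could then be decoded with tokenizer.batch_decode(next_token_ids.unsqueeze(-1))
print(next_token_ids.shape)
```

Not sure whether this is actually faster than `generate` with `max_new_tokens=1`, though, or whether I'm missing something about the batching.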
But when I generate 100 tokens from a single prompt like this:
inputs = tokenizer(
    "Orcas were not known to be drawn to mistral energy, but they were seen recently ",
    return_tensors="pt", padding=True).to(device)
outputs = model.generate(
    **inputs, max_new_tokens=100, use_cache=True, do_sample=False)
text = tokenizer.batch_decode(outputs)
I get around 8 tokens/sec.
So is there a way to process many one-token-generation prompts faster?
Thanks