Speed when running many prompts

I am trying to run many prompts and generate just one new token for each.

I am doing it like this:

from tqdm import tqdm  # progress bar over batches

samples = 100
batch_size = 20
for start in tqdm(range(0, samples, batch_size)):
    # .loc slicing is label-inclusive, so the end label is start + batch_size - 1
    batch_prompts = df.loc[start:start + batch_size - 1, 'prompt'].tolist()
    inputs = tokenizer(batch_prompts, padding=True, truncation=True, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=1)
    # batch_decode already returns a list; keep only the newly generated token
    df.loc[start:start + batch_size - 1, 'answer_batch'] = tokenizer.batch_decode(outputs[:, -1:])
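Side note: since only one new token is needed per prompt, `generate()` may be unnecessary overhead; a single forward pass plus an argmax over the last position's logits should give the same greedy token. Below is a minimal sketch of that idea. It assumes left padding (so position -1 is a real token for every prompt in the batch), and a tiny NumPy array stands in for the model's logits so the snippet is self-contained:

```python
import numpy as np

def greedy_next_tokens(logits: np.ndarray) -> np.ndarray:
    """Pick the greedy next token for each prompt in a batch.

    logits: (batch, seq_len, vocab), as returned by one forward pass.
    With LEFT padding, position -1 is the true last token of every
    prompt, so its logits predict the next token.
    """
    last = logits[:, -1, :]        # (batch, vocab)
    return last.argmax(axis=-1)    # (batch,)

# Toy stand-in for the logits of a forward pass: batch=2, seq_len=4, vocab=5
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))
next_ids = greedy_next_tokens(logits)
print(next_ids.shape)  # one token id per prompt
```

With transformers, the equivalent would presumably be `model(**inputs).logits[:, -1].argmax(-1)` after setting `tokenizer.padding_side = "left"`, then `tokenizer.batch_decode` on the resulting ids — though I haven't benchmarked whether this alone explains the gap.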

This is slow: only ~0.5 tokens/sec.

But when I just generate 100 tokens like this:

inputs = tokenizer(
    "Orcas were not known to be drawn to mistral energy, but they were seen recently ",
    return_tensors="pt",
).to(device)
outputs = model.generate(
    **inputs, max_new_tokens=100, use_cache=True, do_sample=False)
text = tokenizer.batch_decode(outputs)

It gives me around 8 tokens/sec.

So, is there a way to run many one-token generations faster?