Multiple responses with async generate in TGI

I am using the generate() function from Text Generation Inference's AsyncClient to query a bunch of models. If I want to generate more than one candidate response for a given prompt, how do I do that?
I see an n parameter (number of responses to generate) in the chat() function but none in the generate() function.
The naive method I am working with right now is to just repeat the same prompt multiple times, but I am wondering if there is a better way?
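For reference, my naive approach looks roughly like this (a sketch, assuming client is a text_generation.AsyncClient pointed at my TGI server):

```python
import asyncio

async def generate_n(client, prompt, n=4):
    """Fire the same prompt at the server n times concurrently."""
    tasks = [
        client.generate(prompt, do_sample=True, max_new_tokens=128)
        for _ in range(n)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.generated_text for r in responses]

# candidates = await generate_n(client, "Tell me a joke.", n=4)
```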
On a related note, what is the difference between using these two functions?

Hey @shaily99,

For the generate function, you can pass the best_of arg to the TGI client and that’ll make it return multiple candidates given your input prompt. For example:

response = await client.generate(
    prompt,
    do_sample=True,
    best_of=4,
)

# this output is the highest probability response, which HF assumes is the "best"
best = response.generated_text

# you can get the other candidates like this
other_candidates = [seq.generated_text for seq in response.details.best_of_sequences]

# and just combine all of them
all_candidates = [best] + other_candidates

By default, TGI only lets you generate up to 2 candidates. But when you’re starting up TGI, you can pass the --max-best-of arg (reference in the docs) if you want more. For example:

docker run \
    --rm \
    -it \
    --gpus '"device=0"' \
    -p $port \
    ghcr.io/huggingface/text-generation-inference:2.0.1 \
    --model-id $model_path \
    --sharded false \
    --dtype bfloat16 \
    --max-best-of 4  # <-- set this to whatever you want

On a related note, what is the difference between using these two functions?

I’ve never used the chat function before and didn’t even know it existed until now, but it looks like it’s intended specifically for chat-based models where you have dialog turns: you pass in a list of Message objects, and the TGI code handles formatting them into the prompt string given to the model.

In contrast, the generate function expects the prompt to already be formatted correctly and ready to go.

So generate is more general purpose than chat, but it requires a little extra work on your end because you have to format the inputs correctly yourself.
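To make that concrete, here's a rough sketch of the difference. The [INST] template below is hypothetical, just for illustration; real chat models each define their own template, and chat() applies the model's actual template for you server-side:

```python
messages = [{"role": "user", "content": "What is 2 + 2?"}]

def format_prompt(messages):
    # hypothetical manual formatting you'd do before calling generate();
    # the exact template depends on the model you're serving
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        else:
            parts.append(m["content"])
    return "".join(parts)

# with chat(): TGI formats the messages into the prompt for you
#   response = await client.chat(messages=messages)
# with generate(): you apply the template yourself
#   response = await client.generate(format_prompt(messages))
print(format_prompt(messages))
```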