Multiple responses with async generate in TGI

I am using the generate() function from Text Generation Inference's AsyncClient to query a bunch of models. If I want to generate more than one candidate response for a given prompt, how do I do that?
I see an n parameter (number of responses to generate) in the chat() function but none in the generate() function.
The naive method I am working with right now is to just repeat the same prompt multiple times, but I am wondering if there is a better way?
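For reference, my naive approach looks roughly like this (a sketch, assuming client is a text_generation.AsyncClient pointed at my TGI server):

```python
import asyncio

async def generate_n(client, prompt, n=4):
    """Fire the same prompt at the server n times concurrently."""
    tasks = [
        client.generate(prompt, do_sample=True, max_new_tokens=128)
        for _ in range(n)
    ]
    responses = await asyncio.gather(*tasks)
    return [r.generated_text for r in responses]

# candidates = await generate_n(client, "Tell me a joke.", n=4)
```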
On a related note, what is the difference between using these two functions?

Hey @shaily99,

For the generate function, you can pass the best_of arg to the TGI client and that’ll make it return multiple candidates given your input prompt. For example:

response = await client.generate(
    prompt,
    do_sample=True,
    best_of=4,
)

# this output is the highest probability response, which HF assumes is the "best"
best = response.generated_text

# you can get the other candidates like this
other_candidates = [seq.generated_text for seq in response.details.best_of_sequences]

# and just combine all of them
all_candidates = [best] + other_candidates

By default, TGI only lets you generate up to 2 candidates. But when you’re starting up TGI, you can pass the --max-best-of arg (reference in the docs) if you want more. For example:

docker run \
    --rm \
    -it \
    --gpus '"device=0"' \
    -p $port \
    ghcr.io/huggingface/text-generation-inference:2.0.1 \
    --model-id $model_path \
    --sharded false \
    --dtype bfloat16 \
    --max-best-of 4  # <-- set this to whatever you want

On a related note, what is the difference between using these two functions?

I’ve never used the chat function before and didn’t even know it existed until now, but it looks like it’s intended specifically for chat-based models where you have dialog turns: you pass in a list of Message objects, and the TGI code handles formatting them into the prompt string given to the model.

In contrast, the generate function expects the prompt to already be formatted correctly and ready to go.

So generate is more general purpose than chat, but it requires a little extra work on your end because you have to format the inputs correctly yourself.
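To make that concrete, here's a rough sketch of the difference. The [INST] template below is hypothetical, just for illustration; real chat models each define their own template, and chat() applies the model's actual template for you server-side:

```python
messages = [{"role": "user", "content": "What is 2 + 2?"}]

def format_prompt(messages):
    # hypothetical manual formatting you'd do before calling generate();
    # the exact template depends on the model you're serving
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        else:
            parts.append(m["content"])
    return "".join(parts)

# with chat(): TGI formats the messages into the prompt for you
#   response = await client.chat(messages=messages)
# with generate(): you apply the template yourself
#   response = await client.generate(format_prompt(messages))
print(format_prompt(messages))
```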