Unable to generate more than one token at a time using website API

piazzola · October 13, 2023, 5:55pm

I fine-tuned OPT 350M to create a model that extracts addresses from natural text. For example:

Input: The leased property is located at 3500 S Gessner Rd Ste 200, where the tenant will have access to the premises for the duration of the lease agreement.

Generated text: && 3500 S GESSNER RD STE 200 ...

I pushed my model to the HF hub here: piazzola/address-detection-model. When I use the model from the Python interpreter, a single call generates the full multi-token output. I would like people to be able to do the same using the Hosted Inference API widget on the model card page linked above. However, when I click the “Compute” button, only one token is generated per click, and I have to keep clicking to get the full expected continuation.

How can I change the behavior of the Hosted Inference API available on the model card, so that it generates more than one token at a time, like using the pipeline in the Python interpreter does?
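
For reference, here is a minimal sketch of how a multi-token request to the hosted API might look when sent programmatically rather than through the widget. It assumes the text-generation request format with an `inputs` string and a `parameters` object carrying `max_new_tokens`. The endpoint URL pattern, the `HF_TOKEN` environment variable name, the helper names, and the default of 40 tokens are all assumptions made for illustration, not something confirmed in this thread.

```python
import json
import os
import urllib.request

# Assumed endpoint pattern for the hosted Inference API; check the current Hub docs.
API_URL = "https://api-inference.huggingface.co/models/piazzola/address-detection-model"


def build_payload(text, max_new_tokens=40):
    """Build a text-generation request body that asks for several new tokens."""
    return {
        "inputs": text,
        "parameters": {"max_new_tokens": max_new_tokens},
    }


def query(text, token, max_new_tokens=40):
    """POST the payload to the API and return the decoded JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text, max_new_tokens)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # HF_TOKEN is a hypothetical variable name; nothing is sent without a token.
    token = os.environ.get("HF_TOKEN")
    if token:
        text = (
            "The leased property is located at 3500 S Gessner Rd Ste 200, "
            "where the tenant will have access to the premises for the "
            "duration of the lease agreement."
        )
        print(query(text, token))
```

For the widget itself, the Hub documentation on model card widgets describes an `inference: parameters:` block in the model card metadata for setting default generation parameters. Whether setting `max_new_tokens` there resolves the one-token-per-click behavior described above is an assumption that would need to be verified.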

michaklsa8 · November 29, 2023, 4:53pm

Thanks for the solution.
