I fine-tuned OPT-350M to build a model that extracts addresses from natural-language text. For example, given the input:

The leased property is located at 3500 S Gessner Rd Ste 200, where the tenant will have access to the premises for the duration of the lease agreement.

the model produces:

&& 3500 S GESSNER RD STE 200 ...
I pushed the model to the Hugging Face Hub: piazzola/address-detection-model. When I use it from a Python interpreter, I can generate many tokens in a single call and get the full output. I would like people to be able to do the same through the Hosted Inference API widget on the model card page linked above. However, when I click the “Compute” button, only one token is generated per click, and I have to keep clicking to get the full expected continuation.
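For context, this is roughly what I run locally (the exact generation parameters are from memory; `max_new_tokens=30` is just an illustrative value):

```python
from transformers import pipeline

# Load the fine-tuned model from the Hub, using the model id from the card.
generator = pipeline("text-generation", model="piazzola/address-detection-model")

text = (
    "The leased property is located at 3500 S Gessner Rd Ste 200, "
    "where the tenant will have access to the premises for the duration "
    "of the lease agreement."
)

# max_new_tokens lets generate() emit the whole continuation in one call,
# unlike the widget, which appears to add only one token per click.
result = generator(text, max_new_tokens=30)
print(result[0]["generated_text"])
```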
How can I change the behavior of the Hosted Inference API on the model card so that it generates more than one token at a time, the way the pipeline does in the Python interpreter?
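For what it's worth, the Hub docs mention that widget generation parameters can be set in the model card's YAML front matter. I assume something like the following at the top of the README would apply (the value 50 is just an example), but I haven't confirmed that it changes the widget's behavior:

```yaml
inference:
  parameters:
    max_new_tokens: 50
```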