Text Generation response truncation

I am having an issue where all responses are truncated, no matter which model or where they are hosted. Other questions on this topic have referenced required changes in langchain modules, but we are not using langchain.

For reference, I am getting the same truncated responses from:

- databricks/dolly-v2-3b run on Databricks ML Runtime 13.3 (56 GB RAM, 8 CPUs, GCP)
- databricks/dolly-v2-3b run on HF free inference using the huggingface_hub library
- databricks/dolly-v2-7b on a Hugging Face Inference Endpoint (Nvidia A10G) using the recommended sample input
- tiiuae/falcon-7b on a Hugging Face Inference Endpoint (Nvidia A10G) using the recommended sample input
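
On the Databricks side, I'm loading the model roughly like this (a minimal sketch following the dolly-v2-3b model card; the prompt and the max_new_tokens value below are just placeholders):

```python
import torch
from transformers import pipeline

# Load dolly-v2-3b with its custom instruct pipeline, as shown in the model card.
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# max_new_tokens is a placeholder value here.
res = generate_text(
    "Explain to me the difference between nuclear fission and fusion.",
    max_new_tokens=256,
)
print(res[0]["generated_text"])
```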

In the case of HF Inference Endpoints, I get the same truncation using Python requests as I do with the sample UI. The endpoint settings are the defaults (max input tokens: 1024, max tokens: 1025); screenshot of the settings below.
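
For reference, the request I'm sending from Python is shaped roughly like this (a sketch; the endpoint URL, token, and prompt are placeholders, and as far as I understand max_new_tokens is the parameter that caps the generation length):

```python
import requests

# Placeholder endpoint URL and token.
API_URL = "https://my-endpoint.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer hf_xxx",
    "Content-Type": "application/json",
}

payload = {
    "inputs": "Explain to me the difference between nuclear fission and fusion.",
    "parameters": {
        # As far as I can tell, this is the setting that should control how long
        # the generated text is allowed to be.
        "max_new_tokens": 512,
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
```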

This problem exists in other modes too. Here is a test on an Inference Endpoint, queried from Databricks (GCP) using requests, after setting the endpoint to text-to-text mode:

and the same, but after setting the endpoint to summarization mode:

Hey, @cCaldwell! Have you been able to solve this issue? I have started having the same problem while using the Llama 2 70B model.

Same issue with the 7B model.

I'm also facing the same issue :expressionless: on all the models I've tried.
Does anyone have a solution for this?

I am using chat completion with Microsoft's Semantic Kernel and get the same result: the responses are usually truncated at 100 tokens.

I have two questions: can I raise the limit above 100 tokens, and is there a way to programmatically detect that a response has been truncated?
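
For the second question, what I have in mind is something like the sketch below, using the huggingface_hub client directly rather than Semantic Kernel (the endpoint URL and token are placeholders, and I'm assuming finish_reason == "length" is what indicates the generation stopped at the token cap):

```python
from huggingface_hub import InferenceClient

# Placeholder endpoint URL and token.
client = InferenceClient(
    model="https://my-endpoint.endpoints.huggingface.cloud",
    token="hf_xxx",
)

out = client.text_generation(
    "Explain to me the difference between nuclear fission and fusion.",
    max_new_tokens=512,  # raise the cap above the ~100 tokens I'm seeing now
    details=True,        # ask for generation details along with the text
)

print(out.generated_text)

# If the generation stopped because it hit max_new_tokens rather than a natural
# end-of-sequence, the details should report finish_reason == "length".
if out.details is not None and out.details.finish_reason == "length":
    print("Response was truncated at the token limit.")
```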