I am having an issue where all responses are truncated, no matter which model I use or where it is hosted. Other questions on this topic reference required changes to LangChain modules, but we are not using LangChain.
For reference, I get the same truncated output from:
- databricks/dolly-v2-3b run on Databricks ML Runtime 13.3 (56 GB RAM, 8 CPUs, GCP)
- databricks/dolly-v2-3b run on the Hugging Face free Inference API via the huggingface_hub library (see the sketch after this list)
- databricks/dolly-v2-7b on a Hugging Face Inference Endpoint (Nvidia A10G) using the recommended sample input
- tiiuae/falcon-7b on a Hugging Face Inference Endpoint (Nvidia A10G) using the recommended sample input
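
A minimal sketch of the huggingface_hub call I am making (the token and prompt are placeholders, not my exact code):

```python
from huggingface_hub import InferenceClient

# Placeholder token; the real one comes from an environment variable.
client = InferenceClient(token="hf_xxx")

# Even when asking for a longer completion, the output comes back truncated.
output = client.text_generation(
    "Explain what Databricks Dolly is.",
    model="databricks/dolly-v2-3b",
    max_new_tokens=512,
)
print(output)
```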
In the case of the HF Inference Endpoints, I get the same truncation using Python `requests` as I do with the sample UI. The endpoint settings are the defaults: max input tokens: 1024, max tokens: 1025 (screenshot below).
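
The `requests` call against the endpoint looks roughly like this (the endpoint URL and token are placeholders):

```python
import requests

# Placeholder endpoint URL and token.
API_URL = "https://<endpoint-name>.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": "Bearer hf_xxx",
    "Content-Type": "application/json",
}

payload = {
    "inputs": "Explain what Databricks Dolly is.",
    # Explicitly request more tokens; the response is still cut off.
    "parameters": {"max_new_tokens": 512},
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json())
```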