Hello everyone, does anyone know the best way to stop text generation from an LLM when running it locally using Hugging Face? I’m not referring to setting a strict MAX TOKEN limit, but rather stopping it more naturally. It’s impressive how ChatGPT or Claude halt their generation in a smart way. How can this be achieved when running an LLM locally?
Parameters like max_new_tokens are a hard cutoff, which often leads either to the answer being cut off mid-sentence or to the model rambling on. Ideally, the model would stop on its own once it has adequately answered the user's question.
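For context, here is roughly what I'm running at the moment, just a minimal sketch (the model name and prompt are placeholders, I see the same behaviour with other causal LMs). The only stopping control is the fixed token budget:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model -- any local causal LM shows the same issue for me
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain what a stopping criterion is in one short paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The only stopping control here is the hard token budget:
# too small and the answer gets cut off, too large and generation keeps going.
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

What I'm after is a way to make this stop "naturally" at the end of a complete answer rather than at an arbitrary token count.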
Any suggestions, or a working demo?
Thank you!