How to limit response to generated output only? Using ChatML

I’m new to using ChatML, but I have successfully generated the response that I want.

However, I am struggling in my attempts to limit the response to the generated content.

Example:

<|im_start|>system
You are a helpful robot.
<|im_end|>
<|im_start|>user
Mary had a little ...
<|im_end|>
<|im_start|>assistant

… will produce a response that looks something like:

<|im_start|>system
You are a helpful robot.
<|im_end|>
<|im_start|>user
Mary had a little ...
<|im_end|>
<|im_start|>assistant
Mary had a little lamb.

What I am hoping for is a response limited to only the generated output, without any of the input, e.g.:

lamb.

I can, of course, strip this out after I get the response back, but for a variety of reasons I’d like to limit the response itself if possible (mostly because the input context is massive). I’ve found a few suggestions online, but nothing seems to be working.

Also, I haven’t been able to find any definitive overview or documentation regarding ChatML… so maybe this is already clarified in docs that I’m just not aware of?

I have since learned this is a known bug in the specific model I am using.

I still haven’t found a workaround, but I will update this thread if I do.

I set return_full_text=False and it worked.
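For reference, here is a minimal sketch of how that setting is used, assuming the Hugging Face transformers text-generation pipeline (the model name below is just a placeholder):

from transformers import pipeline

# "your-model-here" is a placeholder; substitute the ChatML-trained model you are using.
generator = pipeline("text-generation", model="your-model-here")

prompt = (
    "<|im_start|>system\nYou are a helpful robot.\n<|im_end|>\n"
    "<|im_start|>user\nMary had a little ...\n<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# return_full_text=False tells the pipeline to drop the input prompt
# and return only the newly generated text.
result = generator(prompt, max_new_tokens=50, return_full_text=False)
print(result[0]["generated_text"])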

Do you think it is possible to set a limit on the number of characters in the response? For example, what if you set a limit of 10,000 characters? I’ll try to solve this problem too, and if it works, I’ll post an answer.
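One possible approach (my own sketch, not something confirmed in this thread) is a custom StoppingCriteria from transformers that decodes the new tokens as they are generated and halts once a character budget is reached. The model name is again a placeholder:

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class CharLimit(StoppingCriteria):
    """Stop generating once the decoded output reaches max_chars characters."""

    def __init__(self, tokenizer, prompt_token_count, max_chars=10_000):
        self.tokenizer = tokenizer
        self.prompt_token_count = prompt_token_count
        self.max_chars = max_chars

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the tokens produced after the prompt and compare
        # their character count against the budget.
        new_tokens = input_ids[0, self.prompt_token_count:]
        text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
        return len(text) >= self.max_chars

tokenizer = AutoTokenizer.from_pretrained("your-model-here")  # placeholder
model = AutoModelForCausalLM.from_pretrained("your-model-here")  # placeholder

inputs = tokenizer("Mary had a little ...", return_tensors="pt")
stopping = StoppingCriteriaList(
    [CharLimit(tokenizer, inputs["input_ids"].shape[1], max_chars=10_000)]
)
output_ids = model.generate(**inputs, max_new_tokens=4096, stopping_criteria=stopping)

Note that this stops at a token boundary, so the output can overshoot the limit by a few characters; exact truncation would still happen on your side.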

With chat models this is tricky. I have been trying to limit the generated output without having to reduce the maximum output tokens.

I have noticed that with instruct models you can add the instructions and limits you want directly to the prompt (see the example below); however, for base chat models I am still doing research on this.
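For example (my own wording, not taken from any model’s documentation), an instruct-style system message can carry the limit in the prompt itself:

<|im_start|>system
You are a helpful robot. Reply with only the completion of the user's text, in one short sentence.
<|im_end|>
<|im_start|>user
Mary had a little ...
<|im_end|>
<|im_start|>assistant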

I am testing this out with Granite models.