Structuring chat histories while preventing the chatbot from generating more than one response

I’m having a hard time getting a model to respond to a user prompt only once. I’m currently using mistral-7b-v1.0, but the issue is clearly in my dataset.

The issue clearly stems from the fact that my training data includes chat exchanges with multiple participants and often more than two messages, so the model has learned to keep producing follow-up messages.

I’ve added metadata to both the training data and my prompts to really drive home that the model should respond as if it were writing the last message in an exchange, but the extra metadata either confuses the model or does nothing at all, depending on whether do_sample is False or True, respectively.

Here is what the input_ids text might look like for a given training example (the dashed lines are just delimiters, not part of the input):

————
[Topic: this is to add context to longer conversations that are sliced up and to make connections across a variety of sundered training data inputs]

[Users: Alex, Marissa]

[Message Count: 4]

Alex (1): blah blah
Marissa (2): blah blah
Alex (3): blah blah
Marissa (4) (last):
————

And here is what a label may look like:

————
blah blah
[#some_relevant_hashtag]
————
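
In case the plumbing matters: each (input, label) pair gets turned into tensors roughly like the sketch below. This is a simplified stand-in rather than my exact code (the checkpoint id and helper name are placeholders, and EOS / end-of-turn handling is left out), but the key detail is that the prompt tokens are masked with -100 so the loss only covers the response:

————
from transformers import AutoTokenizer

# placeholder checkpoint id, not necessarily the exact one I'm training
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def build_example(prompt_text: str, response_text: str) -> dict:
    # prompt_text = the metadata block plus the chat history ending in
    # "Marissa (4) (last):"; response_text = the label shown above
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response_text, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids
    # loss is only computed on the response: prompt tokens are masked out
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
————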

And then at inference time we prompt the LLM with a message like this:

————
[Topic: blah blah]

[Users: SomeUsername, ChatBot]

[Message Count: 2]

SomeUsername (1): blah blah
ChatBot (2) (last):
————
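
Generation itself is nothing exotic. It looks roughly like this, with the argument values being representative rather than exact (do_sample is the flag I flip between the two behaviors described above):

————
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/my-finetuned-mistral"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).cuda()

prompt = (
    "[Topic: blah blah]\n\n"
    "[Users: SomeUsername, ChatBot]\n\n"
    "[Message Count: 2]\n\n"
    "SomeUsername (1): blah blah\n"
    "ChatBot (2) (last):"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,  # flipping this to False is the toggle mentioned above
    temperature=0.7,
)
# decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
————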

But after playing around with a bunch of approaches, including adding the message count, numbering the individual messages, and tagging the final username with “(last)”, I can’t get this darn thing to answer only once reliably. I’ve also never once gotten it to generate a hashtag, despite trying several variations: putting the tag before the AI response, using the format with and without the square brackets, and even using JSON strings instead of the screenplay-like format.

One annoying aspect is that I’m not sure whether I’m simply not using enough data, but I don’t want to run some massive test on a larger percentage of our data until I’m confident I’m not wasting VM time, since we’re on an Azure NC24, which is pricey. I want the format to work well with only 10K-40K examples, but is that unrealistic? Maybe I just need more data.

Obviously I could just slice off any extra responses in post-processing, but I don’t want to waste generation power on that; I’d rather it go toward making a single response better, not toward some follow-up response I’m just going to slice off anyway.
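
(To be concrete, “slicing off” would just be a post-processing step along these lines; the regex here is a hypothetical illustration, not code I actually run:)

————
import re

def keep_first_message(generated: str) -> str:
    # cut the output where the model starts writing another
    # "Username (N):" turn on a new line
    match = re.search(r"\n\s*\w+ \(\d+\)(?: \(last\))?:", generated)
    return generated[:match.start()].strip() if match else generated.strip()
————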

I know I could make my training data consist only of these 2-message exchanges, but I want the data to give the model a richer, deeper context to learn from, so I want to keep these longer exchanges.

The message count and the “(last)” tag were only added today, and I was hopeful they would help, but alas, nope. And I’m bummed I can’t generate a damn tag. I tried a simpler “Tag: some tag” format, but that didn’t work either; I suspected the model was reading it as a person talking rather than recognizing it as metadata, which is why I wanted a more distinctive string pattern. I also use two newline characters to separate metadata from chat data, but no dice.

What should I do?