Questions about ordering training inputs when fine-tuning models

Context:
I am working on a project where I am trying to build a chatbot trained on industry-specific data.
This is my first experience with anything touching ML and I am learning as I go.


Challenge of using group-chat data to train a standard chatbot:
One interesting challenge for this project is that I am using data from industry-specific message board conversations, where there are usually more than two users chatting in an unstructured, unpredictable pattern, while attempting to train a chatbot that will conform to a traditional back-and-forth structure with only two users, namely the user and the AI.

To reduce the risk of the model generating text from multiple perspectives as well as responding too many times, I am including metadata in the training data and the prompts that the model completes.


Training Input Examples:
Here is an example of a training input with only two users:

{"chat": {"users": ["David", "Robert"], "msg_count": 2, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}]}}

But again, it is important to note that most scenarios involve more than two users chatting, as below.

{"chat": {"users": ["David", "Robert", "Max"], "msg_count": 4, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Max", "text": "Example message 4"}]}}

Although I condense consecutive messages from the same user into a single message (so users cannot speak more than once in a row), there is naturally no predictable pattern in the order that users respond. That is why I am including the array of users in the metadata outside of the messages array, so that the LLM will not respond as an irrelevant user, as well as the msg_count, to prevent the LLM from responding too many times.
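
For what it's worth, the condensing step itself is simple; here is a minimal sketch, assuming each message is a dict with "user" and "text" keys as in the examples above (condense is just an illustrative helper, not my exact harvest code):

def condense(messages):
    # Merge consecutive messages from the same user so that no user
    # speaks more than once in a row.
    merged = []
    for msg in messages:
        if merged and merged[-1]["user"] == msg["user"]:
            merged[-1]["text"] += "\n" + msg["text"]
        else:
            merged.append({"user": msg["user"], "text": msg["text"]})
    return merged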


Tokenizing data

I am tokenizing the training dataset as below, where training_files is a list of JSONL files (the order of these files is important, as described further down):

import json
from datasets import load_dataset

train_dataset = load_dataset('json', data_files=training_files)
train_dataset = train_dataset.map(
        lambda examples: tokenizer(json.dumps(examples['chat']),
                                   padding=True,
                                   truncation=True))
train_dataset = train_dataset['train']

Completing Prompts
Note that I am simply tokenizing json.dumps() of the chat value (as opposed to extracting the data and building a string using special tokens like ###Input: or ###Output: or whatever), with the intention of completing prompts as below:

INPUT:
{"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": " ← JSON string is cut off here

COMPLETED OUTPUT:
{"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": "AI response"}]}

I know that this is not a standard way of doing things, but I am including the extra metadata in an attempt to address the below:

  1. Prevent the model from responding as irrelevant users (users array), given that the training data mostly includes more than two users chatting.
  2. Prevent the model from responding too many times (msg_count value), given that the model may otherwise complete messages too many times, and/or speak for the user in addition to the AI.
  3. Capture a message history, as opposed to one-off questions & answers.
  4. Keep the data in JSON format, which is convenient because it can more easily be parsed and manipulated between API calls, though obviously it will require validation.

So the idea is that by including the users array and the msg_count, the AI will respond only as the AI and only the correct amount of times.


Notes on the number of inputs, models, and environment:
The training dataset includes about one million training inputs, but I have only been able to test with about one thousand in Google Colab because of the environment limitations there. Soon I am going to be testing this in a proper environment with all 1M inputs. Most of my testing has been with falcon-7b, and so far the results are promising.


I promise there is a question coming and if you’ve read this far I really appreciate it! I am going into detail because in addition to having some specific questions, I’d love to get people’s thoughts on this process in general.


Layering the training data
During the data harvesting process, I am writing chats with only two users to individual files, so the LLM can be fine-tuned with this data first. Later on, the chats with more than two users are used, as the chat.messages[i].text fields still contain valuable industry-specific sequences.

There are actually five layers to the training data:

  1. General Knowledge: Manually created training inputs covering “general industry knowledge” as well as basic company info that is not covered in the message boards: elementary industry knowledge, support questions like “I am having trouble with my account”, and so on. These are always structured with two users, and I also include manners, such as greeting the user at the start of the chat and inviting the user to keep chatting. I am using GPT-4 to help create varied versions of each input to increase the size of this layer. Also, users[1] is always the AI, to hopefully give these inputs more weight, since in production users[1] will always be the AI, whereas in the real training data the users array is always varied. These inputs come first so the LLM can get a basic idea of the JSON structure and hopefully imprint these mannerisms when users[1] is the AI.

  2. Two Users: All training inputs where there are only two users in the chat. The idea is that the LLM will be weighted more towards these conversations that fit our use case, where only one user and the AI are chatting.

  3. Message History Growth: Every training input from layer two is represented here again, except we show the messages array growing two to three times, consecutively and starting from a random index (see the sketch after this list). The idea is that the LLM will see how messages grows in general as well as in relation to the msg_count value. Because a messages array can sometimes have thousands of replies, we break chats into multiple inputs so each JSON structure does not exceed the context length of a particular model; the messages arrays contain at most ~30 elements depending on their sizes, though often they’re much, much smaller. To avoid over-fitting the model on these inputs, we only show them growing two to three times and start at a random index, so the data is not repeated in full. We always use an even msg_count to represent a user-AI pair of messages.

Example:
{"chat": {"users": ["David", "Robert"], "msg_count": 2, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}]}}
{"chat": {"users": ["David", "Robert"], "msg_count": 4, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Robert", "text": "Example message 4"}]}}
{"chat": {"users": ["David", "Robert"], "msg_count": 6, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Robert", "text": "Example message 4"}, {"user": "David", "text": "Example message 5"}, {"user": "Robert", "text": "Example message 6"}]}}
  4. Gradient Layer: Instead of switching immediately from chats with only two users to chats with more than two users (and sometimes only one, if a single message exceeds the context length), this layer mixes training inputs from the two-users and not-two-users scenarios. We also include the same growth pattern as in layer 3, except all of the inputs are scrambled so the growth is not represented consecutively, with the idea that the growth pattern will be more abstractly represented here.

  5. Not Two Users: We show this data last, where the chat does not have exactly two users. Since this is the bulk of our data, we show it last, after the model has (hopefully) thoroughly learned from the two-users scenarios. The idea is that even though this data is not structured exactly as our use case requires, the message content itself contains industry-specific sequences that are still valuable, and we mitigate the structural issues by including the extra metadata as explained previously.
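
Here is a rough sketch of how one layer-3 growth sequence could be generated from a single chat (growth_inputs is a hypothetical helper and the windowing is simplified; each returned line is one JSONL training input):

import json
import random

def growth_inputs(chat, steps=3, max_msgs=30):
    # Start at a random even index so the whole chat is not repeated in full,
    # then grow the messages window by one user-AI pair per step.
    msgs = chat["messages"]
    start = random.randrange(0, max(len(msgs) - 2, 1), 2)
    lines, prev_end = [], start
    for step in range(1, steps + 1):
        end = min(start + 2 * step, start + max_msgs, len(msgs))
        end -= (end - start) % 2      # keep complete user-AI pairs
        if end <= prev_end:           # stop once the window can no longer grow
            break
        window = msgs[start:end]
        lines.append(json.dumps({"chat": {
            "users": chat["users"],
            "msg_count": len(window),
            "messages": window,
        }}))
        prev_end = end
    return lines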


Side questions

  • Do you agree that it is reasonable to layer the data in this way in an attempt to mitigate the issue of the majority of data involving more than two users?

  • Do you think Layer 3 (the growth layer) is a good idea, or too much of an over-fitting risk? It also increases the size of our two-users dataset, which is ideal, right?

  • Do you think Layer 4 is a good idea, or is it fine to go from two users to more than two users without blending everything?

  • Do you think one million inputs will be enough to make a dent in falcon-7b for a truly impressive chatbot? Ideally the chatbot will be more adept at answering questions related to the industry (which is dentistry, by the way) than GPT-4, which is a pretty tall order. It is being trained on 20 years’ worth of dentists and dental professionals chatting online, plus podcast transcripts of dentists and dental professionals chatting, as well as hundreds of dental magazine articles that have been pretty well formatted to a chat structure.

  • Dumb question, but do you think falcon-40b would be significantly better than falcon-7b at chatting about advanced dental topics? Or will it be overkill? I am also interested in MPT-30b, mostly for the large context length, but I am concerned about the extra cost/time of training and generating with these models. I have yet to successfully train MPT-7b in Colab, and I have gotten LLaMA-2-7b to run and generate output, but it doesn’t appear to actually be learning from the data, so I am doing something wrong; I just haven’t had much time to work with it.


Here are my two real questions:

  1. How can I ensure the Trainer class does not shuffle my training data, since I intentionally want to train the model on the ordered layers described above?
  2. Given we “only” have one million training inputs, I don’t want to lose 10% of the data to create a validation split. I am also not exactly sure how to do this, but I have been led to believe it is probably more effective to include one.

Question #1: avoid shuffling the train split:
Is this a fine way to set the data_collator to avoid shuffling inputs? Or is there a better approach?
Note that I already have logic in place to shuffle the data during the harvest. I figure that instead of running multiple epochs, I can re-harvest the data in between trainings; that way the order of the layers can be maintained while each layer is still randomized each time, and I can thoroughly test the model between trainings. Given our limited cloud resources, and since one million inputs is not too small, I want to be smart about how I go about this so I don’t potentially waste a few days of training.

def custom_collate_fn(batch):
    # Pad each batch to its longest sequence and return PyTorch tensors.
    batch_inputs = tokenizer.pad(
        {"input_ids": [item["input_ids"] for item in batch], "attention_mask": [item["attention_mask"] for item in batch]},
        padding=True,
        return_tensors="pt"
    )
    return batch_inputs

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=custom_collate_fn, 
    train_dataset=train_dataset
)

This was honestly just generated by ChatGPT, so before I start experimenting with it I wanted to get some feedback on this project in general, as well as on whether my idea to layer the training data is valid or just a fantasy.

Question #2: implementing a validation split
As mentioned, I have been led to believe it is a good idea to create a validation split, but I am apprehensive for two reasons. One, I am not even sure how to go about this, and two, I don’t want to “waste” any of our precious training data! But obviously, if in the end this will increase the performance of the chatbot, then that’s what I want to do.

Currently, I have added some extra logic to simply put 90% of the data in one folder (train) and the remaining 10% in another (validation). I can load_dataset with both splits no problem, but it’s useless to just hand this to the Trainer class without a compute_metrics function, right? But what should this function even look like?

I figure I need to tokenize the validation dataset twice: first a completed dataset, and second an incomplete dataset. Then, as long as the elements of both datasets correspond to one another, my compute_metrics function should basically try to complete the incomplete validation input, perform some comparison against the completed version, and then return some data, as indicated by this script that I also just generated with ChatGPT, right?:

def custom_evaluation(eval_pred):
    predictions, labels = eval_pred
    # Convert predictions to text
    # Compare with corresponding labels (complete conversations)
    # Calculate metrics
    return metric_results

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    compute_metrics=custom_evaluation,
)

My confusion is how to access tokenized_validation_dataset from eval_pred within the custom_evaluation function. Obviously I should just debug it and figure it out, but again, I wanted to get some feedback on the process in general before I waste any time since this was mostly just inspired by me jabbering with GPT-4 about all of this.

  • Should I just create a more robust JSON structure, with keys for “complete” and “incomplete” or something, and then pass them both through the tokenized_validation_dataset object and reference them accordingly?

  • Or should I tokenize them separately and pass them individually and wrap them in an object?

  • Or should I not even tokenize them until they reach the custom_evaluation function?

Looking into this today is what sparked me to write up this whole darn thing, and my head is spinning just thinking about it :face_with_spiral_eyes:

Further, what are the main metrics I should be returning? ChatGPT said “Calculate metrics like BLEU score, ROUGE, or other relevant metrics for text generation tasks”, so I should probably find out what that means. But maybe one of you can recommend a better way to go about this process, or maybe tell me to just forget about a validation split, put all my data in the train split, and save me some time.


If you made it this far, you rock.
Sincere thanks for any thoughts or feedback you have.
Cheers

Question 2: don’t think of the validation data as “waste”. You are using this data to evaluate how well your model can predict data it has not seen before. Without it, you will not be able to figure out if your model is overfitting. So validation data is very important. Make sure it represents the “average” distribution of chat topics if possible, just as the training dataset should.

I don’t understand why you need to tokenize the validation dataset again. The custom_evaluation function is going to get the predictions and the labels (or actuals). These should already be tokenized when they are passed to the custom_evaluation function. You need to compare the two and see how closely they match. Since it’s unlikely that the prediction will match the actual exactly, you will use metrics like BLEU or ROUGE to calculate how “close” the prediction is to the actual labels.

Both are actually quite simple in how they work. It shouldn’t take you more than a few minutes to figure out what these represent. Here is a short writeup

Both offer different ways of measuring how many words (or tokens) overlap in the prediction vs. the actual to give you an idea of the quality of the prediction. BLEU has typically been used for translation models and ROUGE for text generation/summarization. Note that neither will tell you if the sentence pairs are semantically similar, i.e. that both mean the same thing.
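
For a concrete feel, a tiny example with the sacrebleu and rouge packages (the same libraries used later in this thread) looks something like this; the sentences are just made-up examples:

import sacrebleu
from rouge import Rouge

preds = ["the AI recommends a fluoride varnish twice a year"]
refs = ["the AI recommends a fluoride varnish every six months"]

print(sacrebleu.corpus_bleu(preds, [refs]).score)            # n-gram overlap, scaled 0-100
print(Rouge().get_scores(preds, refs, avg=True)["rouge-l"])  # longest-common-subsequence based F/P/R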

Question 1: I don’t understand why you don’t want to shuffle the data, but the trainer will shuffle training data by default. The recommended approach is to override get_train_dataloader and write your own loader that does not shuffle.
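
A minimal sketch of that override (untested, assuming the standard Trainer and PyTorch DataLoader APIs):

import transformers
from torch.utils.data import DataLoader, SequentialSampler

class SequentialTrainer(transformers.Trainer):
    def get_train_dataloader(self):
        # Iterate over the training set in its stored (layered) order
        # instead of the default random sampling.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=SequentialSampler(self.train_dataset),
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
        )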

@panigrah thanks for the link.

Q2

I didn’t mean to say I should tokenize them twice; what I meant is that I need to tokenize a completed version of the validation dataset AND an incomplete version of the validation dataset. Or is this wrong?

Meaning, I will basically need one set of inputs in my validation set that are cut-off JSON strings where the AI is next in line to respond, as well as the complete reference JSON strings to compare against.

Or perhaps I only need to include the complete version, and then simply cut the JSON strings at the appropriate place within the custom_evaluation function.

I don’t immediately know how to truncate a string like this once it’s in tokenized form, which is why I thought it would be better to have two versions (complete and incomplete). But perhaps it’s better to use only a completed validation dataset and then figure out how to slice it up after it has been tokenized, so I can use a single dataset.

Q1

Is it not correct that whatever a model is fine-tuned on first will have a greater impact on the weights? My concern is that since most of my training data consists of more than two users chatting, it will be better to train the model first on the instances where only two users are chatting. This is why I explained the five layers of the training data, where we first train on layer 1, followed by layer 2, etc., to mitigate the issue of most of the data not looking like a standard back-and-forth chat between one user and one AI.
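
For reference, the training_files list mentioned earlier is assembled in that layer order during the harvest, roughly like this (the folder names here are hypothetical):

import glob
import random

layer_dirs = ["layer1_general", "layer2_two_users", "layer3_growth",
              "layer4_gradient", "layer5_not_two_users"]

training_files = []
for layer in layer_dirs:
    files = glob.glob(f"{layer}/*.jsonl")
    random.shuffle(files)          # randomize within a layer on each harvest...
    training_files.extend(files)   # ...while preserving the layer order overall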

The validation dataset needs to be processed the same way as the training dataset. So if your training data is being tokenised twice - do the same for validation.

I am stretching beyond what I know; sorry I am not able to help more.

Just in case this somehow helps somebody, it appears my confusion lies in how my dataset is packaged.

Beforehand, I was just training the LLM on the completed JSON strings that I wanted it to complete. While this was working well enough when it came to generating complete JSON strings, I am now essentially wrapping the JSON strings in CSV with two columns: incomplete and complete.

And then I am tokenizing the data using the completed JSON strings as the labels and the incomplete JSON strings as the input_ids and attention_mask.

Apparently, this will allow my validation dataset to work with a compute_metrics function.

I am still updating my code, but I will share my changes after I confirm everything is working.

Okay so I was originally tokenizing only complete JSON string inputs in a JSONL file like this:

...
{"chat": {"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt, perhaps with \"inner quotes\""}, {"user": "AI", "text": "AI response"}]}}
...

But now I am wrapping this in CSV like this, including an incomplete JSON string and a complete JSON string:

incomplete,complete
...
"{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ","{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ""AI response""}]}"
...

Note that to ensure commas do not break the format, we of course wrap the CSV fields in double quotes, and regardless of whether we are escaping a plain double quote (") or an already-escaped double quote (\") within each CSV field, we simply replace every double quote (") with two double quotes (""). Awesome.

Initially, I was truncating my JSON strings after an open quote, like "user": "AI", "text": ", with the idea being that prompts would be formatted this way so the LLM could complete them accordingly. However, I was having issues with the CSV parser when tokenizing because it was freaking out that there was an open quote without its corresponding close quote, so now we cut off the JSON string before the quote, like "user": "AI", "text": .
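
Incidentally, the csv module handles that quote-doubling automatically and keeps the cut-off column well-formed; here is a rough sketch with a hypothetical write_pairs helper (chats is assumed to be a list of the chat objects):

import csv
import json

def write_pairs(chats, path):
    # csv.writer doubles embedded quotes itself, so the JSON strings can be
    # written as-is; the incomplete column is cut just before the AI's text value.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["incomplete", "complete"])
        for chat in chats:
            complete = json.dumps(chat)
            cut = complete.rfind('"text": ') + len('"text": ')
            writer.writerow([complete[:cut], complete])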

Previously, I was tokenizing data like this:

train_dataset = load_dataset('json', data_files=training_files)
train_dataset = train_dataset.map(
        lambda examples: tokenizer(json.dumps(examples['chat']),
                                   padding=True,
                                   truncation=True))
train_dataset = train_dataset['train']

But now, to include the validation dataset and to account for the CSV file, I am doing it as so:

dataset = load_dataset('csv', data_files={"train": train_files, "validation": validation_files})

def tokenize_function(examples):
    # The cut-off JSON strings are the model inputs...
    tokenized_inputs = tokenizer(examples['incomplete'], padding='max_length', truncation=True)
    # ...and the token ids of the complete JSON strings are the labels to predict.
    tokenized_labels = tokenizer(examples['complete'], padding='max_length', truncation=True)

    return {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': tokenized_labels['input_ids']
    }

tokenized_dataset = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_dataset['train']
validation_dataset = tokenized_dataset['validation']

If you’re like me and totally new to all of this stuff, you may have been confused about what input_ids and labels actually correspond to, so please note that you can think of inputs as input_ids (makes sense) and outputs as labels (not as obvious!).

Think of an ML program that reads a bunch of Amazon reviews and then LABELS them as positive, negative, whatever… “label” is the more general ML term, and it applies to scenarios where the label is something else entirely, like a generated image or, in this case, a completed JSON string. Note that we are setting the input_ids of the completed values as the labels.

So by tokenizing our data in this way, with the input_ids and labels pointing to different CSV columns, we can use it in conjunction with a compute_metrics function as so:

import numpy as np
import sacrebleu
from rouge import Rouge
from transformers import EvalPrediction

def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred
    # Note: depending on the model, predictions may arrive as logits rather
    # than token ids, in which case they need an argmax over the vocab first.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 values used for ignored label positions before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels]).score
    rouge = Rouge()
    scores = rouge.get_scores(decoded_preds, decoded_labels, avg=True)
    return {"bleu": bleu, "rouge-l": scores["rouge-l"]["f"]}

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

I have not actually tested this yet, but my model is cooking in Colab now so we will see what happens.
(Update: I ran out of memory :sunglasses: but we have a proper VM getting set up so let’s hope it doesn’t take days to train and then fail before finishing :melting_face:)

I hope this helps somebody.

In hindsight, now that I realize you can more explicitly tell the Trainer class how a JSON string should be completed, it may be totally unnecessary for me to include the users array and the msg_count value. I will definitely do some experimenting with this extra metadata removed to see if the LLM ever generates text for extra users or responds more than once. As I mentioned in my previous post, this LLM is essentially being trained on group chats but intended to be used in chats with only a user and an AI, which is why I included this extra metadata to mitigate the LLM responding from random perspectives or too many times.

Also, I am bummed to read in these two links here and here that it is not so simple to just turn off dataset shuffling, as I put a decent amount of work into my harvest process to order these inputs so I could control how my data is shuffled while still maintaining a scaffolding within the data. Then again, I am not even sure I am correct in my understanding that the order matters in the way I described above, but what can ya do? I guess I can try to override the settings, which I did try, but so far I have not gotten it to work.

Anyways, hope this maybe helps someone who was as confused as I was!