Context:
I am working on a project where I am trying to build a chatbot trained on industry-specific data.
This is my first experience with anything touching ML and I am learning as I go.
Challenge of using group-chat data to train a standard chatbot:
One interesting challenge for this project is that I am using data from industry-specific message-board conversations, where there are usually more than two users chatting in an unstructured, unpredictable pattern, while attempting to train a chatbot that conforms to a traditional back-and-forth structure with only two participants: the user and the AI.
To reduce the risk of the model generating text from multiple perspectives, or responding too many times, I am including metadata in both the training data and the prompts that the model completes.
Training Input Examples:
Here is an example of a training input with only two users:
{"chat": {"users": ["David", "Robert"], "msg_count": 2, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}]}}
But again, it is important to note that most scenarios involve more than two users chatting, as below.
{"chat": {"users": ["David", "Robert", "Max"], "msg_count": 4, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Max", "text": "Example message 4"}]}}
Although I condense consecutive messages from the same user into a single message, so a user can never speak more than once in a row, there is naturally no predictable pattern to the order in which users respond. This is why I am including the array of `users` in the metadata outside of the `messages` array, so that the LLM will not respond as an irrelevant user, as well as the `msg_count`, to prevent the LLM from responding too many times.
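(The condensing step itself is straightforward; it is roughly something like this sketch:)

def condense_consecutive(messages):
    # Merge runs of messages from the same user into a single message,
    # so no user ever appears twice in a row.
    condensed = []
    for msg in messages:
        if condensed and condensed[-1]["user"] == msg["user"]:
            condensed[-1]["text"] += "\n" + msg["text"]
        else:
            condensed.append({"user": msg["user"], "text": msg["text"]})
    return condensed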
Tokenizing data
I am tokenizing the training dataset as below, where `training_files` is a list of JSONL files (the order of these files is important, as described later on):
import json
from datasets import load_dataset

train_dataset = load_dataset('json', data_files=training_files)
# Each example is tokenized as the raw JSON string of its chat object.
train_dataset = train_dataset.map(
    lambda example: tokenizer(json.dumps(example['chat']),
                              padding=True,
                              truncation=True))
train_dataset = train_dataset['train']
Completing Prompts
Note that I am simply tokenizing `json.dumps()` of the `chat` value (as opposed to extracting the data and building a string with special tokens like ###Input: or ###Output: or whatever), with the intention of completing prompts as below:
INPUT:
{"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": "
(the JSON string is intentionally cut off here)
COMPLETED OUTPUT:
{"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": "AI response"}]}
I know that this is not a standard way of doing things, but I am including the extra metadata in an attempt to address the following:
- Prevent the model from responding as irrelevant users (the `users` array), given that the training data mostly includes more than two users chatting.
- Prevent the model from responding too many times (the `msg_count` value), given that the model may otherwise complete too many messages and/or speak for the user in addition to the AI.
- Capture a message history, as opposed to one-off questions and answers.
- It is convenient to have the data in JSON format so it can more easily be parsed and manipulated between API calls, though obviously it will require validation.
So the idea is that by including the `users` array and the `msg_count`, the AI will respond only as the AI, and only the correct number of times.
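As a rough idea of the kind of validation I mean for the completed output, something like this hypothetical check (the exact rules are still up in the air):

import json

def validate_completion(completed, expected_users, expected_msg_count):
    # Basic sanity checks on the model's completed JSON before trusting it.
    try:
        chat = json.loads(completed)
    except json.JSONDecodeError:
        return False
    if sorted(chat.get("users", [])) != sorted(expected_users):
        return False
    if chat.get("msg_count") != expected_msg_count:
        return False
    messages = chat.get("messages", [])
    if len(messages) != expected_msg_count:
        return False
    # The final message should come from the AI and no one else.
    return bool(messages) and messages[-1].get("user") == "AI"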
Notes on the number of inputs, models, and environment:
The training dataset includes about one million training inputs, but I have only been able to test with about one thousand in Google Colab because of the environment limitations there. Soon I will be testing this in a proper environment with all 1M inputs. Most of my testing has been with falcon-7b, and so far the results are promising.
I promise there is a question coming, and if you've read this far I really appreciate it! I am going into detail because, in addition to having some specific questions, I'd love to get people's thoughts on this process in general.
Layering the training data
During the data harvesting process, I am writing chats with only two users to individual files, so the LLM can be fine-tuned with this data first. Later on, the chats with more than two users are used, as their `chat.messages[i].text` fields still contain valuable industry-specific sequences.
There are actually five layers to the training data:
1. General Knowledge: Manually created training inputs covering "general industry knowledge" as well as basic company info that is not covered in the message boards: elementary industry knowledge, support questions like "I am having trouble with my account", and so on. These are always structured with two users, and I also include manners, such as greeting the user at the start of the message and inviting the user to keep chatting. I am using GPT-4 to help create varied versions of each input to increase the size of this layer. Also, `users[1]` is always the AI, in order to hopefully give these inputs more weight, since in production `users[1]` will always be the AI, whereas in the real training data the `users` array is always varied. These inputs come first so the LLM can get a basic idea of the JSON structure, and hopefully imprint these mannerisms for when `users[1]` is the AI.
2. Two Users: All training inputs where there are only two users in the chat. The idea is that the LLM will be weighted more towards these conversations, which fit our use case of only one user and the AI chatting.
3. Message History Growth: Every training input from layer 2 is represented here again, except we show the `messages` array growing two to three times, consecutively and starting from a random index (see the sketch after this list). The idea is that the LLM will see how `messages` grows in general, as well as in relation to the `msg_count` value. The `messages` array can sometimes have thousands of replies, so we break conversations into multiple inputs to keep each JSON structure within a given model's context length; the `messages` arrays contain at most ~30 elements depending on their sizes, though often they are much, much smaller. To avoid over-fitting the model on these inputs, we only show them growing two to three times, starting at a random index, so the data is never repeated in full. We always use an even `msg_count` to represent user-AI pairs of messages.
Example:
{"chat": {"users": ["David", "Robert"], "msg_count": 2, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}]}}
{"chat": {"users": ["David", "Robert"], "msg_count": 4, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Robert", "text": "Example message 4"}]}}
{"chat": {"users": ["David", "Robert"], "msg_count": 6, "messages": [{"user": "David", "text": "Example message 1"}, {"user": "Robert", "text": "Example message 2"}, {"user": "David", "text": "Example message 3"}, {"user": "Robert", "text": "Example message 4"}, {"user": "David", "text": "Example message 5"}, {"user": "Robert", "text": "Example message 6"}]}}
4. Gradient Layer: Instead of an immediate jump from chats with only two users to chats with more than two users (and sometimes only one, when a single message exceeds the context length), this layer mixes training inputs from the two-user and not-two-user scenarios. It also includes the same growth pattern as layer 3, except all of the inputs are scrambled so the growth is not shown consecutively, with the idea that the growth pattern is represented more abstractly here.
5. Not Two Users: We show this data last: the chats that do not have exactly two users. Since this is the bulk of our data, it comes after the model has (hopefully) thoroughly learned from the two-user scenarios. The idea is that even though this data is not structured exactly as our use case requires, the message content itself contains industry-specific sequences that are still valuable, and we mitigate the structural mismatch with the extra metadata explained previously.
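As referenced in layer 3 above, here is a simplified sketch of the growth-input idea, generating a few snapshots of a single two-user chat (the real harvesting code is messier and also enforces the ~30-message cap):

import json
import random

def growth_inputs(chat, num_growths=3):
    # Show the messages array growing in user-AI pairs, starting from a
    # random (even) index so the full conversation is not repeated verbatim.
    messages = chat["messages"]
    stop = max(len(messages) - 2 * num_growths + 1, 1)
    start = random.randrange(0, stop, 2)
    inputs = []
    for i in range(1, num_growths + 1):
        window = messages[start:start + 2 * i]
        inputs.append(json.dumps({"chat": {
            "users": chat["users"],
            "msg_count": len(window),  # grows by one user-AI pair per step
            "messages": window,
        }}))
    return inputs

Each layer is written out to its own JSONL files, which is essentially where the ordering of the `training_files` list mentioned earlier comes from.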
Side questions
- Do you agree that it is reasonable to layer the data in this way, in an attempt to mitigate the issue of the majority of the data involving more than two users?
- Do you think Layer 2 is a good idea, or too much of an over-fitting risk? It also increases the size of our two-user dataset, which is ideal, right?
- Do you think Layer 4 is a good idea, or is it fine to go from two users to more than two users without blending everything?
- Do you think one million inputs will be enough to make a dent in falcon-7b for a truly impressive chatbot? Ideally the chatbot will be more adept at answering questions related to the industry (which is dentistry, by the way) than GPT-4, which is a pretty tall order. It is being trained on 20 years' worth of dentists and dental professionals chatting online, plus podcast transcripts of dentists and dental professionals in conversation, as well as hundreds of dental magazine articles that have been reformatted fairly well into a chat structure.
- Dumb question, but do you think falcon-40b would be significantly better than falcon-7b at chatting about advanced dental topics, or would it be overkill? I am also interested in MPT-30b, mostly for the large context length, but I am concerned about the extra cost and time of training and generating with these models. I have yet to successfully train MPT-7b in Colab, and I have gotten LLaMA-2-7b to run and generate output, but it doesn't appear to actually be learning from the data, so I am doing something wrong; I just haven't had much time to dig into it.
Here are my two real questions:
- How can I ensure the `Trainer` class does not shuffle my training data, since I intentionally want to train the model on the ordered layers described above?
- Given that we "only" have one million training inputs, I don't want to lose 10% of the data to create a validation split; not to mention I am not exactly sure how to do this, though I have been led to believe it is probably more effective to include one.
Question #1: avoid shuffling the train split:
Is this a fine way to set the `data_collator` to avoid shuffling inputs? Or is there a better approach?
Note that I already have logic in place to shuffle the data during the harvest. I figure that instead of running multiple epochs, I can re-harvest the data in between trainings; this way the order of the layers can be maintained while each layer is still randomized each time, and I can thoroughly test the model in between each training. Given our limited cloud resources, and since one million inputs is not exactly small, I want to be smart about how I go about this so I don't potentially waste a few days of training.
import torch
import transformers

def custom_collate_fn(batch):
    # Pad each batch to a common length.
    batch_inputs = tokenizer.pad(
        {"input_ids": [item["input_ids"] for item in batch],
         "attention_mask": [item["attention_mask"] for item in batch]},
        padding=True,
        return_tensors="pt",
    )
    # Causal LM fine-tuning also needs labels; use the input_ids themselves,
    # with padding positions set to -100 so they are ignored by the loss.
    labels = batch_inputs["input_ids"].clone()
    labels[batch_inputs["attention_mask"] == 0] = -100
    batch_inputs["labels"] = labels
    return batch_inputs

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=custom_collate_fn,
    train_dataset=train_dataset,
)
This was honestly just generated by ChatGPT, and so before I start experimenting with this I wanted to get some feedback on this project in general as well as whether my idea to layer the training data is valid or just a fantasy.
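Another sketch along the same lines that I have not tried yet: subclassing `Trainer` to force a sequential sampler (note that `_get_train_sampler` is a private method, so this may differ between transformers versions):

import torch
import transformers

class SequentialTrainer(transformers.Trainer):
    # Return a sequential sampler so the training dataloader walks the
    # dataset in the exact order it was given, layer by layer.
    def _get_train_sampler(self):
        return torch.utils.data.SequentialSampler(self.train_dataset)

trainer = SequentialTrainer(
    model=model,
    args=training_args,
    data_collator=custom_collate_fn,
    train_dataset=train_dataset,
)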
Question #2: implementing a validation split
As mentioned, I have been led to believe it is a good idea to create a validation split, but I am apprehensive for two reasons. One, I am not even sure how to go about this, and two, I don't want to "waste" any of our precious training data! But obviously, if in the end this will increase the performance of the chatbot, then that's what I want to do.
Currently, I put in some extra logic to simply write 90% of the data to one folder (train) and the remaining 10% to another (validation). I can `load_dataset` with both splits no problem, but it's useless to just hand this to the `Trainer` class without a `compute_metrics` function, right? And what should that function even look like?
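(For reference, loading the two folders looks roughly like this; the paths are just placeholders for my actual ones.)

from datasets import load_dataset

# Placeholder paths for the two folders produced during harvesting.
dataset = load_dataset('json', data_files={
    'train': 'data/train/*.jsonl',
    'validation': 'data/validation/*.jsonl',
})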
I figure I need to tokenize the validation dataset twice: first as a completed dataset, and second as an incomplete dataset. Then, as long as the elements of both datasets correspond to one another, my `compute_metrics` function should basically try to complete the incomplete validation input, perform some comparison against the completed version, and return some data, as indicated by this script that I also just generated with ChatGPT, right?
def custom_evaluation(eval_pred):
    predictions, labels = eval_pred
    # Convert predictions to text
    # Compare with corresponding labels (complete conversations)
    # Calculate metrics
    metric_results = {}  # e.g. {"bleu": ..., "rouge": ...}
    return metric_results
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train_dataset,
eval_dataset=tokenized_validation_dataset,
compute_metrics=custom_evaluation,
)
My confusion is how to access `tokenized_validation_dataset` from `eval_pred` within the `custom_evaluation` function. Obviously I should just debug it and figure it out, but again, I wanted to get some feedback on the process in general before I waste any time, since this was mostly just inspired by me jabbering with GPT-4 about all of this.
- Should I just create a more robust JSON structure, with keys for "complete" and "incomplete" or something, and then pass them both through the `tokenized_validation_dataset` object and reference them accordingly?
- Or should I tokenize them separately, pass them individually, and wrap them in an object?
- Or should I not even tokenize them until they reach the `custom_evaluation` function?
Looking into this today is what sparked me to write up this whole darn thing, and my head is spinning just thinking about it.
Further, what are the main metrics I should be returning? ChatGPT said "Calculate metrics like BLEU score, ROUGE, or other relevant metrics for text generation tasks," so I should probably find out what that means, but maybe one of you can recommend a better way to go about this process, or maybe tell me to just forget about a validation split, put all my data in the train split, and save me some time.
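For the sake of discussion, here is a minimal sketch of the shape I think a ROUGE-based `compute_metrics` would take, using the `evaluate` library (untested; it assumes the default behavior where the Trainer passes raw logits, and that the tokenizer has a pad token set):

import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def custom_evaluation(eval_pred):
    logits, labels = eval_pred
    # Take the argmax over the vocabulary to get predicted token ids.
    pred_ids = np.argmax(logits, axis=-1)
    # Labels use -100 for ignored positions; swap those for the pad token
    # before decoding back to text.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_text = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_text = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=pred_text, references=label_text)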
If you made it this far, you rock.
Sincere thanks for any thoughts or feedback you have.
Cheers