Questions about ordering training inputs when fine-tuning models

Okay so I was originally tokenizing only complete JSON string inputs in a JSONL file like this:

...
{"chat": {"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt, perhaps with \"inner quotes\""}, {"user": "AI", "text": "AI response"}]}}
...

But now I am wrapping this in CSV like this, including an incomplete JSON string and a complete JSON string:

incomplete,complete
...
"{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ","{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ""AI response""}]}"
...

Note that to keep commas from breaking the format, we of course just wrap the CSV fields in double quotes, and regardless of whether we need to escape a plain double quote (") or an already-escaped double quote (\") within each CSV field, we simply replace every double quote (") with two double quotes (""). Awesome.
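If you build the CSV with Python's csv module, you get that quoting behavior for free; here is a minimal sketch (incomplete_str and complete_str are just placeholders for the two strings described above):

import csv

# csv.writer doubles any embedded double quotes by default, and QUOTE_ALL
# wraps every field in double quotes, matching the escaping described above.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["incomplete", "complete"])      # header row
    writer.writerow([incomplete_str, complete_str])  # one training pair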

Initially, I was truncating my JSON strings just after an open quote, like "user": "AI", "text": ", the idea being that prompts would be formatted this way so the LLM could complete them accordingly. However, the CSV parser kept freaking out during tokenization because that open quote has no corresponding close quote, so now we cut the JSON string off before the quote, like "user": "AI", "text": .
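In case it helps, here is roughly how the incomplete string can be derived from the complete one (make_incomplete is just a name I made up; the only assumption is that json.dumps puts a space after each colon, so the last "text": key can be found literally):

import json

def make_incomplete(chat: dict) -> str:
    # Serialize the full chat, then cut the string just before the opening
    # quote of the final message's "text" value, as described above.
    complete = json.dumps(chat)
    cut = complete.rfind('"text": "') + len('"text": ')  # stop before the quote
    return complete[:cut]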

Previously, I was tokenizing data like this:

train_dataset = load_dataset('json', data_files=training_files)
train_dataset = train_dataset.map(
        lambda examples: tokenizer(json.dumps(examples['chat']),
                                   padding=True,
                                   truncation=True))
train_dataset = train_dataset['train']

But now, to include the validation dataset and to account for the CSV file, I am doing it like so:

dataset = load_dataset('csv', data_files={"train": train_files, "validation": validation_files})

def tokenize_function(examples):
    # The truncated prompts become the model inputs; the full JSON strings become the targets.
    tokenized_inputs = tokenizer(examples['incomplete'], padding='max_length', truncation=True)
    tokenized_labels = tokenizer(examples['complete'], padding='max_length', truncation=True)

    return {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': tokenized_labels['input_ids']
    }

tokenized_dataset = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_dataset['train']
validation_dataset = tokenized_dataset['validation']

If you’re like me and totally new to all of this stuff, you may have been confused about what input_ids and labels actually correspond to, so note that you can think of inputs as input_ids (makes sense) and outputs as labels (not as obvious!).

Think of an ML program that reads a bunch of Amazon reviews and then LABELS them as positive, negative, whatever… "labels" is the more general ML term, and it also covers scenarios where the label is something else, like a generated image or, in this case, a completed JSON string. Note that we set labels to the input_ids produced from the complete column.
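As a toy illustration (row is just a made-up example, and this assumes tokenizer is already loaded with a pad token set), a single CSV row maps through the tokenize_function above like this:

row = {'incomplete': '{"text": ', 'complete': '{"text": "AI response"}'}
features = tokenize_function(row)
# features['input_ids']      -> token ids of the incomplete string (what the model reads)
# features['attention_mask'] -> padding mask for that incomplete string
# features['labels']         -> token ids of the complete string (what the model should produce)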

So by tokenizing our data in this way, with the input_ids and labels pointing at different CSV columns, we can use this in conjunction with a compute_metrics function like so:

import numpy as np
import sacrebleu
from rouge import Rouge

def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred
    # The Trainer hands over raw logits by default, so reduce them to token ids first.
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Swap the -100s used for ignored label positions for the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels]).score
    rouge = Rouge()
    scores = rouge.get_scores(decoded_preds, decoded_labels, avg=True)
    return {"bleu": bleu, "rouge-l": scores["rouge-l"]["f"]}

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

I have not actually tested this yet, but my model is cooking in Colab now so we will see what happens.
(Update: I ran out of memory :sunglasses: but we have a proper VM getting set up so let’s hope it doesn’t take days to train and then fail before finishing :melting_face:)

I hope this helps somebody.

In hindsight, now that I realize you can tell the Trainer class more explicitly how a JSON string should be completed, it may be totally unnecessary for me to include the users array and the msg_count value. So I will definitely do some experimenting with this extra metadata removed, to see if the LLM ever generates text for extra users or responds more than once. As I mentioned in my previous post, this LLM is essentially being trained on group chats but is intended to be used in chats with only a user and an AI, so I included this extra metadata to mitigate the LLM responding from random perspectives or responding too many times.
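For reference, the stripped-down chat object I have in mind would keep only the messages array, something like:

{"messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": "AI response"}]}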

Also, I am bummed to read in these two links here and here that it is not so simple to just turn off dataset shuffling. I put a decent amount of work into my harvest process to order these inputs so that I control how my data is shuffled while still maintaining a scaffolding within the data. Then again, I am not even sure my understanding is correct that the order matters in the way I described above, but what can ya do? I guess I can try to override the settings, which I tried, but so far have not gotten it to work.
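For anyone who wants to try that override themselves, the approach I have seen suggested (untested by me, and the method name/signature may differ between transformers versions, so treat it as a sketch) is to subclass Trainer and return a sequential sampler instead of a random one:

from torch.utils.data import SequentialSampler
import transformers

class NoShuffleTrainer(transformers.Trainer):
    # Return a sequential sampler so training examples are seen in the
    # order they appear in train_dataset instead of being shuffled.
    def _get_train_sampler(self):
        return SequentialSampler(self.train_dataset)

You would then construct NoShuffleTrainer(...) with the same arguments as the Trainer above.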

Anyways, hope this maybe helps someone who was as confused as I was!