Questions about ordering training inputs when fine-tuning models

Okay so I was originally tokenizing only complete JSON string inputs in a JSONL file like this:

...
{"chat": {"users": ["Tod", "AI"], "msg_count": 2, "messages": [{"user": "Tod", "text": "Tod's prompt, perhaps with \"inner quotes\""}, {"user": "AI", "text": "AI response"}]}}
...

But now I am wrapping this in CSV like this, including an incomplete JSON string and a complete JSON string:

incomplete,complete
...
"{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ","{""users"": [""Tod"", ""AI""], ""msg_count"": 2, ""messages"": [{""user"": ""Tod"", ""text"": ""Tod's prompt, perhaps with \""inner quotes\""""}, {""user"": ""AI"", ""text"": ""AI response""}]}"
...

Note that to keep commas from breaking the format, we of course just wrap the CSV fields in double quotes, and regardless of whether we need to escape a plain double quote (") or an already-escaped double quote (\") within each CSV field, we simply replace every double quote (") with two double quotes (""). Awesome.
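If you build the CSV with Python's csv module, you get that quoting behavior for free; here is a minimal sketch (incomplete_str and complete_str are just placeholders for the two strings described above):

import csv

# csv.writer doubles any embedded double quotes by default, and QUOTE_ALL
# wraps every field in double quotes, matching the escaping described above.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["incomplete", "complete"])      # header row
    writer.writerow([incomplete_str, complete_str])  # one training pair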

Initially, I was truncating my JSON strings just after an open quote, like "user": "AI", "text": ", the idea being that prompts would be formatted this way so the LLM could complete them accordingly. However, the CSV parser kept freaking out during tokenization because that open quote has no corresponding close quote, so now we cut the JSON string off before the quote, like "user": "AI", "text": .
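In case it helps, here is roughly how the incomplete string can be derived from the complete one (make_incomplete is just a name I made up; the only assumption is that json.dumps puts a space after each colon, so the last "text": key can be found literally):

import json

def make_incomplete(chat: dict) -> str:
    # Serialize the full chat, then cut the string just before the opening
    # quote of the final message's "text" value, as described above.
    complete = json.dumps(chat)
    cut = complete.rfind('"text": "') + len('"text": ')  # stop before the quote
    return complete[:cut]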

Previously, I was tokenizing data like this:

train_dataset = load_dataset('json', data_files=training_files)
train_dataset = train_dataset.map(
        lambda examples: tokenizer(json.dumps(examples['chat']),
                                   padding=True,
                                   truncation=True))
train_dataset = train_dataset['train']

But now, to include the validation dataset and to account for the CSV file, I am doing it like so:

dataset = load_dataset('csv', data_files={"train": train_files, "validation": validation_files})

def tokenize_function(examples):
    # The truncated prompts become the model inputs; the full JSON strings become the targets.
    tokenized_inputs = tokenizer(examples['incomplete'], padding='max_length', truncation=True)
    tokenized_labels = tokenizer(examples['complete'], padding='max_length', truncation=True)

    return {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': tokenized_labels['input_ids']
    }

tokenized_dataset = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_dataset['train']
validation_dataset = tokenized_dataset['validation']

If you’re like me and totally new to all of this stuff, you may have been confused about what input_ids and labels actually correspond to, so note that you can think of inputs as input_ids (makes sense) and outputs as labels (not as obvious!).

Think of an ML program that reads a bunch of Amazon reviews and then LABELS them as positive, negative, whatever… "labels" is the more general ML term, and it also covers scenarios where the label is something else, like a generated image or, in this case, a completed JSON string. Note that we set labels to the input_ids produced from the complete column.
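As a toy illustration (row is just a made-up example, and this assumes tokenizer is already loaded with a pad token set), a single CSV row maps through the tokenize_function above like this:

row = {'incomplete': '{"text": ', 'complete': '{"text": "AI response"}'}
features = tokenize_function(row)
# features['input_ids']      -> token ids of the incomplete string (what the model reads)
# features['attention_mask'] -> padding mask for that incomplete string
# features['labels']         -> token ids of the complete string (what the model should produce)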

So by tokenizing our data in this way, with the input_ids and labels pointing at different CSV columns, we can use this in conjunction with a compute_metrics function like so:

import numpy as np
import sacrebleu
from rouge import Rouge

def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred
    # The Trainer hands over raw logits by default, so reduce them to token ids first.
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Swap the -100s used for ignored label positions for the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels]).score
    rouge = Rouge()
    scores = rouge.get_scores(decoded_preds, decoded_labels, avg=True)
    return {"bleu": bleu, "rouge-l": scores["rouge-l"]["f"]}

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)

I have not actually tested this yet, but my model is cooking in Colab now so we will see what happens.
(Update: I ran out of memory :sunglasses: but we have a proper VM getting set up so let’s hope it doesn’t take days to train and then fail before finishing :melting_face:)

I hope this helps somebody.

In hindsight, now that I realize you can tell the Trainer class more explicitly how a JSON string should be completed, it may be totally unnecessary for me to include the users array and the msg_count value. So I will definitely do some experimenting with this extra metadata removed, to see if the LLM ever generates text for extra users or responds more than once. As I mentioned in my previous post, this LLM is essentially being trained on group chats but is intended to be used in chats with only a user and an AI, so I included this extra metadata to mitigate the LLM responding from random perspectives or responding too many times.
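For reference, the stripped-down chat object I have in mind would keep only the messages array, something like:

{"messages": [{"user": "Tod", "text": "Tod's prompt"}, {"user": "AI", "text": "AI response"}]}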

Also, I am bummed to read in these two links here and here that it is not so simple to just turn off dataset shuffling. I put a decent amount of work into my harvest process to order these inputs so that I control how my data is shuffled while still maintaining a scaffolding within the data. Then again, I am not even sure my understanding is correct that the order matters in the way I described above, but what can ya do? I guess I can try to override the settings, which I tried, but so far have not gotten it to work.
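For anyone who wants to try that override themselves, the approach I have seen suggested (untested by me, and the method name/signature may differ between transformers versions, so treat it as a sketch) is to subclass Trainer and return a sequential sampler instead of a random one:

from torch.utils.data import SequentialSampler
import transformers

class NoShuffleTrainer(transformers.Trainer):
    # Return a sequential sampler so training examples are seen in the
    # order they appear in train_dataset instead of being shuffled.
    def _get_train_sampler(self):
        return SequentialSampler(self.train_dataset)

You would then construct NoShuffleTrainer(...) with the same arguments as the Trainer above.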

Anyways, hope this maybe helps someone who was as confused as I was!