PEFT for Token Classification with Large Language Models

Hi folks. I am attempting to use large language models (specifically Phi-3-mini) as a token classifier. This was recently made easy to do with the transformers library thanks to the Phi3ForTokenClassification implementation. However, I am having difficulty training this model via Parameter-Efficient Fine-Tuning (PEFT, specifically LoRA).

I am creating an instance of Phi3ForTokenClassification from the pre-trained Phi-3-mini model as follows:

import torch
from transformers import Phi3ForTokenClassification

model = Phi3ForTokenClassification.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    attn_implementation="flash_attention_2",
    num_labels=len(labels_vocab),
    id2label=id2label,
    label2id=label2id,
    use_cache=False,
    torch_dtype=torch.bfloat16
)
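
For context, labels_vocab, id2label, and label2id are built from my dataset's label set (57 labels in total). A minimal sketch, with made-up label names:

labels_vocab = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]  # illustrative; my real vocabulary has 57 labels
id2label = {i: label for i, label in enumerate(labels_vocab)}
label2id = {label: i for i, label in enumerate(labels_vocab)}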

As expected, since the head of this model gets replaced with a linear layer for predicting the token labels, I get a warning that this new layer has not been trained yet:

Some weights of Phi3ForTokenClassification were not initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

At this point, I am assuming that I need to fine-tune the core model layers (i.e. attention heads, MLP, etc.) and fully train that last classifier layer.

I am training on an RTX 4090 (24 GB of VRAM). As such, I need to leverage PEFT with LoRA, which I configure as follows:

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="TOKEN_CLS",
    # apply LoRA adapters to the attention and MLP projections
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # train the new classifier head in full alongside the adapters
    modules_to_save=["classifier"],
    inference_mode=False
)
peft_model = get_peft_model(model, peft_config)

When I check the number of trainable parameters with peft_model.print_trainable_parameters(), I get trainable params: 18,000,953 || all params: 3,740,755,058 || trainable%: 0.4812, which seems right to me.
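
To double-check, this quick sanity sketch lists everything that requires gradients, which should cover both the LoRA adapter matrices and the full classifier head:

# Print every trainable parameter of the wrapped model with its shape.
for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))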

Based on what I've researched about modules_to_save, this seems like the right configuration and should result in a full training of the classifier module. When I print the model details, this is the classifier layer: (classifier): Linear(in_features=3072, out_features=57, bias=True).
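
(As a quick sanity check on that head's size: 3072 × 57 weights + 57 biases = 175,161 parameters, a small slice of the ~18 M trainable total above.)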

Since this is LoRA and we're training new weights, I drafted my training configuration with a fairly aggressive learning rate, as follows:

from transformers import TrainingArguments

training_args = TrainingArguments(
    bf16=True,
    output_dir="outputs",
    learning_rate=(2e-4 * 4),  # i.e. 8e-4
    gradient_accumulation_steps=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    logging_steps=10,
    logging_strategy="steps",
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="wandb"
)
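
(With per_device_train_batch_size=4 and gradient_accumulation_steps=4, the effective batch size works out to 4 × 4 = 16 examples per optimizer step.)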

And I am training with:

from transformers import Trainer

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    # computes precision, recall, accuracy, and F1 (sketch below)
    compute_metrics=compute_metrics,
)
trainer.train()
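
For completeness, compute_metrics follows the standard seqeval recipe for token classification. A sketch, assuming the evaluate library with the seqeval metric and the id2label mapping from earlier (mine is equivalent in spirit):

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labelled -100 (special tokens / non-first sub-words).
    true_predictions = [
        [id2label[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }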

The training seems to run correctly. Every epoch, the precision, recall, accuracy, and F1 scores all look reasonable and keep improving. After the 1st epoch my F1 score is ~0.66, improving to ~0.72 after the 2nd.

Once my short training run is complete, I save the model as follows:

# merge the LoRA adapters (and the fully trained classifier head) back into the base model
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("model-name")
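
(Side note: I believe calling tokenizer.save_pretrained("model-name") alongside this would make the checkpoint fully self-contained, but below I just load the tokenizer from the base model.)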

To test my model, I load it for inference as follows:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model = AutoModelForTokenClassification.from_pretrained("model-name")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

token_classifier = pipeline("ner", model=model, tokenizer=tokenizer)
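
A quick smoke test then looks like this (the sentence is just an illustration):

example = "Apple CEO Tim Cook visited Paris in June."
print(token_classifier(example))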

I get very poor results during inference, and I am not completely convinced I am doing this correctly. I made a few assumptions above regarding how the training would work that I am not sure are correct.

I rented an A6000 Ada to do a full (non-PEFT) training run on the same dataset. After 2 epochs, that run had lower accuracy, precision, recall, and F1 scores, yet it performed significantly better during test inference.

Does anyone have any suggestions on how I can make this better? I am not afraid to dive deep into material; I have a ton to learn and I'm here for it. Thanks in advance!

Does anyone have any insights? Sorry to bump this. Not sure where else to ask.