Training over an already trained transformer model

shubham-salt · October 11, 2022, 8:23am

I took a pretrained BERT model and fine tuned it for text classification using a dataset(~3mn records, 46 categories).

Now I want to add some data(~5k records, 10 categories) to the model while keeping the original 46 categories. I just want the model to have all the latest data.

I want to avoid retraining with the full(3mn+5k) data because of time and costs and also because can be recurring activity (3-4 times a week)

Is there a way to do this?

Below is my code setup. I am using HF’s trainer

# imports
import torch
from transformers import TrainerCallback
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import EarlyStoppingCallback
  
# constants
device = torch.device("cuda")
MODEL_NAME = 'bert-large-uncased'
TRAINING_EPOCHS = 20
TRAINING_BATCH_SIZE = 400
EVAL_BATCH_SIZE = 100
  
# dataset from pandas df
tr_dataset = Dataset(x_tr, tr_df.label_encoded.values.tolist())
te_dataset = Dataset(x_te, te_df.label_encoded.values.tolist())
  
# download model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=n_out, ).to(device)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
  
# define training args
args = TrainingArguments(
  output_dir=SAVE_BERT_PATH,
  overwrite_output_dir=True,
  evaluation_strategy="epoch",
  save_strategy="no",
  per_device_train_batch_size=TRAINING_BATCH_SIZE,
  per_device_eval_batch_size=EVAL_BATCH_SIZE,
  num_train_epochs=TRAINING_EPOCHS,
  seed=42,
  fp16=True,
  dataloader_num_workers = 10,
  load_best_model_at_end=False,
  metric_for_best_model="eval_loss",
  greater_is_better=False,
  logging_strategy='epoch',
  logging_first_step=True
  
)
  
# define trainer
trainer = Trainer(
  model=model,
  args=args,
  train_dataset=tr_dataset,
  eval_dataset=te_dataset,
  compute_metrics=compute_metrics)
  
# train and eval
trainer.train()
trainer.evaluate()

marshmellow77 · October 11, 2022, 8:53am

Hi Shubham

Maybe I’m missing something here, but the easiest way seems to be to load your finetuned model (instead of bert-large-uncased) and do additional finetuning on the 5k records. It’s basically the same process as to when you finetuned the model for the first time, but this time with your model and only the additional records.

Or maybe there is somthing that I’m missing that would prevent you from doing that?

Cheers
Heiko

shubham-salt · October 11, 2022, 1:48pm

Hi Heiko. I was doing a step wrong here. I was loading the trained model and specifying the number of categories(like we do with pretrained models) which I shouldnt have been doing.

I trained over the model like you suggested and it works fine now. Thanks for the help!

rgwatwormhill · January 8, 2023, 10:23pm

When you do this, I believe you will find that the most recent data is treated as if it were the most important data. That is, the model will “unlearn” some of the earlier information.
To treat the new data and the old data “fairly”, I think you would need to include the old data in your recent training alongside the new data.
It is possible that the difference is not important.
(I am not an expert).

Topic		Replies	Views
Using Trainer for BertForPretraining does not work 🤗Transformers	1	1349	April 6, 2022
Evaluating Finetuned BERT Model for Sequence Classification Beginners	10	8487	October 25, 2022
Pre-train BERT with HF Trainer 🤗Transformers	0	739	April 22, 2022
Original Bert Pretraining Intermediate	0	546	January 10, 2022
Trouble saving and loading a finetuned model Beginners	1	309	July 7, 2024

Training over an already trained transformer model

Related topics