Fine-tuning BERT Model on domain specific language and for classification

Hi guys

First of all, what I am trying to do: I want to fine-tune a BERT model on domain-specific language and, in a second step, fine-tune it further for classification. To do so, I want to start from a pretrained model, which forces me to use the original tokenizer (I cannot use my own vocabulary). I would like to share my code and hear your opinions (are there any mistakes?):

First we load the pre-trained tokenizer and model:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

We are using BertForMaskedLM since the first fine-tuning step is to train the model on domain-specific language (a text file with one sentence per line). Next we read the text file:

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="test.txt",
        block_size=128
)

and define the data collator as:

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
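
For reference, the masking that this collator applies can be sketched in plain Python. This is only an illustration of the standard BERT scheme (15% of the non-special tokens are selected; of those, 80% become [MASK], 10% become a random token and 10% stay unchanged), not the library's actual code. The function name mask_tokens and its parameters are hypothetical; 101, 102 and 103 are the [CLS], [SEP] and [MASK] ids of bert-base-uncased, and the other ids are arbitrary.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) for one tokenized sentence."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    # -100 is the label value the loss ignores, so only selected
    # positions contribute to the MLM loss.
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with p = mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Illustrative sentence: [CLS] + arbitrary token ids + [SEP]
example_ids = [101, 2023, 2003, 1037, 7953, 6251, 102]
masked, labels = mask_tokens(example_ids, mask_id=103, vocab_size=30522,
                             special_ids={0, 101, 102}, seed=0)
print(masked)
print(labels)
```

Because masking happens in the collator, each epoch sees a different random mask for the same sentence.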

Finally we are training the model for MLM:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./TestBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

trainer.train()

We save the model and reload it for sequence classification (huggingface handles the heads):

from transformers import BertForSequenceClassification

trainer.save_model("./TestBERT")
model = BertForSequenceClassification.from_pretrained("./TestBERT", num_labels=2)
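
What "handles the heads" means here can be made concrete by comparing parameter names. The sketch below is a simplified view of the matching that from_pretrained reports in its warning, not the library's implementation; compare_checkpoint_keys is a hypothetical helper and the key lists are abbreviated examples. Note that BertForMaskedLM is built without the pooling layer, so besides the new classifier the pooler of the classification model is also freshly initialized.

```python
def compare_checkpoint_keys(checkpoint_keys, model_keys):
    """Group parameter names by what happens to them when a checkpoint
    is loaded into a model with a different head."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    return {
        "loaded": sorted(checkpoint_keys & model_keys),
        "unused_from_checkpoint": sorted(checkpoint_keys - model_keys),
        "newly_initialized": sorted(model_keys - checkpoint_keys),
    }


# Abbreviated, illustrative key lists (real models have many more entries).
mlm_keys = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "cls.predictions.bias",            # MLM head, dropped on reload
]
clf_keys = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.pooler.dense.weight",        # not part of the MLM checkpoint
    "classifier.weight",               # new classification head
]

report = compare_checkpoint_keys(mlm_keys, clf_keys)
for group, names in report.items():
    print(group, names)

# With real models one could pass the state dict keys instead, e.g.
# compare_checkpoint_keys(mlm_model.state_dict().keys(),
#                         clf_model.state_dict().keys())
```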

Finally we can fine-tune the model for sequence classification as usual. E.g.:

!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz


from pathlib import Path
def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')


from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)


train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)


import torch
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)


from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
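
As written, the Trainer only reports the loss on val_dataset, and test_dataset is never used. A minimal sketch of an accuracy metric that could be passed to the Trainer through its compute_metrics argument is shown below. It is written in plain Python so it also accepts nested lists; the Trainer itself would hand it NumPy arrays, which is why the tolist() conversion is there.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), as unpacked from an EvalPrediction."""
    logits, labels = eval_pred
    # Convert NumPy arrays (what the Trainer passes) to plain lists.
    if hasattr(logits, "tolist"):
        logits = logits.tolist()
    if hasattr(labels, "tolist"):
        labels = labels.tolist()
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Tiny illustrative check with made-up logits for three examples.
print(compute_metrics(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]], [1, 0, 0])))

# Possible usage with the objects defined above:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=val_dataset,
#                   compute_metrics=compute_metrics)
# trainer.train()
# print(trainer.evaluate(test_dataset))
```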

Does anyone detect any obvious mistakes I am making, or is this the correct procedure? Further, I want to freeze some layers during the first fine-tuning step to avoid forgetting what the model learned during pre-training. I assume I would have to write my own trainer for that (I will do so and maybe comment on this post).
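
For freezing a fixed set of layers, a custom trainer is probably not needed: parameters whose requires_grad is False before training starts receive no gradient updates. A minimal sketch, assuming the BertForMaskedLM layout in which the encoder sits under model.bert; freeze_lower_layers and num_frozen_layers are hypothetical names, and the choice of 8 layers is arbitrary.

```python
def freeze_lower_layers(model, num_frozen_layers=8):
    """Freeze the embeddings and the first `num_frozen_layers` encoder layers.

    Assumes a BERT-style model exposing `model.bert.embeddings` and
    `model.bert.encoder.layer` (a list of transformer blocks).
    """
    # In BertForMaskedLM the output decoder weight is tied to the word
    # embeddings by default, so freezing the embeddings freezes it as well.
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:num_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False


# Possible usage, before building the Trainer for the MLM step:
# freeze_lower_layers(model, num_frozen_layers=8)
# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=dataset)
```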

Best

Does anyone detect any obvious mistakes I am making or is this the correct proceeding

This question seems pretty vague. Could you maybe post a specific question so we can help better? 🙂

So my question is whether this is the correct approach to my goal: first further pre-training the model on domain-specific language (this is the part I am least sure about), and then fine-tuning it for sequence classification.

Most of the time we do not first fine-tune with MLM and then fine-tune further for classification; we usually fine-tune on the classification task directly. Results should not differ much, although that depends on how different your dataset is from the general domain. You can also incrementally unfreeze the LM during the classification task. Fine-tuning with MLM first does not seem worth it to me, but if your experiments show that it gives better performance for your dataset/task, then that is fine.

Thank you for the answer. I also think that to actually achieve better results, the model needs to be trained from scratch (with specific vocab). But I wanted to give it a try. Do you have any experience with how long it takes to train from scratch to get similar results to pre-trained models?

I know that with

for param in model.base_model.parameters():
    param.requires_grad = False

one can freeze the whole base model, but I do not know how to unfreeze layers gradually with the Hugging Face Trainer, since I do not write the epoch loop myself and therefore cannot change the number of frozen layers in each epoch.
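
One possible way to do this without writing the training loop is a TrainerCallback, whose on_epoch_begin hook is called by the Trainer and receives the model. The sketch below assumes BERT-style parameter names (containing encoder.layer.<i>.) and 12 encoder layers; set_trainable_top_layers, GradualUnfreezeCallback and layers_per_epoch are hypothetical names. One caveat that depends on the transformers version: the optimizer may be built only from parameters that require gradients at the time it is created, so it is worth verifying that newly unfrozen weights actually change during training.

```python
import re

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch can be imported without transformers installed.
    TrainerCallback = object

LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def set_trainable_top_layers(model, num_trainable, num_layers=12):
    """Keep the head and the top `num_trainable` encoder layers trainable."""
    first_trainable = max(num_layers - num_trainable, 0)
    for name, param in model.named_parameters():
        match = LAYER_PATTERN.search(name)
        if match:
            param.requires_grad = int(match.group(1)) >= first_trainable
        elif "embeddings." in name:
            # Unfreeze the embeddings only once every encoder layer is trainable.
            param.requires_grad = num_trainable >= num_layers
        else:
            param.requires_grad = True  # pooler and classification head


class GradualUnfreezeCallback(TrainerCallback):
    """Unfreeze `layers_per_epoch` more encoder layers at each epoch start."""

    def __init__(self, layers_per_epoch=2, num_layers=12):
        self.layers_per_epoch = layers_per_epoch
        self.num_layers = num_layers

    def on_epoch_begin(self, args, state, control, model=None, **kwargs):
        epoch = int(round(state.epoch or 0))  # state.epoch is a float
        num_trainable = min(epoch * self.layers_per_epoch, self.num_layers)
        set_trainable_top_layers(model, num_trainable, self.num_layers)


# Possible usage with the classification Trainer:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=val_dataset,
#                   callbacks=[GradualUnfreezeCallback(layers_per_epoch=2)])
```

With these settings only the head is trained in epoch 0, the top two layers are added in epoch 1, and so on.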