Training a domain-specific roberta from roberta-base

Hey there,

I apologize in advance if the question below is simple but I’m new to transformers and I want to make sure I get things right before wasting my GPU time training the “wrong” model.

The goal: I want to train a domain-specific roberta model, building on the pre-trained roberta model, therefore starting from roberta-base’s weights rather than from scratch.

The issue: I first followed your tutorial, before realizing that it does not initialize the weights from roberta-base before training.

My question: What are the correct steps to train a domain-specific model on-top of roberta-base?

  1. Train a ByteLevelBPETokenizer on my data

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model("mymodel")

and use it to preprocess my data. FYI, I have 700,000 sentences stored in a txt file, one sentence per line.

from transformers import LineByLineTextDataset
training_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="data/training.txt", block_size=128)
evalutation_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="data/testing.txt", block_size=128)

  2. Get the roberta config

from transformers import RobertaConfig, RobertaForMaskedLM
config = RobertaConfig(vocab_size=52_000, max_position_embeddings=514,
                       num_attention_heads=12, num_hidden_layers=12, type_vocab_size=1)
config.save_pretrained("mymodel")  # configs are written with save_pretrained, not save_model
model = RobertaForMaskedLM(config=config)

Or should I instead use:

from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM.from_pretrained("roberta-base")

  3. Get the data collator

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
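For readers unsure what mlm_probability=0.15 does, here is a minimal pure-Python sketch of the usual masked-LM recipe, which as far as I know is also what DataCollatorForLanguageModeling follows: each non-special token is selected with probability 0.15; of the selected tokens, roughly 80% become the mask token, 10% become a random token, and 10% stay unchanged. The function name mask_tokens, the toy token ids, and the seed are assumptions made for illustration, not part of the library.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for one sequence using the 80/10/10 recipe."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected
        # with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # ~80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # ~10%: replace with a random token
        # remaining ~10%: leave the input token unchanged
    return inputs, labels


# Hypothetical ids: 0 = <s>, 2 = </s>, 4 = <mask>, 10..109 = ordinary tokens.
toy_ids = [0] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(toy_ids, mask_id=4, vocab_size=200,
                             special_ids={0, 1, 2, 3, 4})
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(num_selected, "of", len(toy_ids), "positions selected for prediction")
```

Only the selected positions contribute to the loss, which is why the labels hold -100 everywhere else.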

  4. Set up training arguments and train

from transformers import Trainer, TrainingArguments, EvalPrediction

What I currently have:

training_args = TrainingArguments(
    output_dir="./mymodel",
    evaluation_strategy="steps",
    prediction_loss_only=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_accumulation_steps=200,
    weight_decay=0.01,
    adam_epsilon=1e-6,
    max_steps=200000,
    warmup_steps=1,
    save_steps=200,
    save_total_limit=5,
    eval_steps=100,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=training_dataset,
    eval_dataset=evalutation_dataset,
)
trainer.train()
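As a rough sanity check on the arguments above, the arithmetic below estimates how many passes over the data max_steps=200000 amounts to, and how often evaluation and checkpointing would run. It assumes that all 700,000 sentences are in the training file and that training runs on a single GPU without gradient accumulation; neither is stated explicitly, so treat the numbers as an estimate.

```python
import math

num_sentences = 700_000            # one sentence per line (assumed to be the training file)
per_device_train_batch_size = 32
max_steps = 200_000
eval_steps = 100
save_steps = 200

# Assumptions: one GPU, no gradient accumulation.
num_devices = 1
gradient_accumulation_steps = 1

examples_per_step = (per_device_train_batch_size
                     * num_devices
                     * gradient_accumulation_steps)
steps_per_epoch = math.ceil(num_sentences / examples_per_step)
epochs_covered = max_steps / steps_per_epoch
num_evaluations = max_steps // eval_steps
num_checkpoints_written = max_steps // save_steps

print(f"steps per epoch:     {steps_per_epoch}")
print(f"epochs covered:      {epochs_covered:.2f}")
print(f"evaluation runs:     {num_evaluations}")
print(f"checkpoints written: {num_checkpoints_written}")
```

Under these assumptions the run covers about nine epochs and evaluates on the full evaluation set 2,000 times, so raising eval_steps and save_steps may save a noticeable amount of GPU time.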

Should I instead use:

python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm

Thanks in advance for any help you can provide!

Hi aberquand,

I don’t think you can use the pre-trained weights with your domain-specific vocabulary.

[I am not an expert, and I’ve only used BERT, not RoBERTa, and I didn’t use the Trainer, so I could be wrong.]

If I understand it correctly, the way the weights learn is dependent on the particular vocabulary.

I suggest you use the pre-trained vocabulary as well as the pre-trained weights.
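A toy illustration of why the weights and the vocabulary belong together: the embedding table is indexed by token id, so a tokenizer that assigns different ids makes the model look up rows that were learned for other tokens. The vocabularies and vectors below are invented for the example and have nothing to do with the real RoBERTa tables.

```python
# Toy "pre-trained" vocabulary and embedding table: row i belongs to token id i.
pretrained_vocab = {"<s>": 0, "the": 1, "cell": 2, "protein": 3}
pretrained_embeddings = [
    [0.0, 0.0],  # learned for "<s>"
    [0.1, 0.9],  # learned for "the"
    [0.8, 0.2],  # learned for "cell"
    [0.7, 0.3],  # learned for "protein"
]

# A tokenizer trained on new data assigns ids in its own order.
new_vocab = {"<s>": 0, "protein": 1, "the": 2, "kinase": 3}


def embed(word, vocab, embeddings):
    """Look up the embedding row for a word through the given vocabulary."""
    return embeddings[vocab[word]]


intended = embed("protein", pretrained_vocab, pretrained_embeddings)
actual = embed("protein", new_vocab, pretrained_embeddings)

# Which pre-trained token did that row really belong to?
row_owner = {idx: tok for tok, idx in pretrained_vocab.items()}
print(f"'protein' under the new vocab reads the row learned for "
      f"'{row_owner[new_vocab['protein']]}'")
print("intended vector:", intended, "| vector actually used:", actual)
```

With a real 50,000-entry vocabulary nearly every id would point at an unrelated row, and every layer above the embeddings was trained on the original mapping.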

How different is your vocabulary from the original RoBERTa vocabulary? I would expect 700,000 sentences to be enough for fine-tuning, but probably not enough to train from scratch.

Did you know that if you use Google Colaboratory you can get a limited amount of GPU time for free? It generally maxes out after about 7 hours each day. Colab uses Jupyter notebooks.

For more information on tokenizer vocabulary, I recommend Chris McCormick’s blogs, e.g. https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

Hi @rgwatwormhill, thanks for the answer!

Indeed, it makes sense that the pre-trained weights need the roberta vocabulary. I was trying to do something similar to BioBERT (https://arxiv.org/ftp/arxiv/papers/1901/1901.08746.pdf), which seems to have used BERT weight initialization and a biomedical domain corpus…

I guess I’ll try both:

  • pre-train a domain-specific model from scratch with my own tokenizer
  • initialize training from roberta-base and continue training on my domain-specific corpus, processed with the roberta tokenizer.

Still wondering how to initialize the weights to roberta’s though :confused:

PS: yes, I’m using GCP credits and a paid Nvidia V100 for the simulation now; the free Colab GPU was not enough :sweat_smile:

The term that you are looking for is fine-tuning (or, in this case, perhaps further pretraining, because the task objective is the same). You can easily do that with the example scripts that are available in the GitHub repository.

As @rgwatwormhill notes, you cannot create a new vocabulary and use that with a pretrained model: that model is trained with a specific vocabulary, and swapping that out for another vocab will lead to unexpected results. If I were you I’d first try fine-tuning the existing model with the script above, and if the performance is not good enough for you, you can still train from scratch. Note that you need a lot of data to get good results (that’s why typically finetuning is preferred).
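To make the difference between the two initializations concrete, here is a small sketch that runs offline: a tiny, randomly initialized RobertaForMaskedLM stands in for a published checkpoint, is saved to a temporary directory, and is then restored with from_pretrained. The tiny configuration values are assumptions chosen only so that it runs in seconds on a CPU.

```python
import tempfile

import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny, hypothetical configuration; the real roberta-base is far larger.
tiny_config = RobertaConfig(
    vocab_size=100,
    hidden_size=16,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=32,
    max_position_embeddings=34,
)


def word_embeddings(model):
    """Return a copy of the input word-embedding matrix."""
    return model.get_input_embeddings().weight.detach().clone()


torch.manual_seed(0)
stand_in = RobertaForMaskedLM(tiny_config)  # plays the role of the published checkpoint

with tempfile.TemporaryDirectory() as checkpoint_dir:
    stand_in.save_pretrained(checkpoint_dir)
    # from_pretrained restores the stored weights ...
    restored = RobertaForMaskedLM.from_pretrained(checkpoint_dir)

torch.manual_seed(1)
# ... whereas building the model from a config alone gives new random weights.
fresh = RobertaForMaskedLM(tiny_config)

same_as_checkpoint = torch.equal(word_embeddings(restored), word_embeddings(stand_in))
same_as_fresh = torch.equal(word_embeddings(fresh), word_embeddings(stand_in))
print("restored matches checkpoint:", same_as_checkpoint)
print("fresh model matches checkpoint:", same_as_fresh)
```

For the actual use case the directory would be replaced by the hub name, i.e. RobertaForMaskedLM.from_pretrained("roberta-base"), together with the matching tokenizer loaded the same way (for example RobertaTokenizerFast.from_pretrained("roberta-base")); the example script does both when given --model_name_or_path roberta-base.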

hello again,

I’ve had a look at the BioBERT article. Fig 1 does look as though they (pre-) train with medical texts using BERT base weights, but Section 3.3 clarifies that actually they used the standard BERT base vocabulary and that the training of BioBERT effectively started with the standard Wiki+Books data and then was followed by the medical texts. So, the BioBERT article is all about fine-tuning a pre-trained BERT model.

There can be as many training steps as you care to imagine when doing NLP with BERT. So far as I know, people only use two terms for it: pre-training and fine-tuning. We could really do with a term for the intermediate steps. I call it intermediate-training.

Pre-training: that which somebody else has already done.
Intermediate training: further modification, for example using language modelling with domain-specific texts.
Fine-tuning: training to the task of interest.

The Intermediate stages can also include other steps such as Distilling or Pruning, and might include Freezing some or all of the layers.

Note that the BioBERT article says “pre-trained on biomedical corpora”, but I would classify that as Intermediate training. I suppose, if you pick up their finished model, you would then consider both the Wiki+Books and the medical corpus as pre-training.

If anyone knows a standard classification of what counts as “pre-training” I’d like to know. @BramVanroy maybe?

Agreed, BioBERT does in fact use the BERT-base vocabulary.
I’m also interested in a better classification of what “pre-training” encompasses; perhaps we could use “post-training” or “further training”, as suggested by @BramVanroy.

Another relevant work for anyone interested in comparing the effect of using the BERT-base vocab versus your own vocab: SciBERT (https://arxiv.org/pdf/1903.10676.pdf) further trained a BERT Base model with scientific data, using either BERT’s vocab or their own SciVocab.

But based on your previous comments and the size of my training set, I’ll start by building on top of roberta-base. Thanks a lot for your help @rgwatwormhill and @BramVanroy, finally moving forward! :partying_face:

Hi, I am trying something similar to the “further training” mentioned above, by running the run_mlm.py script in Google Colab from a checkpoint.

cmd = '''python run_mlm.py
--output_dir {0}
--model_name_or_path roberta-base
--mlm_probability 0.15 \
--train_file {1}
--validation_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--do_train
--line_by_line
--overwrite_output_dir
--do_eval
--learning_rate 1e-4 \
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_device_eval_batch_size 32
--per_device_train_batch_size 32
--seed 42
'''.format(weights_dir,train_path,eval_path)

Run:
!{cmd}

Problem is, I have the error:

Traceback (most recent call last):
  File "run_mlm.py", line 386, in <module>
    main()
  File "run_mlm.py", line 154, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.6/dist-packages/transformers/hf_argparser.py", line 135, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 12, in __init__
  File "run_mlm.py", line 133, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
/bin/bash: line 1: --train_file: command not found
/bin/bash: line 2: --num_train_epochs: command not found

I have no idea whether the error comes from some bash configuration I did not set on Google Colab, or whether the Python script was written wrongly. I’d appreciate it if anyone who has encountered the same issue could tell me what is going on!

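Hi @Max, I think the source of your error is that you’re missing some line-continuation characters \ in your cmd string. Without them, bash treats each new line as a separate command, which is why it reports "--train_file: command not found". In other words, put a backslash at the end of each line in the script command: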

python run_mlm.py \
--output_dir {0} \
--model_name_or_path roberta-base \
...

There might also be something quirky happening with the string formatting, so I’d check that it makes sense by printing out the cmd string before running it.
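One way to sidestep the backslash issue entirely is to build the command from a list of arguments and join it into a single line, so there are no line breaks for bash to trip over. This is only a sketch: the helper name build_run_mlm_command and the /content/... paths are made up for illustration, and only a subset of the flags from the post is included.

```python
import shlex


def build_run_mlm_command(output_dir, train_file, validation_file):
    """Return a single-line shell command for run_mlm.py."""
    args = [
        "python", "run_mlm.py",
        "--output_dir", output_dir,
        "--model_name_or_path", "roberta-base",
        "--train_file", train_file,
        "--validation_file", validation_file,
        "--do_train",
        "--do_eval",
        "--line_by_line",
    ]
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical paths, standing in for weights_dir, train_path and eval_path.
cmd = build_run_mlm_command(
    "/content/weights",
    "/content/data/train.txt",
    "/content/data/eval.txt",
)
print(cmd)  # inspect the command before running it
```

In a Colab cell the resulting string can then be run with !{cmd} as before.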