Hey there,
I apologize in advance if the question below is simple but I’m new to transformers and I want to make sure I get things right before wasting my GPU time training the “wrong” model.
The goal: I want to train a domain-specific RoBERTa model that builds on the pre-trained model, i.e. starting from roberta-base’s weights rather than from scratch.
The issue: I first followed your tutorial, before realizing that the weights were not initialized from roberta-base’s before training.
My question: What are the correct steps to train a domain-specific model on top of roberta-base?
- Train a ByteLevelBPETokenizer on my data
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model("mymodel")
and use it to preprocess my data; FYI I have 700,000 sentences stored in a txt file, one sentence per line. (I assume I also need to reload what I just saved as a transformers tokenizer so that the dataset class accepts it.)
from transformers import RobertaTokenizerFast, LineByLineTextDataset

# reload the trained tokenizer in the transformers format
tokenizer = RobertaTokenizerFast.from_pretrained("mymodel")
training_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="data/training.txt", block_size=128)
evaluation_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="data/testing.txt", block_size=128)
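As a sanity check, I’m also printing how the new tokenizer splits a sample sentence (my own addition, not from the tutorial):
# quick check that domain-specific words aren't split into too many pieces
print(tokenizer("an example sentence from my domain").tokens())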
- Get the roberta config
from transformers import RobertaConfig, RobertaForMaskedLM

# vocab_size matches the tokenizer trained above
config = RobertaConfig(vocab_size=50_000, max_position_embeddings=514,
                       num_attention_heads=12, num_hidden_layers=12, type_vocab_size=1)
config.save_pretrained("mymodel")
model = RobertaForMaskedLM(config=config)
Or should I instead use:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
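If from_pretrained is the right route, I assume I’d also need to resize the embeddings to match my new tokenizer’s vocabulary, something like:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
# my tokenizer's vocabulary differs from roberta-base's 50,265 tokens,
# so the input embeddings and LM head have to be resized to match it
model.resize_token_embeddings(len(tokenizer))
(Though I’m not sure how much of the pre-trained embedding knowledge survives this, since the ids from my new tokenizer no longer line up with the original vocabulary, which is partly why I’m asking.)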
- Get the data collator
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
- Set up training arguments and train
from transformers import Trainer, TrainingArguments, EvalPrediction
What I currently have:
training_args = TrainingArguments(
    output_dir="./mymodel",
    evaluation_strategy="steps",
    prediction_loss_only=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_accumulation_steps=200,
    weight_decay=0.01,
    adam_epsilon=1e-6,
    max_steps=200000,
    warmup_steps=1,
    save_steps=200,
    save_total_limit=5,
    eval_steps=100,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=training_dataset,
    eval_dataset=evaluation_dataset,
)
trainer.train()
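Once training finishes, I assume I’d save everything back to the same directory so it can be reloaded later:
trainer.save_model("mymodel")         # writes the model weights and config
tokenizer.save_pretrained("mymodel")  # writes the tokenizer files alongside them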
Should I instead use:
python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
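If run_mlm.py is the better option, I assume I’d also need to pass --tokenizer_name mymodel, --line_by_line and --max_seq_length 128 so that it picks up my new tokenizer and the one-sentence-per-line files, but please correct me if those aren’t the right flags.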
Thanks in advance for any help you can provide!