Hello,
I came across the "How to train a new language model from scratch using Transformers and Tokenizers" blog post and wanted to run 1 epoch of pretraining on an already pretrained model.
My goal is to use the resulting model later for semantic search.
My code (adapted) was:
```python
from transformers import AutoTokenizer
from transformers import AutoModelForPreTraining
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

model = AutoModelForPreTraining.from_pretrained('neuralmind/bert-large-portuguese-cased')
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased', do_lower_case=False)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="dataForModel.txt",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./newModel",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
```
I receive the error:

```
TypeError: forward() got an unexpected keyword argument 'labels'
```
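For context, the error itself is plain Python: `forward()` is being called with a keyword argument its signature does not declare. A minimal sketch of the mechanism, with a hypothetical `forward` function standing in for the model's method (no transformers involved):

```python
def forward(input_ids=None, attention_mask=None):
    # This signature declares no 'labels' parameter and no **kwargs,
    # so passing labels=... raises a TypeError.
    return input_ids

# The data collator hands the Trainer a batch dict that includes 'labels'.
batch = {"input_ids": [1, 2, 3], "labels": [1, 2, 3]}

try:
    forward(**batch)
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'labels'
```

So the batch built by `DataCollatorForLanguageModeling` seems to contain a `labels` key that this model's `forward()` does not accept.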
Any help understanding what I am doing wrong would be appreciated.
Thanks in advance
Rui