Fine-tuning BERT on the next sentence prediction task

I am trying to fine-tune BERT with the Hugging Face library on the next sentence prediction task. I looked at the tutorial and am trying to use DataCollatorForNextSentencePrediction and TextDatasetForNextSentencePrediction. When I do, I get the following error (see the pastebin link): https://pastebin.pl/view/bde2c3d4. I have provided my code below.

============Code================
from transformers import (
    BertTokenizer,
    BertForNextSentencePrediction,
    DataCollatorForNextSentencePrediction,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

def train(bert_model, bert_tokenizer, path, eval_path=None):
    out_dir = "/content/drive/My Drive/next_sentence/"
    training_args = TrainingArguments(
        output_dir=out_dir,
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=30,
        save_steps=10000,
        save_total_limit=2,
    )

    data_collator = DataCollatorForNextSentencePrediction(
        tokenizer=bert_tokenizer, mlm=False, block_size=512, nsp_probability=0.5
    )

    dataset = TextDatasetForNextSentencePrediction(
        tokenizer=bert_tokenizer,
        file_path=path,
        block_size=512,
    )

    trainer = Trainer(
        model=bert_model,
        args=training_args,
        train_dataset=dataset,
        data_collator=data_collator,
    )
    trainer.train()
    trainer.save_model(out_dir)

def main():
    print("Running main")
    bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    bert_model = BertForNextSentencePrediction.from_pretrained("bert-base-cased")
    train_data_set_path = "/content/drive/My Drive/next_sentence/line_data_set_file.txt"
    train(bert_model, bert_tokenizer, train_data_set_path)
    # prepare_data_set(bert_tokenizer)

main()

Can you fix the formatting in your post? It would make it easier to read :slightly_smiling_face:

I added the error log to a pastebin link. Might be easier to read it that way :smiley:

Hey @zenonas009 did you make any progress on the training?

Hello @vblagoje. Yes, I found a couple of problems and managed to fix them.

  1. I was not creating the text file in the correct format, so the dataset class couldn't parse it and build the examples. That was causing the error in the pastebin link.

  2. The data collator returns a "masked_lm_labels" key, which I don't see the relevance of when doing NSP. I removed that key and everything seems to be working fine.

  3. I also noticed that the data collator returns "next_sentence_label" instead of "labels". That was a problem for the Trainer class, so I renamed "next_sentence_label" to "labels".
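For points 2 and 3, instead of editing the library source, a minimal sketch of the same workaround as a wrapper around the collator (the names `fix_nsp_batch` and `NSPCollatorWrapper` are my own, not from the library):

```python
def fix_nsp_batch(batch):
    """Post-process a collated batch for pure NSP training.

    Drops the "masked_lm_labels" key (not needed when only doing NSP)
    and renames "next_sentence_label" to "labels", which is the key
    the Trainer expects.
    """
    batch.pop("masked_lm_labels", None)
    if "next_sentence_label" in batch:
        batch["labels"] = batch.pop("next_sentence_label")
    return batch


class NSPCollatorWrapper:
    """Wraps an existing data collator and fixes its output keys."""

    def __init__(self, collator):
        self.collator = collator

    def __call__(self, examples):
        return fix_nsp_batch(self.collator(examples))
```

You could then pass `NSPCollatorWrapper(data_collator)` to the Trainer instead of `data_collator` itself, which keeps the library code untouched.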

I just looked at the source code, copied it locally, and made the changes there.

I am now getting the progress bars when training starts, and I am waiting for training to finish.
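For reference, on point 1: a sketch of the input layout that worked for me, assuming the dataset class expects one sentence per line with a blank line separating documents (this layout is an assumption based on my reading of the source, so double-check against your version):

```python
# Hypothetical example file for TextDatasetForNextSentencePrediction:
# one sentence per line, blank line between documents.
sample = (
    "The first sentence of document one.\n"
    "The second sentence of document one.\n"
    "\n"
    "Document two starts here.\n"
    "It continues with another sentence.\n"
)

with open("line_data_set_file.txt", "w", encoding="utf-8") as f:
    f.write(sample)
```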

How did you prepare a 17 GB dataset and then continually feed it to the model? That's where I got stuck.