I am preparing some data for BERT fine-tuning for Next Sentence Prediction.
In my understanding, the way to go is:
- Pass the data through the DataCollatorForLanguageModeling (as DataCollatorForNextSentencePrediction has been removed)
- Pass both the dataset and the data collator to the Trainer
However, there seems to be a bug with the documents attribute of the TextDatasetForNextSentencePrediction class. On some runs the attribute is present, while on others I get:
AttributeError: 'TextDatasetForNextSentencePrediction' object has no attribute 'documents'
I also see a warning saying that this code will soon be replaced. I was thus wondering whether this is still the pipeline you suggest, or whether I should go another way.
Below are sample data and minimal code to reproduce the error.
from transformers.data.datasets.language_modeling import TextDatasetForNextSentencePrediction
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
train_dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=512,
)
docs = train_dataset.documents
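For context, here is a minimal sketch of what I would expect documents to contain. It assumes the input file has one sentence per line with blank lines separating documents, which is my understanding of the format the class expects. The helper name read_documents is hypothetical and not part of transformers, and it keeps sentences as plain strings, whereas the real class would hold tokenized sentences.

```python
from typing import Iterable, List


def read_documents(lines: Iterable[str]) -> List[List[str]]:
    """Group non-empty lines into documents, splitting on blank lines.

    Hypothetical helper for illustration only; not part of transformers.
    """
    documents: List[List[str]] = [[]]
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current document; repeated blanks are ignored.
            if documents[-1]:
                documents.append([])
            continue
        documents[-1].append(line)
    # Drop the trailing empty document left by a final blank line or empty input.
    if not documents[-1]:
        documents.pop()
    return documents


# Assumed sample in the one-sentence-per-line format (not my actual test.txt).
sample = [
    "First sentence of document one.\n",
    "Second sentence of document one.\n",
    "\n",
    "Only sentence of document two.\n",
]
docs = read_documents(sample)
print(len(docs))
```

With a real file, the same helper could be called as read_documents(open("test.txt", encoding="utf-8")).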