Hello,
I am preparing some data for BERT fine-tuning for Next Sentence Prediction.
In my understanding, the way to go is:

- Pass the data through the `TextDatasetForNextSentencePrediction` class
- Instantiate `DataCollatorForLanguageModeling` (as `DataCollatorForNextSentencePrediction` has been removed)
- Pass both to the `Trainer`
However, there seems to be a bug with the `documents` attribute of the `TextDatasetForNextSentencePrediction` class: for some runs the attribute is produced, while for others I get `AttributeError: 'TextDatasetForNextSentencePrediction' object has no attribute 'documents'`, reporting that the attribute is missing.
I also see a deprecation warning saying that this code will soon be removed, so I was wondering whether this is still the pipeline you suggest or whether I should go another way.
Below are sample data and minimal code to reproduce the error.
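The sample data: `test.txt` follows the format `TextDatasetForNextSentencePrediction` expects, i.e. one sentence per line with a blank line separating documents (the sentences themselves are just placeholders):

```python
# Placeholder corpus for test.txt: one sentence per line,
# documents separated by a single blank line.
sample = (
    "The quick brown fox jumps over the lazy dog.\n"
    "It then disappears into the forest.\n"
    "\n"
    "BERT is a bidirectional transformer encoder.\n"
    "It was pretrained with masked language modeling and next sentence prediction.\n"
)

with open("test.txt", "w", encoding="utf-8") as f:
    f.write(sample)
```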
```python
from transformers import BertTokenizerFast
from transformers.data.datasets.language_modeling import TextDatasetForNextSentencePrediction

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

train_dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=512,
)

docs = train_dataset.documents  # AttributeError on some runs
```