AttributeError for TextDatasetForNextSentencePrediction: no attribute 'documents'

llaurabat · July 26, 2021, 10:46am

Hello,

I am preparing some data for BERT fine-tuning for Next Sentence Prediction.
In my understanding, the way to go is:

1. Pass the data through the TextDatasetForNextSentencePrediction class
2. Instantiate DataCollatorForLanguageModeling (as DataCollatorForNextSentencePrediction has been removed)
3. Pass both to the Trainer

However, there seems to be a bug with the documents attribute of the TextDatasetForNextSentencePrediction class. On some runs the attribute is present, but on others I get AttributeError: 'TextDatasetForNextSentencePrediction' object has no attribute 'documents'.
I also see a warning saying that this code will soon be replaced, so I was wondering whether this is still the pipeline you recommend or whether I should take a different approach.
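
For illustration only, here is a minimal sketch of one way an attribute can exist on some runs and be missing on others. It assumes, without confirmation from this post, that the dataset class caches its processed examples on disk and only builds intermediate attributes when the cache is absent. CachedDataset, its ".cache" file, and run_demo are hypothetical stand-ins written with the standard library; they are not the transformers implementation.

```python
import os
import pickle
import tempfile


class CachedDataset:
    """Hypothetical stand-in for a dataset that caches its examples on disk."""

    def __init__(self, file_path, overwrite_cache=False):
        cache_path = file_path + ".cache"  # assumed cache location
        if os.path.exists(cache_path) and not overwrite_cache:
            # Cache hit: only the finished examples are restored,
            # so self.documents is never created on this path.
            with open(cache_path, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            # Cache miss: the intermediate attribute is built here only.
            with open(file_path, encoding="utf-8") as handle:
                self.documents = [line.strip() for line in handle if line.strip()]
            self.examples = [doc.upper() for doc in self.documents]
            with open(cache_path, "wb") as handle:
                pickle.dump(self.examples, handle)


def run_demo():
    """Report whether 'documents' exists on a first, second, and rebuilt load."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("First sentence.\nSecond sentence.\n")
        first = CachedDataset(path)                         # builds the cache
        second = CachedDataset(path)                        # loads from the cache
        rebuilt = CachedDataset(path, overwrite_cache=True)  # ignores the cache
        return (
            hasattr(first, "documents"),
            hasattr(second, "documents"),
            hasattr(rebuilt, "documents"),
        )


if __name__ == "__main__":
    print(run_demo())
```

If the real class behaves in a similar way, forcing a rebuild of the cache or reading the attribute with getattr(train_dataset, "documents", None) might avoid the error, but that would need to be checked against the installed library version.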

Below are sample data and minimal code to reproduce the error.

[Screenshot 2021-07-26 at 12.38.50: sample data]

from transformers import BertTokenizerFast
from transformers.data.datasets.language_modeling import TextDatasetForNextSentencePrediction

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Build the NSP dataset from the sample file.
train_dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="test.txt",
    block_size=512,
)

# This access raises AttributeError on some runs and works on others.
docs = train_dataset.documents
