I am fine-tuning BERT on my domain using both MLM and NSP. I have used TextDatasetForNextSentencePrediction to build the dataset, together with DataCollatorForLanguageModeling for MLM.
The problem is that TextDatasetForNextSentencePrediction loads the entire corpus into memory. So I tweaked it a bit, and now it generates examples on the fly rather than storing everything in memory.
However, this creates a problem with max_steps for training. Now that the dataset is generative, __len__ cannot be implemented on my version of TextDatasetForNextSentencePrediction, and I am getting the corresponding error:
ValueError: train_dataset does not implement __len__, max_steps has to be specified
Can I just specify an upper bound for max_steps and the Trainer will figure the rest out automatically? Or will that break something? Because if it does, I can't run multiple epochs, right?
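One workaround I am considering, to make the question concrete: if I count the corpus once up front (a cheap line count, without loading it), I could compute max_steps myself so the step budget still corresponds to whole epochs. A sketch with made-up numbers (corpus size, batch size, and epoch count below are all hypothetical):

```python
# All numbers are hypothetical, just to show the arithmetic.
num_examples = 1_000_000          # NSP pairs in the corpus, counted once up front
per_device_batch_size = 32
gradient_accumulation_steps = 1
num_epochs = 3

effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = -(-num_examples // effective_batch)   # ceiling division
max_steps = steps_per_epoch * num_epochs                # value to pass to TrainingArguments
print(max_steps)
```

Would passing a max_steps computed like this give me the equivalent of num_epochs passes, or does the Trainer behave differently once the dataset has no __len__?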
I realize there are a lot of X/Y questions here and that this is rather open-ended, but any sort of advice or direction is much appreciated.
Thanks