I am fine-tuning BERT on my domain using both MLM and NSP. I have used TextDatasetForNextSentencePrediction to build the dataset, together with DataCollatorForLanguageModeling for MLM.
The problem is that TextDatasetForNextSentencePrediction loads the entire corpus into memory. So I tweaked it a bit, and now it generates examples on the fly rather than storing everything in memory.
However, this creates a problem with max_steps for training. Now that the dataset is generative, __len__ cannot be implemented on my version of TextDatasetForNextSentencePrediction, and I am getting the corresponding error:
ValueError: train_dataset does not implement __len__, max_steps has to be specified
Can I just specify an upper bound for max_steps and the Trainer will figure the rest out automatically? Or will that break something? Because if it does, I can't run multiple epochs, right?
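One workaround I am considering, to make the question concrete: if I count the corpus once up front (a cheap line count, without loading it), I could compute max_steps myself so the step budget still corresponds to whole epochs. A sketch with made-up numbers (corpus size, batch size, and epoch count below are all hypothetical):

```python
# All numbers are hypothetical, just to show the arithmetic.
num_examples = 1_000_000          # NSP pairs in the corpus, counted once up front
per_device_batch_size = 32
gradient_accumulation_steps = 1
num_epochs = 3

effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = -(-num_examples // effective_batch)   # ceiling division
max_steps = steps_per_epoch * num_epochs                # value to pass to TrainingArguments
print(max_steps)
```

Would passing a max_steps computed like this give me the equivalent of num_epochs passes, or does the Trainer behave differently once the dataset has no __len__?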
I realize there are a lot of X/Y questions here and that this is rather open-ended, but any sort of advice or direction is much appreciated.
Thanks