max_steps and a generative dataset

I am fine-tuning BERT for my domain using both MLM and NSP. I have used TextDatasetForNextSentencePrediction to build the dataset, along with DataCollatorForLanguageModeling for MLM.

The problem is that TextDatasetForNextSentencePrediction loads the entire dataset into memory. So I tweaked it a bit, and now it generates examples on the fly rather than storing everything in memory.
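For anyone wondering what "generates the data" means here, below is a minimal sketch of the idea (not the actual tweaked class): stream sentence pairs for NSP from an iterable of lines, keeping only a small rolling buffer to draw "not next" negatives from, instead of materializing every pair up front. The function name, buffer strategy, and blank-line-as-document-boundary convention are my own assumptions; tokenization and masking are left to the tokenizer and DataCollatorForLanguageModeling.

```python
import random
from collections import deque

def nsp_pair_stream(lines, buffer_size=1000, seed=0):
    """Lazily yield (sentence_a, sentence_b, label) pairs for NSP.

    label follows the BertForPreTraining convention:
    0 = sentence_b really follows sentence_a, 1 = random sentence.
    Only a rolling buffer of recent sentences is kept in memory,
    used as the pool of negative ("not next") candidates.
    """
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)  # negative-sample candidates
    prev = None
    for line in lines:
        sent = line.strip()
        if not sent:
            prev = None  # blank line = document boundary, no pair across it
            continue
        if prev is not None:
            if buffer and rng.random() < 0.5:
                # negative pair: second sentence drawn from the buffer
                yield prev, rng.choice(buffer), 1
            else:
                # positive pair: the actual next sentence
                yield prev, sent, 0
        buffer.append(sent)
        prev = sent
```

Wrapping this generator in a torch.utils.data.IterableDataset (whose __iter__ tokenizes each pair) is what removes the need to hold the corpus in memory, and also what removes __len__.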

However, this creates a problem with max_steps for training. Now that the dataset is generator-based, __len__ can no longer be implemented on TextDatasetForNextSentencePrediction, and I am getting an error about exactly that:

ValueError: train_dataset does not implement __len__, max_steps has to be specified

Can I just specify an upper bound and it will figure the rest out automatically? Or will it break? Because if it breaks, I can't run multiple epochs, right?
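One workaround, if the corpus can be counted in a single cheap pass (even though it cannot be held in memory), is to compute max_steps yourself so that it covers the number of epochs you want. A back-of-the-envelope sketch (the function name and parameters are mine, not from transformers):

```python
def estimate_max_steps(num_examples, per_device_batch_size,
                       num_devices, gradient_accumulation_steps,
                       num_epochs):
    """Compute a max_steps value that covers `num_epochs` passes
    over a streamed dataset of `num_examples` examples.

    One optimizer step consumes
    per_device_batch_size * num_devices * gradient_accumulation_steps
    examples.
    """
    effective_batch = (per_device_batch_size * num_devices
                       * gradient_accumulation_steps)
    steps_per_epoch = -(-num_examples // effective_batch)  # ceil division
    return steps_per_epoch * num_epochs
```

For example, 1000 examples with a per-device batch of 8, one GPU, accumulation of 4, and 3 epochs gives ceil(1000/32) * 3 = 96 steps, which you would then pass as max_steps in TrainingArguments.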

I realize there is a bit of an XY problem buried in here and that the question is rather open-ended, but any sort of advice or direction is much appreciated.