How to fine-tune a model from the Hub on a large dataset?

I’m trying to fine-tune this Sentence Transformers model from the Hub on the Portuguese subset of this dataset.

The dataset is fairly large (1 million triplets) and I’m running into memory issues in Google Colab. What is the best alternative here?

So far I can think of these three, but I’m not sure which is the best:

  1. Fine-tune using a streaming dataset. Is this possible?
  2. Fine-tune using a smaller subset of the dataset. But I’m running into the same memory issues using even just 1% of the dataset…
  3. Pay for Google Colab Pro. But I’m not sure this will be enough.

Any suggestions?

Regarding 1, I didn’t try it myself, but from reading the load_dataset documentation it does seem like streaming a dataset is possible by passing streaming=True.
Maybe you can read about it and give it a try.
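
Something along these lines should work for the loading part — note the dataset ID and config name below are just placeholders, since I don’t know the exact ones for your dataset:

```python
from datasets import load_dataset

# Placeholder dataset ID and config — swap in the actual Hub dataset
# and the name of its Portuguese subset.
streamed = load_dataset(
    "your-username/your-triplet-dataset",
    "pt",
    split="train",
    streaming=True,
)

# Rows are yielded lazily, so nothing is loaded into memory up front
# beyond what you actually iterate over.
for i, example in enumerate(streamed):
    print(example)
    if i == 2:
        break
```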

Yes, it is possible to read a dataset in streaming mode; what I don’t know is whether it’s possible to fine-tune a model using a streaming dataset.
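
If it is, my guess would be something like the sketch below: the classic fit() API just takes a plain PyTorch DataLoader, so wrapping the streamed split in an IterableDataset that yields InputExample triplets might work. This is untested, and the model ID, dataset ID, and column names are placeholders:

```python
import torch
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader, IterableDataset


class StreamingTriplets(IterableDataset):
    """Wraps a streaming Hub dataset and yields InputExample triplets."""

    def __init__(self, hf_iterable):
        self.hf_iterable = hf_iterable

    def __iter__(self):
        for row in self.hf_iterable:
            # Column names are a guess — adjust to the real triplet fields.
            yield InputExample(texts=[row["anchor"], row["positive"], row["negative"]])


# Placeholder model and dataset IDs — use the ones you linked.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
stream = load_dataset("your-username/your-triplet-dataset", "pt", split="train", streaming=True)

train_dataloader = DataLoader(StreamingTriplets(stream), batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# An iterable dataset has no length, so steps_per_epoch has to be set explicitly.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    steps_per_epoch=1000,
    warmup_steps=100,
)
```

If someone has actually tried this with a streamed dataset, please correct me.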