I’m trying to fine-tune this Sentence Transformers model from the Hub on the Portuguese subset of this dataset.
The dataset is fairly large (1 million triplets) and I’m running into out-of-memory errors in Google Colab. What’s the best way around this?
So far I can think of three options, but I’m not sure which is best:
- Fine-tune using a streaming dataset. Is this possible?
- Fine-tune on a smaller subset of the dataset. But I run into the same memory issues with even just 1% of the data…
- Pay for Google Colab Pro. But I’m not sure the extra memory will be enough.
Any suggestions?