Cost-effective Cloud Environments for Training

Hi there,

First off, thank you so much to everybody who has asked questions on these forums, answered those questions, and, of course, to everybody at Hugging Face who has made machine learning accessible to folks like me who are trying to jump into ML. The libraries, the example datasets, the ability to download models so easily, the YouTube channel, the answers to questions on GitHub: all of it is just invaluable.

Anyways, I am feeling pretty good about the training code setup we have now for a very exciting machine learning task: creating a dental chatbot for dentists and dental professionals using 1.5 million training inputs comprising forum discussions and podcast transcripts.

My question: what is the best cloud environment?

Obviously this is not a simple question, as there are a lot of factors to consider, so let me narrow it down a bit.

We are currently using an Azure NC24s v3 instance, and it will take us about 2 weeks to train using this configuration.

I am starting to compare prices to SageMaker and other options, but I wanted to see if maybe there was some obscure environment or something I am not aware of that we should consider.

I am also considering testing an Azure ND96asr v4 instance: even though it is more expensive, it appears to be significantly more powerful, so it could potentially train the LLM enough faster that the speedup outpaces the extra hourly cost, giving us more bang for our buck overall.
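To make that concrete, the number that matters is total cost rather than hourly price: a pricier instance comes out ahead whenever its speedup factor is larger than its price ratio. Here is a minimal sketch of that break-even math; the hourly prices and speedup factors are placeholders, not real Azure quotes or benchmark results:

```python
# Break-even sketch: a pricier instance is cheaper overall whenever its
# speedup factor exceeds its hourly price ratio.
# All numbers below are placeholders, not actual Azure pricing.

baseline_price_per_hr = 12.0     # current NC24s v3-class instance (placeholder)
bigger_price_per_hr = 27.0       # ND96asr v4-class instance (placeholder)
baseline_train_hours = 14 * 24   # ~2 weeks on the current setup

baseline_cost = baseline_price_per_hr * baseline_train_hours
print(f"baseline: ${baseline_cost:,.0f}")

for speedup in (2, 4, 6, 8):     # hypothetical speedups on the bigger instance
    bigger_cost = bigger_price_per_hr * baseline_train_hours / speedup
    print(f"speedup {speedup}x: ${bigger_cost:,.0f} "
          f"({'cheaper' if bigger_cost < baseline_cost else 'pricier'} than baseline)")
```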

(Also, I am trying to figure out what Azure's ML Studio even is: whether it is just an interface, or whether it provides VM access as well.)

Some fun findings today!

Here is a comparison of the prices and specs between Azure VMs and AWS SageMaker:

But more importantly, here is a breakdown of all the instance types with 64+ GiB of RAM, sorted by the hourly price per GiB of GPU RAM:
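For anyone who wants to reproduce this kind of ranking, here is a small sketch of the calculation. The hourly prices in it are made-up placeholders, and the GPU RAM sizes are just ballpark published figures for those SKUs, so double-check everything before relying on it:

```python
# Rank GPU VM types by hourly price per GiB of GPU RAM.
# Prices below are made-up placeholders; GPU RAM sizes are ballpark figures.

instances = {
    # name: (hourly_price_usd, gpu_ram_gib)
    "NC24s v3":    (12.0, 64),    # 4x V100 16 GB
    "ND96asr v4":  (27.0, 320),   # 8x A100 40 GB
    "ND96amsr v4": (32.0, 640),   # 8x A100 80 GB
}

ranked = sorted(instances.items(), key=lambda kv: kv[1][0] / kv[1][1])

for name, (price, ram) in ranked:
    print(f"{name}: ${price / ram:.3f} per GiB-hour ({ram} GiB GPU RAM at ${price}/hr)")
```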

I just thought this was neat: the ND96amsr has 10x as much GPU RAM as the NC24s, but is "only" 2.6x as expensive per hour, so it should be cheaper to run on a large dataset in the long run, given that training should hopefully take significantly less time.

It also changes what sequence lengths we can afford. We have been keeping our training inputs at about 1,000 tokens, because training time scales roughly quadratically with sequence length rather than linearly (the self-attention computation grows with the square of the number of tokens), so bumping up to 2,000 tokens would take almost four times as long. With a more powerful machine, we can hopefully raise the training input length from 1,000 to 2,000 tokens while still staying in our comfort zone for cost and time.
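Here is a quick back-of-the-envelope version of that sequence-length argument, under the simplifying assumption that per-step time is dominated by the quadratic attention term (real training also has components that scale linearly with length, so the true factor lands somewhat below 4x):

```python
# Rough estimate of how per-step training time grows with sequence length,
# assuming the quadratic attention term dominates (a simplification).

base_len = 1_000
new_len = 2_000

quadratic_factor = (new_len / base_len) ** 2   # attention cost ~ tokens^2
print(f"~{quadratic_factor:.1f}x time per step at {new_len} tokens vs {base_len}")

# With a mixed cost model, part linear and part quadratic (weights are guesses):
linear_share, quadratic_share = 0.4, 0.6
mixed = linear_share * (new_len / base_len) + quadratic_share * quadratic_factor
print(f"~{mixed:.1f}x with a mixed linear/quadratic cost model")
```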

Anyways, just wanted to share because I thought these comparisons were neat! I am really curious to run an experiment to see whether the instance with 10x the GPU RAM will actually train 10x as fast, or somewhat less than that because of synchronization overhead across more GPUs.
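If it helps anyone planning a similar test, this is the kind of quick throughput benchmark I have in mind: run a fixed number of training steps on each instance and compare tokens per second. The single transformer layer below is a dummy stand-in, not our actual chatbot model or data:

```python
import time
import torch
import torch.nn as nn

# Dummy stand-in: a single transformer encoder layer, not our actual chatbot model.
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).to(device)
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

batch_size, seq_len, n_steps = 8, 1_000, 50
tokens_per_step = batch_size * seq_len

def train_step() -> None:
    x = torch.randn(batch_size, seq_len, 512, device=device)
    optimizer.zero_grad()
    out = layer(x)
    # Toy objective, just to force a full forward/backward pass.
    loss = out.pow(2).mean()
    loss.backward()
    optimizer.step()

train_step()  # warm-up so one-time CUDA setup doesn't skew the timing
if device == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(n_steps):
    train_step()
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"~{n_steps * tokens_per_step / elapsed:,.0f} tokens/sec on {device}")
```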

Let me know if you think there are better options we should consider!