How to train GPT-2 with Colab Pro

We have a 35 GB Turkish-language dataset. We’ve preprocessed and cleaned the whole text, but we have no GPU and no powerful machine, so we’re hoping Colab Pro can make it happen.
We want to pre-train BERT, Longformer, BigBird, and GPT-2.
When we start tokenizing the text for training, Colab crashes. We are looking for a complete guide to training these models with checkpoints, in a way that works on Colab Pro.
No Google Cloud suggestions, please. It costs more than $7k, and we are just independent student researchers with pocket money.
So please guide us. We will share the models via Hugging Face, so it is a win-win situation.
Kind regards


Yes, I have been stuck on this issue for a long time and I still haven’t been able to solve it. I’m open to any ideas that might help.

That’s a great topic! I’ve been trying to figure out how to do this for weeks, and unfortunately I haven’t come across many useful resources. So I would be glad for any suggestions, even basic ones!

Oh, very expensive indeed! This is a topic that I am very curious about as well. Can it be solved more cost-effectively?

You can use the TPUs provided by Colab Pro. They run for a sizeable amount of time if you prod the session every few hours or so and store checkpoints.
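A minimal sketch of the "store checkpoints" part, assuming checkpoints are saved in folders named `checkpoint-<step>` (the naming the transformers `Trainer` uses) inside an output directory that survives the session, such as a mounted Drive folder. The helper name `find_latest_checkpoint` is made up for illustration; transformers ships a similar utility, `get_last_checkpoint`, in `transformers.trainer_utils`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches folder names of the form "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None.

    Hypothetical helper: it only inspects folder names and does not
    verify that the checkpoint files inside are complete.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, child
    return str(best_path) if best_path is not None else None
```

With the `Trainer`, the result could be passed as `trainer.train(resume_from_checkpoint=...)` at the start of every session, so a disconnected run picks up where the last saved step left off; passing `None` starts from scratch.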

The most problematic aspect would be the model size, so you would have to implement model parallelism, trading off training time, in order to achieve good accuracy.
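To get a feel for why model size is the sticking point, here is a back-of-the-envelope estimate. It assumes plain fp32 training with Adam, which needs roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for the two optimizer moments), and it ignores activations entirely. The parameter counts are the approximate published GPT-2 sizes; treat the results as a lower bound, not a measurement.

```python
def training_memory_gib(n_params: int, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients, and Adam state (no activations)."""
    return n_params * bytes_per_param / 1024**3


# Approximate parameter counts for the smallest and largest GPT-2 variants.
GPT2_SIZES = {"gpt2 (small)": 124_000_000, "gpt2-xl": 1_500_000_000}

for name, n_params in GPT2_SIZES.items():
    print(f"{name}: ~{training_memory_gib(n_params):.1f} GiB before activations")
```

The small model fits comfortably on a single accelerator, while the largest one needs tens of GiB for state alone, which is where splitting the model across devices comes in.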

If you want good accuracy plus almost free compute, check out TFRC (TPU Research Cloud). They give a shit-ton of free TPU hours just for filling out a Google form (and since you are students, it would be even easier for you). The only catch is that you have to attach a credit card, though you won’t have to pay anything.

Have a fantastic day! :wave:

I was thinking of something similar to this. AFAIK, there are two problems in Colab. First, you have just two cores, which makes the tokenization of very big datasets a bit slow. Let’s say you do so and save it to disk… then you have to face the second issue, the IO performance of disks in Colab. Check this issue: "Slow dataloading with big datasets issue persists", Issue #2252 in huggingface/datasets on GitHub.

Performing something more modest, such as fine-tuning a BERT model on a dataset of a few gigabytes, is still problematic because of this.
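One way to keep tokenization from exhausting RAM, and to survive disconnects, is to stream the corpus in batches and write each tokenized batch to its own shard file. The sketch below uses only the standard library. The `encode` argument is a stand-in for whatever tokenizer you trained (any function from a string to a list of token ids), and the shard naming and JSON-lines layout are assumptions for illustration, not something the datasets library prescribes.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, List


def iter_line_batches(path: str, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size non-empty lines, reading lazily."""
    batch: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def tokenize_to_shards(
    path: str,
    out_dir: str,
    encode: Callable[[str], List[int]],
    batch_size: int = 10_000,
) -> int:
    """Tokenize a text file batch by batch, one JSON-lines shard per batch.

    Shards that already exist are skipped, so rerunning after a crash
    resumes the work (as long as batch_size is unchanged). Returns the
    total number of shards.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_shards = 0
    for i, batch in enumerate(iter_line_batches(path, batch_size)):
        shard = out / f"shard-{i:05d}.jsonl"
        n_shards += 1
        if shard.exists():
            continue  # finished in an earlier session
        # Write to a temporary name first so a crash never leaves a
        # half-written file under the final shard name.
        tmp = shard.with_name(shard.name + ".tmp")
        with open(tmp, "w", encoding="utf-8") as f:
            for line in batch:
                f.write(json.dumps(encode(line)) + "\n")
        tmp.replace(shard)
    return n_shards
```

Only one batch is held in memory at a time, so peak RAM depends on `batch_size` rather than on the 35 GB corpus. Pointing `out_dir` at persistent storage means a restarted session only redoes the batch that was in flight.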

? I get 32 CPU cores from the TPU runtime in Colab, and the IO performance of the disks in Colab is decent enough to load the batches without bottlenecking too much. In most cases the bottleneck is the compute itself. Network speed is also pretty good, at around 100-120 Mbps from Drive, so dataset loading doesn’t take too much time.

32 cores? Strange. I’m getting this (in a Colab Pro instance):

!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB
physical id	: 0

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB

Regarding the other issue, perhaps it’s more of a problem in datasets. Not quite sure about that.

HTOP output - seems like that to me. 35GB RAM and 40 CPU cores (TPUv2)


Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2299.998
BogoMIPS:            4599.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities

@Neel-Gupta I think your output comes from connecting to the Colab instance over SSH, while what I see (two CPU cores) comes from using the web interface.

I’m not sure that using Colab instances over SSH is allowed under the TOS of Colab Pro. Moreover, I’ve been trying to use them and I get frequent disconnections and shutdowns.

No, I don’t even know how to SSH into Colab. I am using the simple web version.

Though perhaps I have access to PRO, so that might be a big factor.

@Neel-Gupta I have access to Pro too. And you are right: lscpu says I have 40 cores available. Strange that cat /proc/cpuinfo says something different.
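A quick cross-check from inside the notebook, independent of either shell command, is to ask Python directly. This is only a sketch: `os.sched_getaffinity` exists on Linux (which Colab runs) but not on every platform, so the code falls back to `os.cpu_count()` when it is missing. The function name `cpu_report` is made up for illustration.

```python
import os


def cpu_report() -> dict:
    """Report logical CPUs and the CPUs this process is allowed to run on."""
    logical = os.cpu_count() or 1  # cpu_count() can return None
    if hasattr(os, "sched_getaffinity"):
        # Linux only: the set of CPUs the current process may be scheduled on.
        usable = len(os.sched_getaffinity(0))
    else:
        usable = logical
    return {"logical": logical, "usable": usable}


print(cpu_report())
```

The `usable` figure is the one that matters for parallel tokenization; it is a sensible upper bound for the `num_proc` argument of `Dataset.map` in datasets.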

Regarding the other issue that I mentioned before (IO problems), it seems more like a current issue in datasets. Even without these two problems, the memory limit is still problematic, even in the 32 GB setting that Colab Pro provides. In some cases, reloading a checkpoint using transformers consumes a lot of RAM.

As you mentioned before, applying to TRC (TPU Research Cloud) seems to be the best option.
