We have a 35 GB dataset for the Turkish language. We've preprocessed and cleaned the whole text, but we have no GPU and no powerful machine, which is why we hope Colab Pro can make it happen.
We want to pre-train a BERT, a Longformer, a BigBird and a GPT-2.
When we start tokenizing the text for training, Colab crashes. We are looking for a complete guide to training these models with checkpoints, in a way that is feasible on Colab Pro.
No Google Cloud suggestions please. It costs more than $7k, and we are just independent student researchers with pocket money.
So please guide us. We will share the models via Hugging Face, so it's a win-win situation.
Kind regards
emre
Yes, I have been stuck on this issue for a long time and I still haven’t been able to solve it. I’m open to any ideas that might help.
That's a great topic! I've been trying to figure out how to do this for weeks, and unfortunately I haven't come across many useful resources. So I would be glad for any suggestions, even basic ones!
Oh, very expensive indeed! This is a topic I am very curious about as well. Can it be solved more cost-effectively?
You can use the TPUs provided by Colab Pro - they run for a sizeable amount of time if you prod the session every few hours or so and store checkpoints.
The most problematic aspect would be the model size, so you would have to implement model parallelism, trading off training time, in order to achieve good accuracy.
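Something like this is the rough shape of the checkpoint-and-resume loop with the Trainer API (just a sketch, not a tested recipe - the paths, names and hyperparameters are placeholders, and on a TPU runtime you would additionally need torch_xla or the xla_spawn.py launcher from the transformers examples):

# Sketch only: assumes a tokenizer and a pre-tokenized dataset already saved to Drive.
from google.colab import drive
from datasets import load_from_disk
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

drive.mount('/content/drive')
ckpt_dir = '/content/drive/MyDrive/gpt2-tr-checkpoints'   # placeholder path

tokenizer = GPT2TokenizerFast.from_pretrained('/content/drive/MyDrive/gpt2-tr-tokenizer')
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
dataset = load_from_disk('/content/drive/MyDrive/tokenized_tr')

model = GPT2LMHeadModel(GPT2Config(vocab_size=len(tokenizer)))

args = TrainingArguments(
    output_dir=ckpt_dir,            # checkpoints land on Drive, so they survive a disconnect
    per_device_train_batch_size=8,
    save_steps=500,                 # save often; a session can die at any time
    save_total_limit=2,             # keep Drive usage bounded
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# On the first session just call trainer.train(); on every later session pass
# resume_from_checkpoint=True to continue from the latest checkpoint in output_dir.
trainer.train(resume_from_checkpoint=False)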
If you want good accuracy plus almost free compute, then check out TFRC (TPU Research Cloud) - they give you a huge amount of free TPU hours just for filling out a Google form (and since you are students, it would be even easier for you). The only catch is that you have to attach a credit card, though you won't have to pay anything.
Have a fantastic day!
I was thinking of something similar to this. AFAIK, there are two problems in Colab. First, you have just two CPU cores, which makes tokenization of very big datasets rather slow. Let's say you do it anyway and save the result to disk... then you face the second issue, the IO performance of the disks in Colab - see this issue: Slow dataloading with big datasets issue persists · Issue #2252 · huggingface/datasets · GitHub
Even something more modest (fine-tuning a BERT model on a dataset of a few gigabytes) is still problematic because of this.
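To be concrete, the tokenization step I mean is roughly the following - num_proc is exactly where the 2-core limit hurts; the paths and the tokenizer name are just examples:

# Offline tokenization sketch; paths and the tokenizer name are placeholders.
import os
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
raw = load_dataset("text", data_files={"train": "turkish_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(
    tokenize,
    batched=True,
    num_proc=os.cpu_count(),   # only 2 on a standard Colab runtime
    remove_columns=["text"],
)

# Saving to disk (ideally Drive) means the tokenization cost is paid once,
# but reading it back is where the slow Colab disk IO shows up.
tokenized.save_to_disk("/content/drive/MyDrive/tokenized_tr")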
? I get 32 CPU cores from the TPU runtime in Colab, and the IO performance of the disks in Colab is decent enough to load the batches without bottlenecking too much - in most cases the bottleneck is the compute itself. Network speed is also pretty good, around 100-120 Mbps from Drive, so dataset loading doesn't take too much time.
32 cores? Strange. I’m getting this (in a Colab Pro instance):
!cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2299.998
cache size : 46080 KB
physical id : 0
...
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2299.998
cache size : 46080 KB
...
Regarding the other point, perhaps it's more a problem with the datasets library. Not quite sure about that.
htop output - seems like it to me: 35 GB RAM and 40 CPU cores (TPUv2).
lscpu output:
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
@Neel-Gupta I think your output comes from connecting to the Colab instance via SSH, while what I see (two CPU cores) is what you get from the web interface.
I'm not sure that using Colab as an SSH instance is within the TOS of Colab Pro. Moreover, when I've tried to use it that way I get frequent disconnections and shutdowns.
No, I don't even know how to SSH into Colab - I am using the plain web version.
Though I do have access to Pro, so that might be a big factor.
@Neel-Gupta I have access to Pro too. And you are right - lscpu says I have 40 cores available. Strange that cat /proc/cpuinfo says something different.
Regarding the other issue that I mentioned before (the IO problems), it seems more like a current issue in the datasets library. Even setting those two problems aside, the memory limit is still problematic, even with the 32 GB that Colab Pro provides. In some cases, reloading a checkpoint using transformers consumes a lot of RAM.
As you mentioned before, applying to TRC (TPU Research Cloud) seems to be the best option.
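For what it's worth, on recent transformers versions the reload spike can sometimes be reduced by passing low_cpu_mem_usage=True, which avoids materialising a full set of randomly-initialised weights before the saved ones are read in (it may require accelerate to be installed; the path below is just a placeholder):

from transformers import AutoModelForCausalLM

# Placeholder Drive checkpoint path; low_cpu_mem_usage skips the extra in-memory
# copy of the weights that from_pretrained otherwise creates.
model = AutoModelForCausalLM.from_pretrained(
    "/content/drive/MyDrive/gpt2-tr-checkpoints/checkpoint-5000",
    low_cpu_mem_usage=True,
)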
@Neel-Gupta Hi. Do you still get the 40-CPU access every day? I have an issue where the allocated instance has only 2 CPU cores even though I select a TPU/High-RAM instance. I'm sure I could access 40 CPUs until late September, but right now I get 2.
My Colab subscription is Pro+.
Are you sure it's the $50 Pro+? In that case, try installing htop and seeing how many cores it shows - you can use the terminal feature in the bottom-left corner of Colab.
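You can also check it straight from a notebook cell, without installing anything:

import os
print(os.cpu_count())                 # logical CPUs visible to the VM
print(len(os.sched_getaffinity(0)))   # CPUs this process is actually allowed to use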
Lately we've had some so-called AI "artists" taking up most of the GPUs, so you might have trouble until Colab puts out a blanket ban on them.
Yes, I confirmed that I am on the $50 Colab Pro+. I checked htop in the bash terminal as you pointed out - still 2 cores. I also checked with the lscpu command.
I suspected that allocated Colab instances might not all have the same specs, so I tried "terminate" and "reconnect" 20 times. However, it was always 2 CPU cores.
I wonder if there might be a blanket ban on Colab, as @Neel-Gupta pointed out. Unfortunately, Colab might have classified me as an "AI artist". If so, my question is what the criteria are and how I can get the ban lifted. If someone knows, please let me know.
Oh no, there's no ban on such uses as of yet.
But the resources might be strained, so you might not get a multi-core CPU with the TPU runtime. Make the best of that, I guess - there is no way to circumvent Colab allocations, since they depend very much on the traffic…
Hi, did you find an acceptable solution? I'm answering (years later) because your task reminds me of my training of GPT2-Medium from scratch on the free Colab with a single Tesla T4. My dataset was far smaller though (at most about 140 MB, gradually growing), yet even that was too big for a "normal" training run on Colab.
Due to the interruptions and the memory overhead, each training session used only a slice of the whole dataset, either deterministic or selected with some randomness of position and size within the files, and the checkpoint was saved to Google Drive frequently. So it was not the normal training process where the whole dataset is prepared and iterated over at once.
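Roughly what I mean by a slice, expressed in Hugging Face datasets terms (not exactly what I used at the time, just to illustrate the idea; the sizes and paths are arbitrary):

import random
from datasets import load_from_disk

full = load_from_disk("/content/drive/MyDrive/tokenized_corpus")   # placeholder path

slice_size = 50_000                                    # examples per session
start = random.randrange(0, len(full) - slice_size)    # random position within the corpus
session_dataset = full.select(range(start, start + slice_size))

# session_dataset is then passed to the Trainer as train_dataset, with output_dir
# on Drive and resume_from_checkpoint=True on every session after the first.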
Colab's help says a session can last up to 12 hours; mine were, I think, on very rare occasions up to 6-7 hours at best, in the summer, at night. On some days there were just 2-3 hours with a GPU.
If there are several of you, you could share Google Drive storage for the dataset and the checkpoints and take turns training on different accounts and different Internet connections/IPs (hoping there won't be penalties for that).
Here's a tutorial about the training process (note that the vocabulary size there is wrong, 50255 instead of 50257, which may be problematic for conversion etc.); scroll down for the GPT-2 part.