I’m having trouble getting the Hugging Face Trainer() call to actually do anything on Vertex AI Workbench notebooks.
I’m totally stumped and have no idea how to even begin debugging this.
I made this small notebook: colabs/huggingface_text_classification_quickstart.ipynb at master · andrewm4894/colabs · GitHub
If you set framework=pytorch and run it in Colab, it runs fine.
I wanted to move from Colab to something more persistent, so I tried Vertex AI Workbench notebooks on GCP. I created a user-managed notebook (PyTorch:1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1). When I run the same example notebook in JupyterLab on that instance, it just seems to hang on the Trainer() call and do nothing.
The GPU doesn’t seem to be doing anything either (though maybe it isn’t supposed to yet, since I think Trainer() is some kind of pre-training step):
(base) jupyter@pytorch-1-11-20220819-104457:~$ nvidia-smi
Fri Aug 19 09:56:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
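One quick check that might narrow this down is whether PyTorch inside the notebook kernel can see the T4 at all, since nvidia-smi only shows what the driver sees. A minimal sketch: describe_torch_device is just a name made up for illustration, and it relies only on torch.cuda.is_available() and torch.cuda.get_device_name().

```python
import importlib.util


def describe_torch_device():
    """Report whether PyTorch is importable and whether it can see a CUDA device.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch  # imported lazily so the helper still runs without PyTorch

    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # Index 0 is the first (here, the only) GPU attached to the instance.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_device())
```

If cuda_available came back False in the same kernel that runs the notebook, that would point at the PyTorch/CUDA install in that environment rather than at Trainer() itself.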
I found this thread that seems to describe a similar problem, so I played with as many Trainer() args as I could, but no luck.
So I’m kind of totally blocked here. I refactored the code to use TensorFlow, which does work for me (after I installed TensorFlow on the notebook), but it’s much slower for some reason.
Basically this was all working great on Colab (including in the real code I’m working on), but since moving to Vertex AI notebooks I’m blocked by this strange issue.
Any help or advice is much appreciated. I’m new to Hugging Face and PyTorch too, so I’m not even sure what to try or how to run this in some kind of debug mode.
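One way to at least see where the call is stuck might be Python’s built-in faulthandler module, which can dump every thread’s stack trace to a file if something runs longer than a timeout. A rough sketch, not a confirmed fix: run_with_hang_watchdog, slow_stand_in and the log file name are all hypothetical, and the real thing to wrap would be whichever cell hangs.

```python
import faulthandler
import time


def run_with_hang_watchdog(fn, timeout_s=60.0, log_path="hang_traceback.log"):
    """Call fn(); while it is still running, dump all thread stack traces
    to log_path every timeout_s seconds. Returns whatever fn() returns.
    """
    # faulthandler needs a real file descriptor, so write to a file on disk
    # rather than to the notebook's stderr stream.
    with open(log_path, "w") as log_file:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            return fn()
        finally:
            # Stop the watchdog whether fn() returned or raised.
            faulthandler.cancel_dump_traceback_later()


def slow_stand_in(seconds=0.3):
    """Hypothetical stand-in for the call that hangs (e.g. building the Trainer)."""
    time.sleep(seconds)
    return "finished"
```

In the notebook this would look something like run_with_hang_watchdog(lambda: Trainer(...)) with the existing arguments; reading hang_traceback.log afterwards (or from a terminal while the cell is still stuck) should show which line each thread is sitting on.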