HuggingFace Trainer() does nothing - only on Vertex AI Workbench, works on Colab

I am having issues getting the Trainer() function in Hugging Face to actually do anything on Vertex AI Workbench notebooks.

I’m totally stumped and have no idea how to even begin debugging this.

I made this small notebook: colabs/huggingface_text_classification_quickstart.ipynb at master · andrewm4894/colabs · GitHub

If you set framework=pytorch and run it in Colab, it runs fine.

I wanted to move from Colab to something more persistent, so I tried Vertex AI Workbench notebooks on GCP. I created a user-managed notebook (PyTorch:1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1). If I try to run the same example notebook in JupyterLab on that instance, it just seems to hang on the Trainer() call and do nothing.

It looks like the GPU is not doing anything either for some reason (it might not be supposed to, since I think Trainer() is some pretraining step):

(base) jupyter@pytorch-1-11-20220819-104457:~$ nvidia-smi
Fri Aug 19 09:56:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
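
The nvidia-smi output above shows that the driver sees the T4, but not whether the PyTorch build inside the notebook kernel can use it. A minimal sketch for checking that from Python follows. The helper name cuda_report is made up for illustration and is not part of PyTorch or Transformers, and the snippet assumes nothing about the Vertex image beyond a working Python interpreter.

```python
import importlib.util


def cuda_report():
    """Return a small dict describing whether PyTorch can see a CUDA device.

    cuda_report is a hypothetical helper written for this example.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_for_cuda": None,
        "cuda_available": False,
        "device_name": None,
    }
    # Bail out early if PyTorch is not installed in this kernel.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If cuda_available comes back False while nvidia-smi lists the card, that would point at a mismatch between the installed PyTorch build and the driver or CUDA toolkit rather than at Trainer() itself.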

I found this thread, which seems like it might be a similar problem, so I played with as many Trainer() args as I could, but no luck.

So I'm kind of totally blocked here. I refactored the code to use TensorFlow, which does work for me (after I installed TensorFlow on the notebook), but it's much slower for some reason.

Basically this was all working great on Colab (in the actual code I'm working on), but when I tried to move to Vertex AI notebooks I got blocked by this strange issue.

Any help or advice is much appreciated. I’m new to Hugging Face and PyTorch too, so I'm not even sure what things I might try, or how to run this in a debugger.
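
One generic way to see where a hanging Python call is stuck is the standard library's faulthandler module, which can dump the stack of every thread after a timeout. The sketch below is only an illustration: the wrapper name run_with_hang_dump and the log file name hang_trace.log are made up, and nothing in it is specific to Trainer() or Vertex AI.

```python
import faulthandler


def run_with_hang_dump(fn, timeout_s=60, log_path="hang_trace.log"):
    """Call fn(); while it runs, dump all thread stacks every timeout_s seconds.

    The dump goes to log_path (a real file, because faulthandler needs a
    file descriptor, which notebook output streams may not provide).
    """
    with open(log_path, "w") as log:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            return fn()
        finally:
            # Stop the watchdog once fn() returns or raises.
            faulthandler.cancel_dump_traceback_later()


# Stand-in workload that finishes well before the timeout.
result = run_with_hang_dump(lambda: sum(range(1000)), timeout_s=5)
print(result)
```

In the notebook, the stand-in lambda would be replaced by whatever call hangs, and the top frames written to the log file should show which library call never returns.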

Hi @andrewm4894, it might be that you need to move the model explicitly to the GPU. See this discussion for more information. I’m not familiar with Vertex AI Workbench, but from your description (and notebook) it seems that the GPU is not being used on Vertex.
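
A minimal sketch of what moving a model to the GPU explicitly looks like in PyTorch, assuming a small torch.nn.Linear as a stand-in for the real model. The helper name move_to_best_device is made up for this example. As far as I know, Trainer normally places the model on its configured device itself, so this is mostly useful as a diagnostic. The find_spec guard is only there so the snippet still runs on an image without PyTorch.

```python
import importlib.util


def move_to_best_device(model):
    """Move a torch.nn.Module to the GPU if one is usable, else keep it on CPU.

    move_to_best_device is a hypothetical helper, not a library function.
    """
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # nn.Module.to() moves parameters in place and returns the module.
    return model.to(device), device


if importlib.util.find_spec("torch") is not None:
    import torch

    model = torch.nn.Linear(4, 2)  # stand-in for the real model
    model, device = move_to_best_device(model)
    # The parameters should now report the same device type as `device`.
    print(device, next(model.parameters()).device)
```

If the printed device is cpu on a machine with a T4 attached, the PyTorch install cannot see the GPU and the problem lies below Trainer().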

Thanks @wvangils - I will try this if I run into this again. I was actually able to work around the issue by making a new notebook with the default Python image in Vertex and then installing Hugging Face and PyTorch myself.

I noticed that if I make a new notebook with NumPy/SciPy/scikit-learn, 4 vCPUs, 15 GB RAM, NVIDIA Tesla T4 x (instead of the official PyTorch one from the dropdown) and install PyTorch myself with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch, it all works.

So I’m guessing it is some strange Vertex bug with that image, perhaps. I created a bug here for GCP, so we will see if someone there can re-create it.

https://issuetracker.google.com/issues/243267023

Just linking it all in here in case anyone else hits this.