I’m following this tutorial. I’ve simply copy-pasted the code verbatim into a local Jupyter notebook.
When I run trainer.train() I get this error:
Trainer is attempting to log a value of "[nan, 0.22304851875478263, 0.9226045169903048, 0.0, 0.00034364829394438505, 0.0002366915040712859, nan, 0.00014368284700068667, 0.0, 0.0, 0.8146361679778702, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8902991532345663, 0.0, 0.00042407633933944864, 0.0, 0.0, nan, 0.0, 0.0, 0.0, 0.0, 0.9226799895754838, 0.0016839565374974338, 0.4636751073508497, 0.0, 0.0, 0.00042719251601971937, 0.0]" of type <class 'list'> for key "eval/per_category_accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
On our side, we've already made some updates (metrics now live in the evaluate library rather than the datasets library), and I'll update the blog accordingly.
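In the meantime, one way to avoid that warning is to flatten the list-valued per-category metrics into individual scalar entries before compute_metrics returns, so everything logged is a plain float. A rough sketch only, assuming id2label is the label mapping used in the tutorial and that the key names match what mean_iou returns:

```python
import numpy as np

def flatten_per_category(metrics, id2label):
    """Turn list-valued metrics into one scalar entry per category."""
    for key in ("per_category_accuracy", "per_category_iou"):
        values = metrics.pop(key, None)
        if values is None:
            continue
        prefix = "accuracy" if "accuracy" in key else "iou"
        for i, v in enumerate(np.asarray(values).tolist()):
            # nan entries stay nan, but each logged value is now a scalar
            metrics[f"{prefix}_{id2label[i]}"] = v
    return metrics
```

You would call this on the dict returned by metric.compute() at the end of compute_metrics.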
I was able to train nvidia/mit-b0 on your segments/sidewalk-semantic demo dataset, and it works well.
The training time is about 6 hours for 50 epochs, on an RTX 3090 with a Ryzen 5 3600X (6 cores / 12 threads). I suspect the bottleneck is train_transforms and val_transforms, which run single-threaded. Before I go ahead and try to convert those to multiprocessing, is my assumption right?
Any suggestions for converting the feature extraction and augmentation to multiprocessing? If I instantiate a pool inside train_transforms() and val_transforms() on every call, hand the images and labels to the pool, and then collect the results, I suspect I'll lose time recreating the pool each time the functions are invoked. Is there a better way?
I do not want to pre-convert all images, since I want the model to benefit from random augmentations each epoch.
Please disregard the previous message. I’ve found TrainingArguments(dataloader_num_workers=N) and I set that to the number of CPU cores on my system. Now training is an order of magnitude faster.
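In case it helps anyone else, the change amounts to one extra argument on TrainingArguments. A minimal sketch, with the output directory as a placeholder and all the other tutorial hyperparameters omitted:

```python
import os
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="segformer-b0-finetuned",    # placeholder; keep whatever the tutorial uses
    dataloader_num_workers=os.cpu_count(),  # one dataloader worker per logical CPU
    # ... learning rate, batch sizes, eval/save strategy, etc. as in the tutorial
)
```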
It is still very slow when it gets to ***** Running Evaluation *****, and neither the CPU nor the GPU is fully utilized at that step: one CPU core sits at 100% while the GPU is barely at 20% compute.
Is there any way to speed up evaluation?
I also see this warning very often, which may or may not be related:
.local/lib/python3.10/site-packages/transformers/data/data_collator.py:131: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at …/torch/csrc/utils/tensor_new.cpp:201.)
batch[k] = torch.tensor([f[k] for f in features])
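From what I can tell, that warning comes from the default collator building each batch tensor out of a Python list of per-example numpy arrays. One thing I may try is a small custom collator that stacks the arrays first; a rough sketch, assuming the transforms return one numpy array per example for each key:

```python
import numpy as np
import torch

def stacking_collator(features):
    """Stack per-example numpy arrays before building batch tensors."""
    batch = {}
    for key in features[0].keys():
        values = [f[key] for f in features]
        if isinstance(values[0], np.ndarray):
            # one contiguous array, then a single tensor conversion
            batch[key] = torch.from_numpy(np.stack(values))
        else:
            batch[key] = torch.tensor(values)
    return batch

# Passed to the Trainer via its data_collator argument:
# trainer = Trainer(..., data_collator=stacking_collator)
```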
After using multiprocessing for both the dataloader and the evaluation, training time for 50 epochs went from 6 hours to 40 minutes.
I think even simple examples such as the code on your blog should default to multiprocessing. It’s just a couple extra arguments, and it makes the whole thing much faster.
I run the notebooks in VSCode, either locally (on my gaming PC with a fast GPU) or remotely via the SSH extension (VSCode runs on the laptop, the Jupyter kernel on my gaming PC). The progress bars render differently in the remote sessions, and I also seem to get more errors there.
I do not have an explanation for it, and it could still be a coincidence. But for now I’ve switched to running everything locally and I do not see major issues anymore.
Did you use metric.compute or metric._compute when calculating metrics? I’m seeing a massive speedup when using metric._compute. I’ve reported this to the evaluate team and they’re looking into it.
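For reference, this is what I mean. It's a sketch only: the prediction/reference arrays and num_labels are whatever your compute_metrics already passes, and _compute is a private method, so this is a workaround rather than a supported API:

```python
import evaluate

metric = evaluate.load("mean_iou")

def metrics_via_compute(preds, refs, num_labels):
    # Public path: compute() routes the inputs through the library's
    # add_batch/caching machinery before the actual calculation.
    return metric.compute(predictions=preds, references=refs,
                          num_labels=num_labels, ignore_index=0)

def metrics_via_private_compute(preds, refs, num_labels):
    # Private path: _compute() does the actual calculation directly,
    # skipping the caching layer. It may change without notice.
    return metric._compute(predictions=preds, references=refs,
                           num_labels=num_labels, ignore_index=0)
```

(ignore_index=0 here just mirrors what I pass in my compute_metrics; use whatever your setup already uses.)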