I’m trying to use a RoBERTa model for inference on CPU in a production environment. The model is trained in Python and then exported to a TorchScript model for inference in Java (using the libtorch library). During inference I see that multiple threads and cores are being utilized. When running on a machine with 40 cores, htop shows 21 threads in use and CPU utilization at around 1700%. On a 16-core machine I get around 450% CPU utilization.

According to the doc "CPU threading and TorchScript inference" (PyTorch 1.12 documentation), the default number of threads PyTorch uses for intra-op parallelism is the number of CPU cores. Since the numbers I observe with htop (and other monitoring tools) don't match that, I wonder whether scripted models use a different default configuration, or whether transformers models, or the RoBERTa model specifically, are configured differently than PyTorch’s defaults (e.g., MKL_NUM_THREADS is set somewhere).
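To rule out the environment-variable possibility from the Java side, a minimal sketch like the one below could be run in the same environment as the inference process. It only uses the Java standard library: it prints the number of logical processors the JVM sees and the values of `OMP_NUM_THREADS` and `MKL_NUM_THREADS`, if set. The class name `ThreadEnvCheck` and the helper `threadSettings` are made up for illustration, and whether a given libtorch build actually honors these variables is an assumption that depends on how it was compiled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ThreadEnvCheck {
    // Environment variables commonly read by OpenMP and MKL runtimes.
    // Assumption: the libtorch build in use is linked against one of them.
    static final String[] VARS = {"OMP_NUM_THREADS", "MKL_NUM_THREADS"};

    // Returns each variable mapped to its value, or "(unset)" if absent.
    static Map<String, String> threadSettings(Map<String, String> env) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String name : VARS) {
            String value = env.get(name);
            out.put(name, value == null ? "(unset)" : value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Logical processors (hardware threads) visible to the JVM.
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("logical processors seen by JVM: " + logical);

        // Print the thread-related variables as the process sees them.
        threadSettings(System.getenv())
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If both variables print as `(unset)`, the thread count is coming from the library's own defaults rather than from the environment.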
Can someone please shed light on this for me? Any help would be greatly appreciated!