How to Train a Model on CPU with Multiple Processes, Each Using Several Threads?

I can't seem to find a definitive answer on how to train a model on CPU with more than 1 thread per process. When I launch a script that uses Trainer like this:

torchrun --nproc-per-node cpu my_script.py --no_cuda

This starts many processes, and each process uses only 1 thread. What I want is, for example, if my CPU has 16 threads and I start 4 processes, each process should use 16/4 = 4 threads. Is there a way to do this? Do I have to manually add torch.set_num_threads(num_thread_per_process) to the script?
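
For reference, here is a minimal sketch of the manual approach I have in mind. It assumes the script is launched by torchrun, which sets the LOCAL_WORLD_SIZE environment variable to the number of processes on the node. The helper names threads_per_process and configure_threads are my own, not part of any library, and os.cpu_count() reports logical CPUs, which may not match the number of threads I actually want to use.

```python
import os


def threads_per_process(total_threads: int, num_processes: int) -> int:
    """Split total_threads evenly across num_processes, at least 1 each."""
    return max(1, total_threads // num_processes)


def configure_threads() -> int:
    """Set the intra-op thread count for this process and return it.

    Hypothetical helper: meant to be called once at the top of the
    training script, before the Trainer is created.
    """
    # LOCAL_WORLD_SIZE is set by torchrun; fall back to 1 when run directly.
    num_processes = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    # os.cpu_count() can return None, so fall back to 1 in that case.
    total_threads = os.cpu_count() or 1
    n = threads_per_process(total_threads, num_processes)
    try:
        import torch

        # Limits the threads used for intra-op parallelism in this process.
        torch.set_num_threads(n)
    except ImportError:
        # torch is not installed; only the computed value is returned.
        pass
    return n
```

With this at the top of my_script.py, I would launch with an explicit process count, e.g. torchrun --nproc-per-node 4 my_script.py --no_cuda, so that on a 16-thread CPU each process ends up with 4 threads. As far as I understand, torchrun sets OMP_NUM_THREADS to 1 for each process when that variable is not already set and more than one process is started, which would explain the single thread per process I am seeing.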