Got it! `enable_model_cpu_offload()` really boosts inference speed: in my case it went from 15 s/it to 1 s/it.
See the *Speed up inference* guide in the diffusers docs.
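For reference, a minimal sketch of where that call fits in a diffusers pipeline (the model ID and prompt are illustrative, and `accelerate` must be installed for offloading to work):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load in half precision to reduce memory pressure (model ID is illustrative)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Keep whole components (UNet, text encoder, VAE) on the CPU and move each
# to the GPU only while it runs; note: do NOT call pipe.to("cuda") as well,
# since offloading manages device placement itself
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```

The speedup presumably comes from avoiding VRAM overflow: when the full pipeline does not fit on the GPU, offloading whole components is much cheaper than the driver swapping memory mid-step.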