I’m trying to use a custom container for inference, but I can’t find anywhere to set the args for the command that starts the container’s serving process. Anyone know what I’m missing?
For example, I need to set things like this:
```python
f"--model={model_id}",
f"--tensor-parallel-size={accelerator_count}",
"--swap-space=16",
f"--dtype={dtype}",
"--gpu-memory-utilization=0.9",
```
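For context, here is roughly how I’m trying to wire it up. This is only a sketch: I’m assuming the Vertex AI SDK’s `aiplatform.Model.upload` with its `serving_container_args` parameter is the right place for these, and all the concrete values (`model_id`, image URI, etc.) are placeholders, not my real config:

```python
# Placeholder values -- substitute your own model and hardware settings.
model_id = "meta-llama/Llama-2-7b-hf"
accelerator_count = 1
dtype = "bfloat16"

# These are standard vLLM server flags; the list itself is what I want
# the platform to pass to the container's entrypoint at startup.
serving_args = [
    f"--model={model_id}",
    f"--tensor-parallel-size={accelerator_count}",
    "--swap-space=16",
    f"--dtype={dtype}",
    "--gpu-memory-utilization=0.9",
]

# ASSUMPTION: serving_container_args is where these should go when
# uploading the model (commented out since it needs GCP credentials):
# from google.cloud import aiplatform
# model = aiplatform.Model.upload(
#     display_name="vllm-custom",
#     serving_container_image_uri="<my-custom-image-uri>",
#     serving_container_args=serving_args,
# )
```

If there is a different mechanism for injecting these at container start (env vars, a command override, etc.), that is exactly what I am looking for.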