We have a resource pool of GPU servers on which we deploy user-specified models. In our scenario, a server may need to switch the model it serves frequently.
In the usual workflow, one downloads the model weights to local disk before loading them onto a device such as a GPU. For our use case this is inefficient and takes up too much disk space.
So I'm wondering whether it's possible to load model weights directly onto the device, i.e. pull them straight from the Hub, cache them in memory, and then load them onto the GPU, skipping the disk entirely?
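Conceptually, something like the sketch below is what I have in mind. It's only a rough illustration under some assumptions: the checkpoint is a single `model.safetensors` file that fits in host RAM, and the repo id is a placeholder. It fetches the weight file over HTTP into memory with `requests`, deserializes it with `safetensors.torch.load`, and moves the model onto the GPU without writing the weights to disk (only the small config file is cached):

```python
import requests
import torch
from safetensors.torch import load as load_safetensors
from transformers import AutoConfig, AutoModelForCausalLM

REPO_ID = "some-org/some-model"   # placeholder repo id
FILENAME = "model.safetensors"    # assumes a single-file (unsharded) checkpoint

# Instantiate the architecture without fetching any weights;
# only the small config file is downloaded.
config = AutoConfig.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_config(config)

# Pull the weight file straight into memory; nothing is written to disk.
url = f"https://huggingface.co/{REPO_ID}/resolve/main/{FILENAME}"
resp = requests.get(url, timeout=300)
resp.raise_for_status()

# Deserialize the in-memory bytes into a state dict of CPU tensors,
# load it into the model, and move the model onto the GPU.
state_dict = load_safetensors(resp.content)
model.load_state_dict(state_dict)
model.to("cuda")
```

For a sharded checkpoint one would presumably have to repeat the fetch-and-load step per shard, and for frequent model switching the downloaded bytes could be kept in an in-memory cache keyed by repo id. Is there an existing/supported way to do this, or a better approach?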
Thanks for your advice.