I’m writing this question to ask whether there is a better strategy than the one I have in mind.
So, I have a service that should be able to run different Hugging Face models (for sentence embeddings). This service will be containerised (probably with Docker), with a load balancer spawning more or fewer instances of it as needed.
As I said, the service will host multiple models from the Hub and, to simplify, will receive some text to process together with the name of the model to use. In 95% of cases that model is one of 5 models we already know; in the other 5% it’s something else.
My idea would be:
Initialise a dict with the 5 models:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer_dict, model_dict = {}, {}
for model_name in list_of_most_used_models:
    tokenizer_dict[model_name] = AutoTokenizer.from_pretrained(model_name)
    model_dict[model_name] = AutoModel.from_pretrained(model_name).to(self.device)
```
Then, every time a request arrives for another model, I do something like:
```python
if model_name in model_dict:
    model_dict[model_name].do_something()
else:
    # cache miss: load the model (and tokenizer) on first use, then serve it
    tokenizer_dict[model_name] = AutoTokenizer.from_pretrained(model_name)
    model_dict[model_name] = AutoModel.from_pretrained(model_name).to(self.device)
    model_dict[model_name].do_something()
```
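Since the unknown 5% of models could grow this dict without bound, I was also thinking of wrapping the pattern in a small cache with eviction. A minimal sketch of what I mean (`ModelCache`, `loader`, and `max_size` are placeholder names of my own, not from any library):

```python
from collections import OrderedDict

class ModelCache:
    """Preloads the known models, loads unknown ones lazily on first use,
    and evicts the least-recently-used entry once max_size is reached
    so memory stays bounded. Note: a preloaded model can also be evicted
    if it hasn't been used recently."""

    def __init__(self, loader, preload_names, max_size=8):
        self.loader = loader          # e.g. a function wrapping from_pretrained
        self.cache = OrderedDict()
        for name in preload_names:
            self.cache[name] = loader(name)
        self.max_size = max(max_size, len(preload_names))

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as recently used
            return self.cache[name]
        if len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # evict least-recently-used
        model = self.loader(name)
        self.cache[name] = model
        return model
```

In the real service the `loader` would build the `(tokenizer, model)` pair and move it to the device; here it’s just a pluggable callable.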
My idea was to build the Docker image with the 5 models predownloaded, so that when a container instance starts they are already there and I don’t have to download them from the Hub. My question is: if I “predownload” them, can I still call them like:
```python
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-180b")
```
or do I have to do something to let AutoTokenizer know that it can use the local copy? When I download the models, do I have to put them in a particular directory?
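My understanding (which I’d like confirmed) is that `from_pretrained` caches the downloaded files under `HF_HOME` (`~/.cache/huggingface` by default), so a build-time script like the sketch below would make the identical runtime call a pure cache hit. The model name here is a placeholder, not one of our actual 5:

```python
# predownload.py -- intended to run once at image build time,
# e.g. `RUN python predownload.py` in the Dockerfile.
# Assumption: from_pretrained caches under HF_HOME, so the same
# call at runtime resolves locally and needs no network access.
from transformers import AutoModel, AutoTokenizer

MOST_USED_MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",  # placeholder; our 5 names go here
]

for name in MOST_USED_MODELS:
    AutoTokenizer.from_pretrained(name)
    AutoModel.from_pretrained(name)
```

At runtime I would then call `from_pretrained(name)` with the same Hub id, possibly passing `local_files_only=True` (or setting `HF_HUB_OFFLINE=1`) to guarantee nothing is fetched from the network. Is that right?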
More generally, is there a better way to do what I want, or should this already work quite well?