Can I keep a model in GPU VRAM and iterate on my program without re-uploading it?

Hello!

Because it's quite slow to load the model into GPU VRAM, I was wondering if there's a way to keep the model (Falcon-7B at the moment) loaded on the GPU and then access it from a different Python program, i.e. one other than the program that uploaded it?

I basically want to keep the model loaded and ready to receive requests, while still being able to change my code as usual in PyCharm and hit the run button repeatedly without having to re-upload the model.

Is this possible to do?

Cheers!
Fred


When I do this type of thing, I usually use Text Generation Inference (TGI) from Hugging Face to load the model, and then send it requests from the Python code I'm working on.

First, a bash runner script to launch the TGI instance:

#!/usr/bin/env bash

model_name_or_path="tiiuae/falcon-7b"
port=2232

# Launch TGI in Docker, exposing it on $port and mounting /data
# for the model cache. See the TGI docs for other args you may need.
docker run \
    --rm \
    -it \
    --gpus '"device=0"' \
    -p $port:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:1.0.3 \
    --model-id $model_name_or_path \
    --sharded false
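
Once the container is up, you can sanity-check that it's serving before touching any Python. (This check isn't in the original post, just a sketch: TGI exposes a /health route, and the port here matches the runner script above.)

# Returns HTTP 200 once the model has finished loading
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:2232/health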

And then some python to call the model:

from text_generation import Client

# Same port as specified in the runner script above
TGI_URL = "http://127.0.0.1:2232"

client = Client(TGI_URL)
prompt = "Why is the sky blue?"  # whatever you want to send to the model

# Other generation args (temperature, etc.) can be passed here
outputs = client.generate(prompt, max_new_tokens=128, top_p=0.9)
print(f"Outputs: {outputs.generated_text}")

You can then call the model repeatedly and keep changing your Python code without reloading the model over and over. As a bonus, TGI optimizes inference, so it's usually much faster than running the model directly yourself.
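
You also don't have to use the text_generation client at all: the server speaks plain HTTP, so anything that can POST JSON works. A sketch with requests, assuming the same port as above:

import requests

# POST directly to TGI's /generate endpoint; nothing is loaded in this process
resp = requests.post(
    "http://127.0.0.1:2232/generate",
    json={"inputs": "Why is the sky blue?", "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])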


Hi, and thank you so much for the tip! 🙂
I will try this out!

Cheers!
