Can I keep a model in GPU VRAM and iterate on my program without re-uploading it?

Hello!

Because it's quite slow to load the model into GPU VRAM, I was wondering if there's a way to keep the model (Falcon-7B at the moment) loaded on the GPU and then access it from a different Python program, i.e. one other than the program that uploaded it?

I basically want to keep the model loaded and ready to receive requests, while still being able to change my code as usual in PyCharm and hit the run button repeatedly without having to re-upload the model.

Is this possible to do?

Cheers!
Fred


When I do this type of thing, I usually use Text Generation Inference (TGI) from Hugging Face to load the model, and then send it requests from the Python code I'm working on.

First, a bash runner script to launch the TGI instance:

#!/usr/bin/env bash

model_name_or_path="tiiuae/falcon-7b"
port=2232

# Launch TGI in Docker, exposing it on $port and mounting /data
# for the model cache. See the TGI docs for other args you may need.
docker run \
    --rm \
    -it \
    --gpus '"device=0"' \
    -p $port:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:1.0.3 \
    --model-id $model_name_or_path \
    --sharded false
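
Once the container is up, you can sanity-check that it's serving before touching any Python. (This check isn't in the original post, just a sketch: TGI exposes a /health route, and the port here matches the runner script above.)

# Returns HTTP 200 once the model has finished loading
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:2232/health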

And then some python to call the model:

from text_generation import Client

# Same port as specified in the runner script above
TGI_URL = "http://127.0.0.1:2232"

client = Client(TGI_URL)
prompt = "Why is the sky blue?"  # whatever you want to send to the model

# Other generation args (temperature, etc.) can be passed here
outputs = client.generate(prompt, max_new_tokens=128, top_p=0.9)
print(f"Outputs: {outputs.generated_text}")

You can then call the model repeatedly and keep changing your Python code without reloading the model over and over. As a bonus, TGI optimizes inference, so it's usually much faster than running the model directly yourself.
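
You also don't have to use the text_generation client at all: the server speaks plain HTTP, so anything that can POST JSON works. A sketch with requests, assuming the same port as above:

import requests

# POST directly to TGI's /generate endpoint; nothing is loaded in this process
resp = requests.post(
    "http://127.0.0.1:2232/generate",
    json={"inputs": "Why is the sky blue?", "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])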


Hi, and thank you so much for the tip! 🙂
I will try this out!

Cheers!
