The model is "meta-llama/Llama-2-7b-chat-hf".
I'm using the sample code, which queries the model with "I liked "Breaking Bad" and "Band of Brothers"…"
When the Space runs, it downloads the 9GB+ model, which takes a while, and then outputs a response.
But if I refresh the page or use the embed code, the whole model download starts over again.
I am at a loss to understand why this happens; I have even upgraded to persistent storage.
What Space SDK are you using?
I just read the section about setting the HF_HOME variable, so I'm restarting and trying that.
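For anyone else hitting this, a minimal sketch of what that looks like. Assumptions: persistent storage is enabled and mounted at `/data` (the mount point on Spaces), and the `.huggingface` subdirectory name is just an illustrative choice:

```python
import os

# Point the Hugging Face cache at the Space's persistent volume so
# downloaded model weights survive restarts. This must run BEFORE any
# transformers/huggingface_hub import, otherwise those libraries pick
# up the default ~/.cache/huggingface location instead.
os.environ["HF_HOME"] = "/data/.huggingface"
```

Alternatively, HF_HOME can be set as a variable in the Space's settings instead of in code, which avoids the import-order concern entirely.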
OK, I now have the model persisting on disk, so it no longer has to be downloaded every time.
I'm running on an A10G, and it seems that every time I refresh the page the model has to be loaded into GPU memory again, which takes a few minutes.
Is there a way to persist the model in memory? I have set the GPU to sleep after an hour.
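One thing worth checking: if the load happens inside the request handler, every request pays the cost, whereas anything loaded at module scope (or cached) stays in process memory across page refreshes, since a refresh does not restart the app process. A sketch of the caching pattern, with a placeholder standing in for the real `from_pretrained` call:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Expensive load, executed only once per process.

    In the real Space this would be something like:
        AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf", device_map="auto")
    """
    print("loading model into GPU memory...")
    return object()  # placeholder for the loaded model

# Call get_model() from the request handler instead of loading there;
# repeated calls return the same cached object without reloading.
```

Note this only helps while the process is alive: once the Space goes to sleep after the hour of inactivity, GPU memory is released and the next wake-up still pays the load time.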
What was the specific setting you made? I’m having the same issue with the model downloading every time I restart.