Hi,
I am currently building my own RAG application, which I have deployed on a Hugging Face Space. After using it for a while, I received the following error:
Request failed during generation: Server Error: Out of cache blocks: asked 3117, only 2916 free blocks
That is when I realized that I had no idea how the caching was being handled.
I am using the following code to run inference against the model:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id=llm_model,            # model id on the Hub
    temperature=temperature,
    max_new_tokens=max_tokens,    # cap on the number of generated tokens
    streaming=True,
    task="text2text-generation",
    top_k=top_k,
    # top_p=0.95,
    repetition_penalty=1.0,
)
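For context, here is roughly how I call the endpoint (a simplified sketch; in reality the prompt is assembled by my RAG chain from the retrieved context and the user question):

# "prompt" is a stand-in for the context + question my chain builds
prompt = "..."

# stream the generated tokens back as they arrive
for chunk in llm.stream(prompt):
    print(chunk, end="", flush=True)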
Can you enlighten me on how caching is done with HuggingFaceEndpoint and Hugging Face Spaces? (Is Hugging Face using the data from our cache?)