Hi,
I am currently building my own RAG application, which I have deployed on a Hugging Face Space. After using it for a while, I received the following error:
Request failed during generation: Server Error: Out of cache blocks: asked 3117, only 2916 free blocks
That is when I realized that I had no idea how the caching was being handled.
I am using the following code to run inference against the model:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id=llm_model,            # model id on the Hub
    temperature=temperature,
    max_new_tokens=max_tokens,    # cap on the number of generated tokens
    streaming=True,
    task="text2text-generation",
    top_k=top_k,
    # top_p=0.95,
    repetition_penalty=1.0,
)
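For context, here is roughly how I call the endpoint (a simplified sketch; in reality the prompt is assembled by my RAG chain from the retrieved context and the user question):

# "prompt" is a stand-in for the context + question my chain builds
prompt = "..."

# stream the generated tokens back as they arrive
for chunk in llm.stream(prompt):
    print(chunk, end="", flush=True)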
Can you enlighten me on how caching is done with HuggingFaceEndpoint and Hugging Face Spaces? (Is Hugging Face using the data from our cache?)