Using a paid inference endpoint to query a LlamaIndex knowledge graph gives worse results than the free Inference API


I have successfully followed this: and used it to read a 2-page PDF. When I submit queries, the results are very good (well, as good as I need).

To speed things up with a larger document, I have created an inference endpoint based on HuggingFaceH4/zephyr-7b-beta and access it via:

llm = HuggingFaceInferenceAPI(
    model_name="", token=hf_token  # endpoint URL omitted
)

If I now provide a 20-page PDF, the same query gives me poor results, often returning responses stating that there is no relevant data in the document, even though the 2 pages I used originally are contained in the larger document.

I have tried using locally hosted LLMs, but as you can imagine they are too slow on my machine.

Can anyone give me a clue as to why using the same model, but on a paid endpoint, should give worse results?

Hi,
I am also trying to follow the same Medium article and I am facing this error: "ModuleNotFoundError: No module named 'HuggingFaceInferenceAPI'". How did you solve this?



A recent update to llama-index requires you to install and import it like this:

pip install llama-index-llms-huggingface

from llama_index.llms.huggingface import HuggingFaceInferenceAPI
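Putting it together, here is a minimal sketch of the new import path, assuming the namespaced llama-index packages introduced in the 0.10 restructuring; the token string is a placeholder, and the fallback branch simply reports that the subpackage is missing:

```python
# Hedged sketch: the class moved into the llama-index-llms-huggingface
# subpackage, so the bare name 'HuggingFaceInferenceAPI' is no longer
# importable on its own.
try:
    from llama_index.llms.huggingface import HuggingFaceInferenceAPI
except ImportError:
    # Not installed yet: run `pip install llama-index-llms-huggingface`
    HuggingFaceInferenceAPI = None

if HuggingFaceInferenceAPI is not None:
    # Placeholder model name and token for illustration only.
    llm = HuggingFaceInferenceAPI(
        model_name="HuggingFaceH4/zephyr-7b-beta",
        token="hf_...",  # your HF access token
    )
```

The try/except guard just makes the missing-package case explicit; once the subpackage is installed, the plain import in the answer above is all you need.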