Hi
I have successfully followed this tutorial: https://medium.aiplanet.com/implement-rag-with-knowledge-graph-and-llama-index-6a3370e93cdd and used it to read a 2-page PDF. When I submit queries, the results are very good (well, as good as I need).
To speed things up enough to handle a larger document, I created an inference endpoint based on HuggingFaceH4/zephyr-7b-beta and access it via:
```python
# import path for llama-index >= 0.10; older versions use
# `from llama_index.llms import HuggingFaceInferenceAPI`
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

llm = HuggingFaceInferenceAPI(
    model_name="https://my_endpoint_ref.aws.endpoints.huggingface.cloud",
    token=hf_token,
)
```
If I now provide a 20-page PDF, the same query gives poor results: the response often says there is no relevant data in the document, even though the 2 pages I used originally are contained in the larger document.
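One thing I have been wondering about is whether the larger document simply blows the prompt budget, so that retrieved context gets truncated before it reaches the model. This is just a rough back-of-the-envelope check, not llama-index's actual internals, and all the numbers (context window, chunk size, top-k) are illustrative assumptions rather than zephyr-7b-beta's real configuration:

```python
# Hypothetical sanity check: if the retriever pulls more chunk text than
# the prompt budget allows, the context is truncated and the model may
# answer as if there were no relevant data. All numbers are assumptions.

def fits_in_context(num_chunks, chunk_size_tokens, context_window=3900,
                    num_output=256, prompt_overhead=200):
    """Return True if the retrieved chunks plus the reserved output
    tokens fit inside the model's context window."""
    prompt_tokens = num_chunks * chunk_size_tokens + prompt_overhead
    return prompt_tokens + num_output <= context_window

# 2-page PDF: a handful of retrieved chunks fit comfortably.
print(fits_in_context(num_chunks=3, chunk_size_tokens=512))   # True
# 20-page PDF with more/larger retrieved chunks: the budget is exceeded.
print(fits_in_context(num_chunks=10, chunk_size_tokens=512))  # False
```

If that is the mechanism, the local setup and the endpoint may just be configured with different defaults (context window, max new tokens), which could explain the difference despite the model being the same.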
I have tried using locally hosted LLMs but as you can imagine they are too slow on my machine.
Can anyone give me a clue as to why the same model, served from a paid endpoint, gives worse results?
Thanks.