I’m looking to use Hugging Face Inference for PROs with one of the Llama 2 models and one of the Llama 2 embedding models for a Retrieval-Augmented Generation (RAG) prototype.
- Llama 2 embeddings model - shalomma/llama-7b-embeddings · Hugging Face
- Llama 2 model - Riiid/sheep-duck-llama-2-70b-v1.1 · Hugging Face
My concerns about this approach are:
- Are the models above compatible with each other?
- Can Inference for PROs handle them, considering that the main model has 70B parameters?
- Is there a better embeddings model to use with Llama 2 LLMs? Or is running llama.cpp locally for embeddings a better approach?
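For context, the retrieval step I have in mind is a minimal sketch like the one below: embed the documents and the query, then rank documents by cosine similarity. The toy vectors here stand in for real embeddings (in the prototype they would come from the embeddings model, e.g. via `huggingface_hub.InferenceClient.feature_extraction`); the helper itself is model-agnostic.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=1):
    """Return indices of the k documents most similar to the query.

    Similarity is cosine similarity: normalize both sides, then
    take dot products and sort in descending order.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Toy stand-ins for embedding vectors; real ones would be much higher-dimensional.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(retrieve(query, docs, k=1))  # index of the nearest document
```

The retrieved passages would then be prepended to the prompt sent to the 70B generation model, so the embeddings model and the LLM only need to agree on the text, not on any shared vector space.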
Thanks & Regards,