Embeddings via API fundamental doubts

Hi community,

I’m new here and have a few questions about getting embeddings via the HF Inference API (and other methods), plus some architecture-level questions. Absolutely any help is truly appreciated; I’ve been learning a ton already, so thanks in advance.

I fine-tuned a model with QLoRA and hosted it on HF (here), and I need to get embeddings from it.

  • Firstly, GPT-4 told me this: "we often use the last layer's outputs when we're looking for rich, contextual embeddings." A similar blog post (here) says much the same:

    "The BERT base model uses 12 layers of transformer encoders as discussed, and each output per token from each layer of these can be used as a word embedding! Perhaps you wonder which is the best, though?"

    Is that factually correct? I want to rule out hallucinations.
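For anyone with the same question: with the `transformers` library you can inspect every layer's output yourself by passing `output_hidden_states=True`. A minimal sketch below uses a tiny randomly initialized BERT config so it runs without downloading weights; swap in `AutoModel.from_pretrained("bert-base-uncased")` to get real embeddings (the tiny sizes here are my own illustrative choices, not real model dimensions):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized BERT, just to show the layer structure.
config = BertConfig(hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=64,
                    vocab_size=100)
model = BertModel(config)
model.eval()

input_ids = torch.tensor([[1, 5, 7, 9]])  # 4 dummy token ids
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states has num_hidden_layers + 1 entries:
# index 0 is the input embedding layer, index -1 is the last encoder layer.
print(len(out.hidden_states))       # 5
print(out.hidden_states[-1].shape)  # torch.Size([1, 4, 32])
```

Each entry is one per-token vector per layer, which is exactly the "every layer's output can be used as an embedding" claim from the quote.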

  • Now, assuming the above is correct, which layer's embeddings does the HF Inference API return? HF's docs show Inference API usage like this:

import requests

model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "get your token in http://hf.co/settings/tokens"

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

def query(texts):
    # wait_for_model=True blocks until the model is loaded instead of erroring
    response = requests.post(
        api_url,
        headers=headers,
        json={"inputs": texts, "options": {"wait_for_model": True}},
    )
    return response.json()

But with my 7B model this never finished loading in Colab: it never ran out of memory, it just kept running. Note: I haven't tried it on a rented GPU yet (maybe that would work).
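As I understand it, for sentence-transformers models the feature-extraction pipeline returns one pooled vector per input, typically mean pooling over the last hidden state with the attention mask applied. That pooling step is easy to reproduce locally, which can be a fallback when the API won't load a model. A self-contained sketch with dummy tensors (the shapes and values are illustrative, not from a real model):

```python
import torch

def mean_pool(last_hidden, attention_mask):
    # Average the token vectors, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Dummy batch: 1 sequence, 3 tokens (the last one is padding), hidden dim 4.
hidden = torch.tensor([[[1., 1., 1., 1.],
                        [3., 3., 3., 3.],
                        [9., 9., 9., 9.]]])
mask = torch.tensor([[1, 1, 0]])

print(mean_pool(hidden, mask))  # tensor([[2., 2., 2., 2.]])
```

The padded token's 9s are masked out, so only the first two tokens are averaged. In practice you would feed `model(**inputs).last_hidden_state` and `inputs["attention_mask"]` into this function.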

  • Also, I couldn't find any way to get embeddings from a quantised model (TheBloke's GGML/GGUF) through the Inference API. Please point me to one if it exists.

  • On a side note, I also tried generating embeddings from a quantised model with llama.cpp, using TheBloke/Llama-2-7B-GGUF and the embedding binary (just like running inference with main), and it works. But since the model is quantised to, say, 4 or 8 bits, I assume the embeddings are also less precise and won't exactly match the full-precision ones, right?
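On that precision point: quantised weights do change the activations, so embeddings from a 4- or 8-bit model will drift from the full-precision ones, but they usually stay very close in cosine similarity. A toy numpy illustration of why small rounding noise barely moves a vector's direction (this just simulates round-to-nearest noise on one vector; it is not how GGUF block quantisation actually works):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=768).astype(np.float32)  # pretend fp32 embedding

# Crude stand-in for 4-bit round-to-nearest quantisation error:
# snap each component to one of ~15 evenly spaced levels.
scale = np.abs(v).max() / 7
v_q = np.round(v / scale) * scale

cos = float(v @ v_q / (np.linalg.norm(v) * np.linalg.norm(v_q)))
print(cos)  # close to 1.0, but not exactly 1.0
```

So "less precise, not exact" is the right intuition; for most retrieval/similarity uses the rankings tend to survive, though it's worth spot-checking against the full-precision model for your task.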

Thanks for bearing with my beginner questions, and again, any help is truly appreciated. Thanks :smiley: