Hi community,
I’m new here and I have some questions about getting embeddings via the HF Inference API and other methods, plus a few architecture-level doubts. Any help is truly appreciated; I have been learning a ton already, so thanks in advance.
I fine-tuned a model with QLoRA and hosted it on HF (here), and I need to get embeddings from it.
-
Firstly, GPT-4 told me this:
"we often use the last layer's outputs when we're looking for rich, contextual embeddings."
and a similar blog post (here) says something similar:
"The BERT base model uses 12 layers of transformer encoders as discussed, and each output per token from each layer of these can be used as a word embedding! Perhaps you wonder which is the best, though?"
Is this factually correct? I want to rule out hallucination.
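To check my understanding of what "the last layer's outputs" means in practice, here is a minimal sketch of how I think the last hidden state would be pulled out and mean-pooled into a sentence embedding (the model name is just an example, and the mean pooling is my assumption):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer(["An example sentence"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden_dim): the last layer's per-token outputs
token_embeddings = outputs.last_hidden_state

# mean pooling over non-padding tokens gives one vector per input sentence
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (1, hidden_dim)
-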
Now, assuming the above is correct, which layer's embeddings does the HF Inference API return? HF tells me I can use the Inference API like so:
import requests
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "get your token in http://hf.co/settings/tokens"
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}
def query(texts):
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    return response.json()
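If I understand the feature-extraction pipeline correctly, calling it should return one embedding (a list of floats) per input text, along these lines:

texts = ["How do I get embeddings?", "Sentence embeddings via the Inference API"]
embeddings = query(texts)
print(len(embeddings), len(embeddings[0]))  # number of texts, embedding dimension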
But trying this with my 7B model in Colab didn't work: the model never finished loading (it never ran out of memory, it just kept running). Note: I haven't tried on a rented GPU yet (maybe that works).
-
Also, I found no way to use a quantised model (TheBloke’s GGML/GGUF) to get embeddings from the Inference API (please point me to one if it exists).
-
On a side note, I also tried generating embeddings with a quantised model via llama.cpp: using TheBloke/llama-2-7b-GGUF, the `embedding` command works just like running inference with `main`. But I suppose that since the model is quantised to, say, 4 or 8 bits, the embeddings will also be less precise and won't exactly match the full-precision ones, right?
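In case it matters, here is the equivalent via the llama-cpp-python bindings, which I assume does the same thing as the `embedding` binary (the GGUF filename below is just a placeholder for a file from TheBloke/Llama-2-7B-GGUF):

# pip install llama-cpp-python
from llama_cpp import Llama

# load the quantised GGUF model with embedding mode enabled
llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", embedding=True)

result = llm.create_embedding("An example sentence")
embedding = result["data"][0]["embedding"]  # list of floats
print(len(embedding))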
Thanks for bearing with my questions and, again, any help is truly appreciated.