Hi,
I would like to compute embeddings with a Llama-2 model using the HuggingFaceEmbedding class:
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="meta-llama/Llama-2-7b-chat-hf")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
But I get the following exception, which I don't know how to get past:
Using pad_token, but it is not set yet.
Traceback (most recent call last):
  File "/home/ttpuser/chatgpt/embeddings-test/test.py", line 4, in <module>
    embeddings = embed_model.get_text_embedding("Hello World!")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/llama_index/embeddings/base.py", line 185, in get_text_embedding
    text_embedding = self._get_text_embedding(text)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/llama_index/embeddings/huggingface.py", line 184, in _get_text_embedding
    return self._embed([text])[0]
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/llama_index/embeddings/huggingface.py", line 146, in _embed
    encoded_input = self._tokenizer(
                    ^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2602, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2688, in _call_one
    return self.batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2870, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ttpuser/.pyenv/versions/3.11.5/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2507, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
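From the error message, it sounds like the tokenizer simply needs a pad token set. Here is a minimal sketch of the workaround I have in mind, following the ValueError's own suggestion of reusing the EOS token (this assumes HuggingFaceEmbedding accepts a prebuilt tokenizer object, which I haven't verified against the llama_index API):

from transformers import AutoTokenizer
from llama_index.embeddings import HuggingFaceEmbedding

# Load the tokenizer directly and reuse EOS as the pad token,
# as the ValueError message itself suggests.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token

# Assumption: HuggingFaceEmbedding can take a prebuilt tokenizer
# via a tokenizer argument; I haven't confirmed this.
embed_model = HuggingFaceEmbedding(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer=tokenizer,
)
embeddings = embed_model.get_text_embedding("Hello World!")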
Is something like this the right way to work around the error, or is there a proper way to configure the pad token through HuggingFaceEmbedding?
Thanks a lot!
Ruben