Getting text embeddings from a Falcon model

I am currently using a Falcon model (falcon-7b-instruct) and its performance is quite satisfactory. My question is: can this model somehow be used to create embeddings of a text document, the way sentence-transformers or OpenAI's text-embedding-ada do?
Or is this model purely for text generation, meaning it cannot be used for text embedding purposes?

Thanks in advance


I am also looking into this. I tried the Inference API method and also passed the input tokens directly; both failed.

I was more interested in inspecting the embeddings themselves, but I expect they will be far too large to even visualize in Colab.
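
For what it's worth, you can check how big each embedding vector would be from the config alone, without downloading the weights. A quick sketch; the 4544 value is what I believe falcon-7b's hidden size to be, so double-check it:

from transformers import AutoConfig

# Only the config is fetched here, not the weights, so this is cheap
# to run even on a small Colab instance.
config = AutoConfig.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
print(config.hidden_size)  # should print 4544 for falcon-7b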

Why did it fail? A memory problem or a function problem?

Same issue here. I think it is a function/interface issue. I am new to the library, so it is possible I'm making a trivial mistake, but embedding extraction does not seem to work using a "vanilla" approach.


import torch
import transformers

model_name_falcon = "tiiuae/falcon-7b"

tokenizer_falcon = transformers.AutoTokenizer.from_pretrained(
    model_name_falcon,
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)

# Falcon's tokenizer ships without a pad token, so reuse eos for padding
tokenizer_falcon.pad_token = tokenizer_falcon.eos_token

model_falcon = transformers.pipeline(
    "feature-extraction",
    model=model_name_falcon,
    tokenizer=tokenizer_falcon,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

model_falcon('hallo!')  # this call is what crashes

It just crashes.

I am a bit unsure here, but the issue may be either the Falcon tokenizer's pad/eos confusion or, worse, a compatibility problem between Falcon and the feature-extraction pipeline. As far as I know, Falcon does not output embeddings directly and was not trained as a sentence transformer. A workaround I am trying now is to follow the sequence-classification setup and take the hidden state at the eos token, instead of passing it on to the dense classifier layers.
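
Here is a minimal sketch of what I mean. It is untested, and it assumes the custom Falcon modeling code accepts output_hidden_states like standard transformers models do:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] is the last decoder layer: (batch, seq_len, hidden_size).
    # Take the final token's vector as the sequence embedding: with causal
    # attention, the last token is the only one that has attended to the
    # whole input.
    return outputs.hidden_states[-1][:, -1, :].squeeze(0)

print(embed("hello!").shape)  # torch.Size([4544]) if falcon-7b's hidden size is 4544

Keep in mind these vectors come from a model trained for next-token prediction, not for semantic similarity, so they will likely behave worse than a real sentence-transformer for retrieval-style tasks.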