Using different word embeddings from other sources and applying them to other HF models

Context: From what I’ve gathered, we usually transform our sentences into tokens, which are then turned into word embeddings and fed into the model (and this model has been pre-trained on a given corpus, e.g. BERT with its own word embeddings), following the usual schema: tokenizer → embeddings → pre-trained model → classification head.
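In code, I understand that flow to be roughly the following (bert-base-uncased and the sentence are just examples I picked, not something specific to my task):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

inputs = tokenizer("I loved this movie", return_tensors="pt")
outputs = model(**inputs)      # the model looks up its own embeddings internally
print(outputs.logits.shape)    # (1, 3)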

Now, what if we want to use different word embeddings coming from other sources? For example, I’m trying to do sentiment analysis with 3 labels (positive, neutral, negative) in several languages, and for that I may want to use “glove.840B.300d.txt” or the multilingual fastText vectors (such as these: Word vectors for 157 languages · fastText), which may not have an associated model on the HF hub.
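For the fastText case, I’d presumably load one of those .vec files with gensim, something along these lines (cc.fr.300.vec is just the French file from that page, picked as an example):

from gensim.models import KeyedVectors

# the 157-language fastText vectors are distributed as word2vec-format text files
ft_vectors = KeyedVectors.load_word2vec_format("cc.fr.300.vec", binary=False)
print(ft_vectors["bonjour"].shape)  # (300,)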

Without transformers, I’d do something like this to create an embedding matrix:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

# load the pre-trained word-embedding vectors
embeddings_index = {}
for line in open('glove.840B.300d.txt', encoding="utf8"):
    values = line.split()
    try:
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
    except ValueError:
        # skip malformed lines (e.g. tokens containing spaces)
        continue

# create a tokenizer; index 0 is reserved for padding and index 1 for
# out-of-vocabulary tokens, so OOV words show up as 1 in the ..._seq_x variables
token = Tokenizer(num_words=10000, oov_token="<OOV>")
# fit_on_texts builds the vocabulary; required before calling texts_to_sequences
token.fit_on_texts(df3[colToUse])
# word_index maps each word to its integer id
word_index = token.word_index

# convert text to sequences of token ids and pad them to equal length;
# each sub-list is one encoded row of the dataframe
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=248)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=248)

# create the token-embedding mapping: start with all zeros, then fill in
# the GloVe vector for every word we have one for
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

This would yield a padded training sequence like this:

array([[   0,    0,    0,    0,    ...,
           0,    0,    0,    0,   38,  352, 5391,    2,   56,  352,  316,
         343,  436,  195,  285,   30,   18]], dtype=int32)
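In that setup, the matrix would then go into a frozen Keras Embedding layer, roughly like this (the BiLSTM head is just an example, and train_y / valid_y would be my label arrays):

import tensorflow as tf
from tensorflow.keras import layers, models

clf = models.Sequential([
    # embedding layer initialised from the GloVe matrix and kept frozen
    layers.Embedding(len(word_index) + 1, 300,
                     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(3, activation="softmax"),  # positive / neutral / negative
])
clf.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# clf.fit(train_seq_x, train_y, validation_data=(valid_seq_x, valid_y), epochs=3)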

Is it possible, then, to pass different embedding matrices (coming from different word embeddings) to different models? For example, could I pass an embedding matrix built from the multilingual fastText vectors as input (possibly using AutoTokenizer??) and then use model = AutoModelForSequenceClassification.from_pretrained(tokenizer.tokenize(sequence)) with bert-base-uncased?
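To be more concrete, what I’m imagining is something along these lines, though I have no idea whether it is a reasonable thing to do (the fasttext_matrix below is a random placeholder just to show the shape that would be needed, and fastText/GloVe vectors are 300-d while BERT’s hidden size is 768, so the vectors would have to be projected or aligned somehow):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# hypothetical: one row per entry of the *tokenizer's* vocabulary
fasttext_matrix = torch.randn(tokenizer.vocab_size, model.config.hidden_size)

new_embeddings = torch.nn.Embedding.from_pretrained(fasttext_matrix, freeze=True)
model.set_input_embeddings(new_embeddings)

(I’m also aware that BERT’s tokenizer splits text into subwords rather than the whole words fastText/GloVe use, which is part of what confuses me.)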

Thank you in advance (I’m really new to all this)