How to get embeddings for sentences/documents using the Longformer model?

I am new to Hugging Face and have a few basic questions. This post might also help others who are starting to use the Longformer model from Hugging Face.

Objective:

Create sentence/document embeddings using the Longformer model. We don’t have labels in our dataset, so we want to cluster the generated embeddings. Please let me know whether the code below is correct.

Environment info

  • transformers version: 3.0.2
  • Platform:
  • Python version: Python 3.6.12 :: Anaconda, Inc.
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?): 2.3.0
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: parallel

Who can help

@patrickvonplaten

Information

Model I am using: Longformer, trained with LongformerForMaskedLM.

The problem arises when using:

  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] my own task or dataset: (give details below)

Code:

import pandas as pd
import torch
from transformers import LongformerModel, LongformerTokenizer
model = LongformerModel.from_pretrained('allenai/longformer-base-4096',output_hidden_states = True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in evaluation mode, which disables dropout for inference.
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)
# The news_article column is used to generate the embeddings.
all_content=list(df['news_article'])
def sentence_bert():
    list_of_emb = []
    for text in all_content:  # each item is one long input document
        print("length of string:  ", len(text.split()))
        input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)

        # How can a batch of documents be processed here?

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last tokens

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            hidden_states = outputs[2]
            token_embeddings = torch.stack(hidden_states, dim=0)
            # Remove dimension 1, the "batches".
            token_embeddings = torch.squeeze(token_embeddings, dim=1)
            # Swap dimensions 0 and 1.
            token_embeddings = token_embeddings.permute(1, 0, 2)

            token_vecs_sum = []
            # For each token in the sentence...
            for token in token_embeddings:
                # Sum the vectors from the last four layers.
                sum_vec = torch.sum(token[-4:], dim=0)
                # Use `sum_vec` to represent `token`.
                token_vecs_sum.append(sum_vec)

            # Sum the token vectors to get one vector for the document.
            h = 0
            for vec in token_vecs_sum:
                h += vec
            list_of_emb.append(h)

    return list_of_emb

f=sentence_bert()
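
For reference, the pooling step in the loop above (sum the last four layers, then sum over tokens) could probably be written without Python loops and over a whole batch at once. The following is only a sketch on dummy tensors, not a tested replacement. `pool_last_four` is a hypothetical helper name, and the small shapes stand in for the real model output. Unlike the loop above, it assumes padding positions are marked with 0 in the attention mask and leaves them out of the sum.

```python
import torch


def pool_last_four(hidden_states, attention_mask):
    """Sum the last four layers, then sum over non-padding tokens.

    hidden_states: tuple of tensors, each [batch, seq_len, hidden]
    attention_mask: [batch, seq_len]; 0 marks padding (assumption)
    returns: [batch, hidden]
    """
    stacked = torch.stack(hidden_states[-4:], dim=0)   # [4, batch, seq_len, hidden]
    per_token = stacked.sum(dim=0)                     # [batch, seq_len, hidden]
    keep = (attention_mask > 0).unsqueeze(-1).to(per_token.dtype)  # [batch, seq_len, 1]
    return (per_token * keep).sum(dim=1)               # [batch, hidden]


# Dummy stand-in for outputs[2]: 13 layers, batch of 2, 8 tokens, hidden size 16.
torch.manual_seed(0)
hidden_states = tuple(torch.randn(2, 8, 16) for _ in range(13))
# Second document is shorter, so its last three positions are padding (0).
attention_mask = torch.tensor([[2, 1, 1, 1, 1, 1, 1, 2],
                               [2, 1, 1, 1, 2, 0, 0, 0]])

doc_embeddings = pool_last_four(hidden_states, attention_mask)
print(doc_embeddings.shape)
```

Each row of `doc_embeddings` would then be one document vector that could be passed to a clustering algorithm.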

Doubts/Questions:

  1. If we want to get embeddings in batches, what changes do I need to make to the above code?
  2. If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens: [‘I’, ‘am’, ‘learning’, ‘longformer’, ‘model.’]? Is my understanding correct? Can you explain it with a minimal reproducible example?
  3. Similarly, will the attention mask return attention weights for those tokens? The part I didn’t understand: is it necessary to set the last attention mask value of the sentence to 2 (as in the above code)?
  4. #outputs[0] gives us sequence_output: torch.Size([768])
    #outputs[1] gives us pooled_output torch.Size([1, 512, 768])
    #outputs[2]: gives us Hidden_output: torch.Size([13, 512, 768])
    Can you say more about what each dimension in the outputs represents? For example, what does Hidden_output [13, 512, 768] mean?
    Where do 13, 512 and 768 come from, and what do they mean in terms of hidden states, embedding dimension and number of layers?
  5. From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example?
  6. If I am running the model on a Linux system, where does the pre-trained model get downloaded or stored? Can you list the complete path?
  7. length of string: 15
    input_ids: tensor([[ 0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884,
    20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056,
    1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694,
    1437, 7424, 4, 2]])

    input_ids.shape: torch.Size([1, 34])
    My sentence has 15 words, so why do input_ids and the attention mask have length 34?
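
For question 7, a quick tally of the ids printed above may help narrow things down. The sketch below uses only the ids from this post. Ids 0 and 2 are the `<s>` and `</s>` special tokens in the RoBERTa vocabulary that Longformer uses. Id 1437 appears once between every pair of words, which suggests it is some kind of whitespace token, but that is an assumption worth checking with `tokenizer.convert_ids_to_tokens`. The `padded_length` helper is hypothetical and assumes the model pads inputs up to a multiple of an attention window of 512, which would be consistent with the 512 seen in the output shapes.

```python
import math
from collections import Counter

# input_ids copied from the output above
input_ids = [0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884,
             20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056,
             1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694,
             1437, 7424, 4, 2]

counts = Counter(input_ids)
n_special = counts[0] + counts[2]   # <s> and </s>
n_sep = counts[1437]                # id repeated between words (assumed whitespace token)
n_other = len(input_ids) - n_special - n_sep

print(len(input_ids), n_special, n_sep, n_other)  # 34 2 14 18


def padded_length(seq_len, window=512):
    """Round seq_len up to the next multiple of the attention window (assumed 512)."""
    return math.ceil(seq_len / window) * window


print(padded_length(len(input_ids)))  # 512
```

So the 34 ids break down into 2 special tokens, 14 copies of id 1437 (one per gap between the 15 words), and 18 subword pieces, since some words are split into more than one piece.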

Expected behavior

Document1: Embeddings
Document2: Embeddings