I am new to Hugging Face and have a few basic questions. This post might also help others who are starting to use the Longformer model from Hugging Face.
I want to create sentence/document embeddings using the Longformer model. We don't have labels in our dataset, so we want to cluster the generated embeddings. Could you please tell me whether the code below is correct?
- Python version: Python 3.6.12 :: Anaconda, Inc.
- PyTorch version (GPU?): 1.7.1
- Tensorflow version (GPU?): 2.3.0
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: parallel
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- benchmarks: @patrickvonplaten
- text generation: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger
Model I am using: Longformer, trained with LongformerForMaskedLM.
The problem arises when using:
- [ ] my own modified scripts: (give details below)
The tasks I am working on is:
- [ ] my own task or dataset: (give details below)

    import pandas as pd
    import torch
    from transformers import LongformerModel, LongformerTokenizer

    model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
    tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

    # Put the model in "evaluation" mode, meaning feed-forward operation.
    model.eval()

    df = pd.read_csv("inshort_news_data-1.csv")
    df.head(5)

    # The **news_article** column is used to generate the embeddings.
    all_content = list(df['news_article'])

    def sentence_bert():
        list_of_emb = []
        for i in range(len(all_content)):
            SAMPLE_TEXT = all_content[i]  # long input document
            print("length of string: ", len(SAMPLE_TEXT.split()))
            input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
            # How to include batch of size here?
            # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
            attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
            attention_mask[:, [0, -1]] = 2
            with torch.no_grad():
                outputs = model(input_ids, attention_mask=attention_mask)
                hidden_states = outputs[2]  # tuple with the hidden states of every layer
            token_embeddings = torch.stack(hidden_states, dim=0)
            # Remove dimension 1, the "batches".
            token_embeddings = torch.squeeze(token_embeddings, dim=1)
            # Swap dimensions 0 and 1.
            token_embeddings = token_embeddings.permute(1, 0, 2)
            token_vecs_sum = []
            # For each token in the sentence...
            for token in token_embeddings:
                # Sum the vectors from the last four layers.
                sum_vec = torch.sum(token[-4:], dim=0)
                # Use `sum_vec` to represent `token`.
                token_vecs_sum.append(sum_vec)
            # Add up all token vectors to get one vector per document.
            h = 0
            for j in range(len(token_vecs_sum)):
                h += token_vecs_sum[j]
            list_of_emb.append(h)
        return list_of_emb

    f = sentence_bert()

- If we want to get the embeddings in batches, what changes do I need to make to the code above?
- If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens in the Longformer model: ['I', 'am', 'learning', 'longformer', 'model.']? Is my understanding correct? Can you explain it with a minimal reproducible example?
- Similarly, will the attention mask return the attention weights of those tokens? The part I didn't understand is whether it is necessary to replace the last attention weight of the sentence with 2 (as in the code above).
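On the batching question, the bookkeeping that a batch needs can be sketched in plain Python: split the documents into chunks, pad every sequence in a chunk to the longest one, and mark the padded positions with 0 in the attention mask. The names `chunks` and `pad_batch` are hypothetical helpers written for this illustration, and `PAD_ID = 1` assumes a RoBERTa-style vocabulary. As far as I understand, the tokenizer can do the padding itself when called as `tokenizer(texts, padding=True, return_tensors="pt")`, so this is only meant to show what changes compared with the one-document-at-a-time loop.

```python
PAD_ID = 1  # assumption: id of the <pad> token in a RoBERTa-style vocabulary


def chunks(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to a common length and build the matching mask.

    Mask values follow the convention in the post: 1 for real tokens
    (local attention), 0 for padding (no attention).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids standing in for three tokenized documents.
docs = [[0, 100, 101, 2], [0, 200, 2], [0, 300, 301, 302, 2]]
for batch in chunks(docs, batch_size=2):
    ids, mask = pad_batch(batch)
    print(ids, mask)
```

One consequence for the pooling code: once padding is involved, the sum over tokens should skip the positions where the mask is 0, otherwise padded positions leak into the document vector.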
# outputs[0] gives us sequence_output: torch.Size()
# outputs[1] gives us pooled_output: torch.Size([1, 512, 768])
# outputs[2] gives us Hidden_output: torch.Size([13, 512, 768])
Can you explain what each dimension of the outputs represents? For example, what does Hidden_output [13, 512, 768] mean? Where do 13, 512 and 768 come from, and what do they mean in terms of hidden states, embedding dimension and number of layers?
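For reference, this is how the pooling code in this post walks those three axes, written as a toy in plain Python. It assumes the first axis holds 13 stacked hidden states (the embedding output plus 12 encoder layers), the second holds token positions (possibly including padding, which I have not verified), and the third holds the features of each token. The sizes 6 and 4 are stand-ins for 512 and 768 so the result can be checked by hand.

```python
NUM_STATES = 13  # assumption: embedding output + 12 encoder layers
SEQ_LEN = 6      # toy stand-in for 512 token positions
HIDDEN = 4       # toy stand-in for 768 features per token

# hidden_states[layer][token][feature]; every value equals its layer index
# so the sums below are easy to verify.
hidden_states = [
    [[float(layer)] * HIDDEN for _ in range(SEQ_LEN)]
    for layer in range(NUM_STATES)
]


def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


# Per token: add up that token's vectors from the last four layers
# (the `torch.sum(token[-4:], dim=0)` step).
token_vecs = [
    sum_vectors([hidden_states[layer][tok]
                 for layer in range(NUM_STATES - 4, NUM_STATES)])
    for tok in range(SEQ_LEN)
]

# Per document: add up all token vectors (the loop over `token_vecs_sum`).
doc_vec = sum_vectors(token_vecs)

print(len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))  # 13 6 4
print(len(token_vecs), len(token_vecs[0]))  # 6 4
print(doc_vec)  # [252.0, 252.0, 252.0, 252.0]
```

Each entry is 252.0 because layers 9 to 12 sum to 42 and there are 6 tokens. With the real sizes the document vector would have 768 entries.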
- From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example?
- If I am running the model on a Linux system, where does the pre-trained model get downloaded or stored? Can you list the complete path?
- length of string: 15
input_ids: tensor([[ 0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884,
20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056,
1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694,
1437, 7424, 4, 2]])
input_ids.shape: torch.Size([1, 34])
My sentence has 15 words, so why do input_ids and the attention mask have length 34?
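One way to narrow this down is to tally the printed ids. The sketch below is plain Python over the numbers pasted above. It assumes that ids 0 and 2 are the start and end special tokens (as in RoBERTa-style vocabularies) and treats the recurring id 1437 as a separator between words. Neither assumption is confirmed in this post, and `tally` is a hypothetical helper.

```python
# Token ids copied from the printed `input_ids` above.
input_ids = [
    0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884,
    20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056,
    1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694,
    1437, 7424, 4, 2,
]

SPECIAL_IDS = {0, 2}  # assumption: <s> and </s>
SEPARATOR_ID = 1437   # assumption: the id that recurs between words


def tally(ids):
    """Count special tokens, separators, and the groups of ids between separators."""
    groups, current = [], []
    for token_id in ids:
        if token_id in SPECIAL_IDS:
            continue
        if token_id == SEPARATOR_ID:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(token_id)
    if current:
        groups.append(current)
    return {
        "total": len(ids),
        "special": sum(1 for t in ids if t in SPECIAL_IDS),
        "separators": sum(1 for t in ids if t == SEPARATOR_ID),
        "word_groups": len(groups),
        "pieces": sum(len(g) for g in groups),
    }


print(tally(input_ids))
# {'total': 34, 'special': 2, 'separators': 14, 'word_groups': 15, 'pieces': 18}
```

Under those assumptions the 34 ids break down into 2 special tokens, 14 separators and 18 subword pieces spread over 15 groups, which matches the 15 words reported by `split()`. So the count from `split()` and the length of `input_ids` measure different things: words versus tokenizer output.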