I am new to Hugging Face and have a few basic queries. This post might also be helpful to others who are starting to use the Longformer model from Hugging Face.
Objective:
Create sentence/document embeddings using the Longformer model. We don't have labels in our dataset, so we want to cluster the generated embeddings. Please let me know if the code is correct.
Environment info
- transformers version: 3.0.2
- Platform:
- Python version: Python 3.6.12 :: Anaconda, Inc.
- PyTorch version (GPU?):1.7.1
- Tensorflow version (GPU?): 2.3.0
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: parallel
Who can help
Models:
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
Library:
- benchmarks: @patrickvonplaten
- text generation: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger
Information
Model I am using: Longformer (pre-trained via LongformerForMaskedLM):
The problem arises when using:
- [ ] my own modified scripts: (give details below)
The tasks I am working on is:
- [ ] my own task or dataset: (give details below)
Code:
import pandas as pd
import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode (feed-forward operation only, dropout disabled).
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)

# The **news_article** column is used to generate the embeddings.
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        print("length of string: ", len(SAMPLE_TEXT.split()))
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
        # How do I include a batch of size > 1 here?

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last token

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            hidden_states = outputs[2]

        # Stack the hidden-state tensors of all layers into one tensor.
        token_embeddings = torch.stack(hidden_states, dim=0)
        # Remove dimension 1, the "batches".
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        # Swap dimensions 0 and 1 so we iterate over tokens.
        token_embeddings = token_embeddings.permute(1, 0, 2)

        token_vecs_sum = []
        # For each token in the sentence...
        for token in token_embeddings:
            # Sum the vectors from the last four layers.
            sum_vec = torch.sum(token[-4:], dim=0)
            # Use `sum_vec` to represent `token`.
            token_vecs_sum.append(sum_vec)

        # Sum the token vectors to get a single document embedding.
        h = 0
        for i in range(len(token_vecs_sum)):
            h += token_vecs_sum[i]
        list_of_emb.append(h)
    return list_of_emb

f = sentence_bert()
Doubts/Questions:
- If we want to get embeddings in batches, what changes do I need to make to the above code? (A rough batching sketch of what I have in mind is after this list.)
- If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens: ['I', 'am', 'learning', 'longformer', 'model.']? Is my understanding correct? Can you explain it with a minimal reproducible example? (The tokenizer check I ran myself is after this list.)
- Similarly, will the attention mask hold the attention values for those tokens? The part I didn't understand: is it necessary to set the attention value of the last token of the sentence to 2 (as done in the code above)?
- The outputs I observe are:
  # outputs[0] gives us sequence_output: torch.Size([768])
  # outputs[1] gives us pooled_output:   torch.Size([1, 512, 768])
  # outputs[2] gives us hidden_output:   torch.Size([13, 512, 768])
  Can you talk more about what each dimension of these outputs depicts? For example, what does hidden_output [13, 512, 768] mean? Where do 13, 512 and 768 come from, and what do they represent in terms of hidden states, embedding dimension and number of layers? (The shape-printing sketch I used is after this list.)
- From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example?
- If I am running the model on a Linux system, where does the pre-trained model get downloaded/stored? Can you list the complete path? (The cache-directory snippet I tried is after this list.)
- For one of my documents I get the following output:
  length of string: 15
  input_ids: tensor([[    0, 35702,  1437,  3743,  1437,   560,  1437, 48317,  1437, 28884,
                      20042,  1437,  6968,   241,  1437, 16402,  1437,   463,  1437,  3056,
                       1437, 48317,  1437,   281,  1437, 16752,  1437,   281,  1437,  1694,
                       1437,  7424,     4,     2]])
  input_ids.shape: torch.Size([1, 34])
  My sentence length is 15 words, so why are input_ids and attention_mask of length 34? (The tokenizer check after this list is how I looked at this.)
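Batching sketch (for the first question above). This is only a rough idea of what I mean, not working code from my script: I am assuming the tokenizer can be called directly with padding/truncation in 3.0.2, and the batch_size value, the slicing of the padded output, and the mean-pooling step are my own guesses rather than anything from the code above.

batch_size = 8  # hypothetical value, not from my script

def embed_in_batches(texts, tokenizer, model):
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad every document in the batch to the same length.
        encoded = tokenizer(batch, padding=True, truncation=True,
                            max_length=4096, return_tensors='pt')
        input_ids = encoded['input_ids']
        attention_mask = encoded['attention_mask']  # 1 = local attention, 0 = padding
        attention_mask[:, 0] = 2                    # global attention on the first (<s>) token

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)

        last_hidden = outputs[0]
        # The output length can be larger than my input (padded internally to a multiple
        # of the 512 attention window?), so keep only the positions I actually fed in.
        last_hidden = last_hidden[:, :attention_mask.size(1), :]
        # Mean-pool over the non-padding tokens to get one vector per document
        # (instead of my per-token sum loop above -- is that okay?).
        mask = (attention_mask > 0).unsqueeze(-1).float()
        doc_emb = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # [batch, 768]
        all_embeddings.extend(doc_emb)
    return all_embeddings

Would something like embed_in_batches(all_content, tokenizer, model) be the right direction, or should I keep the per-token sum from my loop?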
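Tokenizer check (for the tokenization and length-34 questions). This is the snippet I used, run after the code above; my guess, which I would like confirmed, is that the extra length comes from byte-level BPE sub-word pieces plus the special <s> and </s> tokens.

text = "I am learning longformer model."
print("words:", len(text.split()))                   # 5 whitespace-separated words

tokens = tokenizer.tokenize(text)                    # sub-word pieces, not whole words
print("tokens:", tokens)

ids = tokenizer.encode(text)                         # also adds the special tokens
print("ids:", ids)
print("back to tokens:", tokenizer.convert_ids_to_tokens(ids))
print("number of ids:", len(ids))                    # usually > number of words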
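Shape check (for the outputs question). A minimal sketch of how I printed the shapes on a single document, run after the code above; the interpretations in the comments are my guesses, not something I have confirmed.

input_ids = torch.tensor(tokenizer.encode(all_content[0])).unsqueeze(0)
attention_mask = torch.ones(input_ids.shape, dtype=torch.long)
attention_mask[:, [0, -1]] = 2

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)

print(outputs[0].shape)              # sequence output?
print(outputs[1].shape)              # pooled output?

hidden_states = outputs[2]
print(len(hidden_states))            # 13 -- embedding layer + 12 encoder layers?
print(hidden_states[-1].shape)       # [1, 512, 768] -- batch size, sequence length
                                     # (padded to the 512 attention window?), hidden size?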
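Cache location (for the Linux question). I tried printing the default cache directory like this; I am assuming TRANSFORMERS_CACHE in transformers.file_utils is still the right constant to look at in 3.0.2.

from transformers.file_utils import TRANSFORMERS_CACHE
print(TRANSFORMERS_CACHE)   # directory where the pre-trained weights should be cached?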
Expected behavior
Document1: Embeddings
Document2: Embeddings