Hello,
I recently installed transformers version 4.8.0 and I am trying to retrieve the first hidden state of each sentence in my dataset. Somehow I am encountering the error below and I cannot figure out how to solve it. I am not very familiar with torch tensors, but from the stack trace I think that is where the error comes from. My original corpus consists of 50,000 sentences in a dataframe, but to confirm the function runs, I tested it on a small sample of 35 sentences (indices 0 to 34).
Here is the code, along with a sample of my data:
print(texte)
0 si j’ai un problème, comment je remonte l’info...
1 des agents de maintenance ? Oui, oui. Enfin… I...
2 Il faudrait des tiroirs qui sortent / rentrent...
3 ROI, 5 à 10 ans. Si l’énergie explose, ça devi...
4 Je ne vois pas cela en conception de cuisine, ...
import torch
from transformers import FlaubertModel, FlaubertTokenizer

path_to_lge = "flaubert/flaubert_small_cased"
flaubert = FlaubertModel.from_pretrained(path_to_lge)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path_to_lge, do_lowercase=False)

input_ids = []
attention_masks = []
for sent in texte:  # texte is a pandas Series of sentences
    encoded_sent = flaubert_tokenizer.encode_plus(sent, add_special_tokens=True, truncation=True, padding=True, return_attention_mask=True)
    # Add the outputs to the lists
    input_ids.append(encoded_sent.get('input_ids'))
    attention_masks.append(encoded_sent.get('attention_mask'))

# Convert lists to tensors
print("len", len(input_ids))
input_ids = torch.tensor(input_ids)
attention_mask = torch.tensor(attention_masks)

hidden_state = flaubert(input_ids=input_ids, attention_mask=attention_mask)
# Extract the last hidden state of the `[CLS]` token for the classification task
last_hidden_state_cls = hidden_state[0][:, 0, :]
print(last_hidden_state_cls)
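If I understand the error correctly, each `encode_plus` call pads only within a single sentence, so the collected `input_ids` lists end up with different lengths and `torch.tensor` cannot stack them. Here is a minimal sketch of the padding I believe is missing, with made-up token ids and a placeholder `pad_id` (the real value would come from the tokenizer):

```python
# Stand-ins for two tokenized sentences of different lengths
input_ids = [[0, 10, 11, 1], [0, 12, 1]]
attention_masks = [[1, 1, 1, 1], [1, 1, 1]]

pad_id = 2  # placeholder value; the real pad id comes from the tokenizer
max_len = max(len(ids) for ids in input_ids)

# Right-pad every sequence (and its mask) to the longest length in the batch
padded_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in input_ids]
padded_masks = [m + [0] * (max_len - len(m)) for m in attention_masks]

print(padded_ids)    # [[0, 10, 11, 1], [0, 12, 1, 2]]
print(padded_masks)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Is this the kind of padding I should be doing before calling `torch.tensor`, or is there a tokenizer option that applies it across the whole batch?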
Stack trace:
---Filename in processed................ corpus_ix_originel_FMC_train
etiquette : [2 1 0]
Embeddings bert model used.................... : small_cased
Some weights of the model checkpoint at flaubert/flaubert_small_cased were not used when initializing FlaubertModel: ['pred_layer.proj.weight', 'pred_layer.proj.bias']
- This IS expected if you are initializing FlaubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FlaubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
<class 'numpy.ndarray'>
len = 34 sentences (from 0 to 34)
Traceback (most recent call last):
File "/16uw/test/expe_5/train/test.py", line 63, in <module>
main()
File "/16uw/test/expe_5/train/test.py", line 46, in main
dic_acc, dic_report, dic_cm, s = cross_validation(data_train, data_label_train, models_list, name, language_model_dir)
File "/16uw/test/expe_5/train/../traitements/processin_test.py", line 197, in cross_validation
features, s = get_flaubert_layer(features, lge_model)
File "16uw/test/expe_5/train/../traitements/processin_test.py", line 107, in get_flaubert_layer
input_ids = torch.tensor(input_ids)
ValueError: expected sequence of length 133 at dim 1 (got 80)
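To check my reading of the error, here is a minimal reproduction with two made-up lists of unequal length (only torch is needed):

```python
import torch

# Two "tokenized sentences" of different lengths, like my input_ids lists
ragged = [[1, 2, 3], [4, 5]]
try:
    torch.tensor(ragged)
except ValueError as e:
    print(e)  # expected sequence of length 3 at dim 1 (got 2)

# Once the inner lists share a length, the conversion works
padded = [[1, 2, 3], [4, 5, 0]]
print(torch.tensor(padded).shape)  # torch.Size([2, 3])
```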
I hope these details help you understand my problem.