Hello,
I am following this tutorial, Fine-tuning a pretrained model — transformers 4.7.0 documentation, in order to use FlauBERT to produce embeddings to train my classifier. In one of the lines I have to set my dataset to PyTorch tensors, but after applying that line I get a list, which I do not understand. When printing a single element of the dataset I get a tensor, but when I try to pass the "input_ids" column to the model, it is actually a list, so the model cannot process the data. Could you help me figure out why I get a list and not PyTorch tensors when using 'set_format' with type 'torch'?
def get_flaubert_layer(texte, path_to_lge):  # last version
    tokenized_dataset, lge_size = preprocessed_with_flaubert(texte, path_to_lge)
    print("Set data to torch format...")
    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
    print("Format of data after -set.format- :", type(tokenized_dataset))
    print("Format of input_ids after -set.format- :", type(tokenized_dataset['input_ids']))
    print("Format of one element in the dataset", type(tokenized_dataset['input_ids'][0]))
    print('Loading model...')
    flaubert = FlaubertModel.from_pretrained(path_to_lge)
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
                            attention_mask=tokenized_dataset['attention_mask'])
    cls_embedding = hidden_state[0][:, 0]
    print(cls_embedding)
    return cls_embedding, lge_size
# test with data
path = '/gpfswork/rech/kpf/umg16uw/expe_5/model/sm'
print("Load model...")
flaubert = FlaubertModel.from_pretrained(path)
emb, s = get_flaubert_layer(data1, path)
Stack trace and results:
0%| | 0/4 [00:00<?, ?ba/s]
25%|██▌ | 1/4 [00:00<00:01, 1.63ba/s]
50%|█████ | 2/4 [00:01<00:01, 1.72ba/s]
75%|███████▌ | 3/4 [00:01<00:00, 1.81ba/s]
100%|██████████| 4/4 [00:01<00:00, 2.41ba/s]
Traceback (most recent call last):
File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/remove_noise.py", line 130, in <module>
emb, s = get_flaubert_layer(data1, path)
File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/functions_for_processing.py", line 206, in get_flaubert_layer
hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
bs, slen = input_ids.size()
AttributeError: 'list' object has no attribute 'size'
srun: error: r11i0n6: task 0: Exited with exit code 1
srun: Terminating job step 381611.0
real 1m37.690s
user 0m0.013s
Format of the data passed to the FlauBERT model:
Loading tokenizer...
Transform data to format Dataset...
Set data to torch format...
Format of data after -set.format- : <class 'datasets.arrow_dataset.Dataset'>
Format of input_ids after -set.format- : <class 'list'>
Format of one element in the dataset <class 'torch.Tensor'>
As you can see, the column input_ids is a list and not a tensor.
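To illustrate what I would expect, here is a minimal sketch in plain PyTorch with made-up token ids (the values are hypothetical, not from my dataset). The model's forward pass calls input_ids.size(), so it needs one 2-D tensor; a Python list of 1-D tensors, which is what the column access gives me, would first have to be stacked, and that only works if every row has the same length:

```python
import torch

# Hypothetical per-row tensors, like the ones column access returns:
rows = [torch.tensor([5, 6, 7]), torch.tensor([8, 9, 10])]

# A list has no .size(), which matches the AttributeError in the trace:
print(hasattr(rows, 'size'))  # False

# Stacking into a single (batch, seq_len) tensor works only when all
# rows were padded to the same length during tokenization:
batch = torch.stack(rows)
print(type(batch), batch.shape)  # one 2-D tensor of shape (2, 3)
```

So is the problem that my sequences are not padded to a fixed length, which would explain why set_format leaves the column as a list?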