Setting the dataset format to PyTorch tensors produces a list, so the model cannot process the data

Hello,

I am following the tutorial "Fine-tuning a pretrained model" (transformers 4.7.0 documentation) in order to use FlauBERT to produce embeddings for training my classifier. At one step I have to set my dataset format to PyTorch tensors, but after applying that line I get a list, which I do not understand. When I print a single element of the dataset I get a tensor, but when I pass "input_ids" to the model it is actually a list, so the model cannot process the data. Could you help me figure out why I get a list and not a PyTorch tensor when using set_format with type='torch'?

def get_flaubert_layer(texte, path_to_lge): # last version

    tokenized_dataset, lge_size = preprocessed_with_flaubert(texte, path_to_lge)
    print("Set data to torch format...")
    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
    print("Format of data after -set.format- :", type(tokenized_dataset))
    print("Format of input_ids after -set.format- :", type(tokenized_dataset['input_ids']))
    print("Format of one element in the dataset", type(tokenized_dataset['input_ids'][0]))
    print(type(tokenized_dataset))
    print('Loading model...')
    flaubert = FlaubertModel.from_pretrained(path_to_lge)
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
                            attention_mask=tokenized_dataset['attention_mask'])
    print(hidden_state[0][:, 0])
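    cls_embedding = hidden_state[0][:, 0]  # hidden state of the first token of each sequence
    print(cls_embedding)
    # The caller unpacks two values; returning lge_size as the second one is an assumption.
    return cls_embedding, lge_size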

#  test with data
path = '/gpfswork/rech/kpf/umg16uw/expe_5/model/sm'
print("Load model...")
flaubert = FlaubertModel.from_pretrained(path)
emb, s = get_flaubert_layer(data1, path)

Stacktrace and results

  0%|          | 0/4 [00:00<?, ?ba/s]
 25%|██▌       | 1/4 [00:00<00:01,  1.63ba/s]
 50%|█████     | 2/4 [00:01<00:01,  1.72ba/s]
 75%|███████▌  | 3/4 [00:01<00:00,  1.81ba/s]
100%|██████████| 4/4 [00:01<00:00,  2.41ba/s]
Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/remove_noise.py", line 130, in <module>
    emb, s = get_flaubert_layer(data1, path)
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/functions_for_processing.py", line 206, in get_flaubert_layer
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
    bs, slen = input_ids.size()
AttributeError: 'list' object has no attribute 'size'
srun: error: r11i0n6: task 0: Exited with exit code 1
srun: Terminating job step 381611.0

real	1m37.690s
user	0m0.013s

Format of the data passed to the FlauBERT model:

Loading tokenizer...
Transform data to format Dataset...
Set data to torch format...
Format of data after -set.format- : <class 'datasets.arrow_dataset.Dataset'>
Format of input_ids after -set.format- : <class 'list'>
Format of one element in the dataset <class 'torch.Tensor'>

As you can see, the input_ids column is a list and not a tensor.

Try changing the flaubert call that raises the error to this:

hidden_state = flaubert(**tokenized_dataset[0])

I am now getting this error:

  0%|          | 0/1 [00:00<?, ?ba/s]
100%|██████████| 1/1 [00:00<00:00,  5.38ba/s]
100%|██████████| 1/1 [00:00<00:00,  5.38ba/s]
Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/exec.py", line 10, in <module>
    classification_fmc(model, file)
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/classification_fmc_platal.py", line 89, in classification_fmc
    Xtest_emb, s = get_flaubert_layer(Xtest['sent'], path_to_model_lge) # index 2 correspond to sentences
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/…/traitements/functions_for_processing.py", line 212, in get_flaubert_layer
    hidden_state = flaubert(**tokenized_dataset[0])
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
    bs, slen = input_ids.size()
ValueError: not enough values to unpack (expected 2, got 1)
srun: error: r13i1n0: task 0: Exited with exit code 1
srun: Terminating job step 428869.0

It looks like your input_ids is a list of torch tensors. Maybe you didn't tokenize your dataset so that all the tokenized texts have the same length. In that case you need to use a data collator to pad all the tokenized texts in your batch to the same length, in order to create one single tensor for your batch.
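
To make the padding idea concrete, here is a minimal pure-Python sketch of what a padding collator does to a batch. The function name pad_batch, the token ids, and pad_id=0 are made up for illustration; with a real tokenizer the padding id would be tokenizer.pad_token_id, and transformers ships DataCollatorWithPadding for this job, so treat this as an illustration of the mechanism rather than the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-id lists to the longest one in the batch.

    pad_id=0 is an assumption; use tokenizer.pad_token_id in practice.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep mask 1, padding positions get mask 0.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three sentences of different lengths.
batch = pad_batch([[5, 8, 2], [5, 2], [5, 9, 7, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Once every row has the same length, the nested list can be turned into one 2-D tensor of shape (batch_size, sequence_length), which is the shape the model's forward method unpacks with bs, slen = input_ids.size().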

Another option is to do the padding during the tokenization, so that all examples have the same length.
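
A sketch of that second option, assuming the tokenization happens in a helper you control (the code of preprocessed_with_flaubert is not shown in the thread, so tokenize_padded and max_length=128 below are hypothetical). The padding, truncation and max_length arguments are the standard ones accepted by transformers tokenizers.

```python
def tokenize_padded(tokenizer, texts, max_length=128):
    """Tokenize a list of texts so that every example has the same length.

    max_length=128 is an arbitrary value chosen for illustration.
    """
    # padding="max_length" pads shorter texts up to max_length;
    # truncation=True cuts longer texts down to max_length.
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )


# Possible usage (not run here, the column name "sent" is an assumption):
#   tokenizer = FlaubertTokenizer.from_pretrained(path_to_lge)
#   dataset = dataset.map(
#       lambda batch: tokenize_padded(tokenizer, batch["sent"]),
#       batched=True,
#   )
```

Padding everything to a fixed max_length wastes some computation on short texts, but it guarantees that each column holds equally sized rows, so set_format(type='torch', ...) can hand back a single tensor per column.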