Setting the dataset format to return PyTorch tensors returns a list of tensors instead. Why?

Hello,

I am following this tutorial, Fine-tuning a pretrained model — transformers 4.7.0 documentation, in order to use FlauBERT to produce embeddings to train my classifier. In one of the lines I have to set my dataset format to PyTorch tensors, but after applying that line I get a list, which I do not understand. When printing a single element of the dataset I get a tensor, but when I pass the whole 'input_ids' column to the model it is actually a list, so the model cannot process the data. Could you help me figure out why I get a list and not a PyTorch tensor when using set_format with type='torch'?


def get_flaubert_layer(texte, path_to_lge):  # last version

    tokenized_dataset, lge_size = preprocessed_with_flaubert(texte, path_to_lge)
    print("Set data to torch format...")
    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
    print("Format of data after -set.format- :", type(tokenized_dataset))
    print("Format of input_ids after -set.format- :", type(tokenized_dataset['input_ids']))
    print("Format of one element in the dataset", type(tokenized_dataset['input_ids'][0]))
    print('Loading model...')
    flaubert = FlaubertModel.from_pretrained(path_to_lge)
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
                            attention_mask=tokenized_dataset['attention_mask'])
    # Embedding of the first token of each sequence ([CLS]-like position)
    cls_embedding = hidden_state[0][:, 0]
    print(cls_embedding)
    return cls_embedding, lge_size

#  test with data
path = '/gpfswork/rech/kpf/umg16uw/expe_5/model/sm'
print("Load model...")
flaubert = FlaubertModel.from_pretrained(path)
emb, s = get_flaubert_layer(data1, path)

Stacktrace and results

  0%|          | 0/4 [00:00<?, ?ba/s]
 25%|██▌       | 1/4 [00:00<00:01,  1.63ba/s]
 50%|█████     | 2/4 [00:01<00:01,  1.72ba/s]
 75%|███████▌  | 3/4 [00:01<00:00,  1.81ba/s]
100%|██████████| 4/4 [00:01<00:00,  2.41ba/s]
Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/remove_noise.py", line 130, in <module>
    emb, s = get_flaubert_layer(data1, path)
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/functions_for_processing.py", line 206, in get_flaubert_layer
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
    bs, slen = input_ids.size()
AttributeError: 'list' object has no attribute 'size'
srun: error: r11i0n6: task 0: Exited with exit code 1
srun: Terminating job step 381611.0

real	1m37.690s
user	0m0.013s

Format of the data passed to the FlauBERT model:

Loading tokenizer...
Transform data to format Dataset...
Set data to torch format...
Format of data after -set.format- : <class 'datasets.arrow_dataset.Dataset'>
Format of input_ids after -set.format- : <class 'list'>
Format of one element in the dataset <class 'torch.Tensor'>

As you can see, the input_ids column is a list and not a tensor.

The reason could be that padding and/or truncation is not enabled during tokenization, which results in encoded inputs of different lengths. That prevents the type conversion from turning input_ids into a single tensor, since its elements have different sizes, so the result is a list of tensors instead.
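To illustrate, here is a minimal sketch with a tiny toy Dataset (the input_ids values are made up for the example, and the exact return types can vary a bit between datasets versions):

import torch
from datasets import Dataset

# Rows of equal length can be stacked into a single 2-D tensor.
same_length = Dataset.from_dict({"input_ids": [[5, 6, 7], [8, 9, 10]]})
same_length.set_format(type="torch", columns=["input_ids"])
print(type(same_length["input_ids"]))      # a single torch.Tensor of shape (2, 3)

# Rows of different lengths cannot form one rectangular tensor,
# so the column comes back as a Python list of tensors.
var_length = Dataset.from_dict({"input_ids": [[5, 6, 7], [8, 9]]})
var_length.set_format(type="torch", columns=["input_ids"])
print(type(var_length["input_ids"]))       # list
print(type(var_length["input_ids"][0]))    # torch.Tensor (each row individually)

# The underlying reason: torch itself refuses ragged rows.
try:
    torch.tensor([[5, 6, 7], [8, 9]])
except ValueError as e:
    print("torch.tensor on ragged rows fails:", e)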

Hello, this is my tokenization line, and I set both truncation and padding to True:

tokenized_dataset = dataset.map(lambda x: flaubert_tokenizer(x['verbatim'], padding=True, truncation=True, max_length=512), batched=True)

The tokenization looks right and I am not certain it is the culprit, but I would give tokenized_dataset = dataset.map(lambda x: flaubert_tokenizer(x['verbatim'], padding="max_length", truncation=True, max_length=512), batched=True) a try. The problem, I think, is that the tokenized sequences end up with different lengths: with batched=True, padding=True only pads each batch to its own longest sequence, so rows from different batches can still differ in length, whereas padding="max_length" pads every row to the same length. I would verify the sequence lengths next as a sanity check. Good luck!
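For that sanity check, something along these lines should tell you whether ragged lengths are indeed the issue. It reuses dataset, flaubert_tokenizer and tokenized_dataset from your snippets above, so treat it as a sketch rather than a drop-in:

# Sanity check: are all tokenized rows the same length?
lengths = {len(ids) for ids in tokenized_dataset["input_ids"]}
print("distinct sequence lengths:", lengths)   # more than one value -> ragged rows

# Re-tokenize with fixed-length padding so every row has exactly max_length tokens.
tokenized_dataset = dataset.map(
    lambda x: flaubert_tokenizer(x['verbatim'], padding="max_length",
                                 truncation=True, max_length=512),
    batched=True)
tokenized_dataset.set_format(type='torch',
                             columns=['input_ids', 'token_type_ids', 'attention_mask'])
print(type(tokenized_dataset['input_ids']))    # should now be a torch.Tensor, not a list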