Set dataset to pytorch tensors produce class list making the model unable to process the data

emmakelo · July 6, 2021, 11:26am

Hello,

I am following the tutorial "Fine-tuning a pretrained model" (transformers 4.7.0 documentation) in order to use FlauBERT to produce embeddings for training my classifier. One of the steps sets the dataset format to PyTorch tensors, but after applying that line I get a list, which I do not understand. When I print a single element of the dataset I get a tensor, but the whole "input_ids" column that I pass to the model is actually a list, so the model cannot process the data. Could you help me figure out why I get a list and not a PyTorch tensor when using set_format with type='torch'?

def get_flaubert_layer(texte, path_to_lge): # last version

    tokenized_dataset, lge_size = preprocessed_with_flaubert(texte, path_to_lge)
    print("Set data to torch format...")
    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
    print("Format of data after -set.format- :", type(tokenized_dataset))
    print("Format of input_ids after -set.format- :", type(tokenized_dataset['input_ids']))
    print("Format of one element in the dataset", type(tokenized_dataset['input_ids'][0]))
    print(type(tokenized_dataset))
    print('Loading model...')
    flaubert = FlaubertModel.from_pretrained(path_to_lge)
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
                            attention_mask=tokenized_dataset['attention_mask'])
    print(hidden_state[0][:, 0])
    cls_embedding = hidden_state[0][:, 0]
    print(cls_embedding)

#  test with data
path = '/gpfswork/rech/kpf/umg16uw/expe_5/model/sm'
print("Load model...")
flaubert = FlaubertModel.from_pretrained(path)
emb, s = get_flaubert_layer(data1, path)

Stacktrace and results

  0%|          | 0/4 [00:00<?, ?ba/s]
 25%|██▌       | 1/4 [00:00<00:01,  1.63ba/s]
 50%|█████     | 2/4 [00:01<00:01,  1.72ba/s]
 75%|███████▌  | 3/4 [00:01<00:00,  1.81ba/s]
100%|██████████| 4/4 [00:01<00:00,  2.41ba/s]
Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/remove_noise.py", line 130, in <module>
    emb, s = get_flaubert_layer(data1, path)
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/traitements/functions_for_processing.py", line 206, in get_flaubert_layer
    hidden_state = flaubert(input_ids=tokenized_dataset['input_ids'],
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
    bs, slen = input_ids.size()
AttributeError: 'list' object has no attribute 'size'
srun: error: r11i0n6: task 0: Exited with exit code 1
srun: Terminating job step 381611.0

real	1m37.690s
user	0m0.013s

format of the data passed to the flaubert model :

Loading tokenizer...
Transform data to format Dataset...
Set data to torch format...
Format of data after -set.format- : <class 'datasets.arrow_dataset.Dataset'>
Format of input_ids after -set.format- : <class 'list'>
Format of one element in the dataset <class 'torch.Tensor'>

As you can see, the input_ids column is returned as a list and not as a tensor.

thomwolf · July 7, 2021, 10:45am

Try changing the model call to this:

hidden_state = flaubert(**tokenized_dataset[0])

emmakelo · July 12, 2021, 2:44pm

I am now getting this error :
0%| | 0/1 [00:00<?, ?ba/s]
100%|██████████| 1/1 [00:00<00:00, 5.38ba/s]
100%|██████████| 1/1 [00:00<00:00, 5.38ba/s]
Traceback (most recent call last):
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/exec.py", line 10, in <module>
    classification_fmc(model, file)
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/classification_fmc_platal.py", line 89, in classification_fmc
    Xtest_emb, s = get_flaubert_layer(Xtest['sent'], path_to_model_lge) # index 2 correspond to sentences
  File "/gpfs7kw/linkhome/rech/genlig01/umg16uw/test/expe_5/platal/…/traitements/functions_for_processing.py", line 212, in get_flaubert_layer
    hidden_state = flaubert(**tokenized_dataset[0])
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/linkhome/rech/genlig01/umg16uw/.conda/envs/bert/lib/python3.9/site-packages/transformers/models/flaubert/modeling_flaubert.py", line 174, in forward
    bs, slen = input_ids.size()
ValueError: not enough values to unpack (expected 2, got 1)
srun: error: r13i1n0: task 0: Exited with exit code 1
srun: Terminating job step 428869.0

lhoestq · July 20, 2021, 9:23am

It looks like your input_ids is a list of torch tensors. Maybe the texts in your dataset were not tokenized to the same length. In that case you need a data collator that pads all the tokenized texts in a batch to the same length, so that the batch can be stacked into one single tensor.

Another option is to do the padding during the tokenization, so that all examples have the same length.
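
To make the padding idea concrete, here is a minimal sketch in plain Python of what a padding collator roughly does. The helper name pad_batch, the token ids, and pad_id=0 are made up for illustration; in a real setup the pad id would come from your tokenizer, and you would more likely rely on the library's own tools (for example DataCollatorWithPadding, or padding=True when calling the tokenizer) than on hand-written code like this.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length lists of token ids to the longest one in the batch.

    pad_id=0 is an assumption for this sketch; use your tokenizer's pad id.
    """
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence with pad_id up to max_len.
    input_ids = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three sentences of different lengths.
batch = pad_batch([[5, 17, 42], [5, 9], [5, 3, 8, 11, 6]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Once every row has the same length, the nested list is rectangular, so it can be converted into one 2-D tensor of shape (batch size, sequence length). That is the shape the model unpacks with bs, slen = input_ids.size().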
