Why can't I pass my encoded inputs directly to a model?

Hey,

I'm doing an internship and I've been using Hugging Face for 3 months now.
So far I've been writing quick-and-dirty scripts that got the job done, but now I would like to understand the library better.

From what I've read in the docs, I should be able to encode an input, pass it to a model (in my case mBART, specifically BARThez), and then decode the output, so that my input is better suited to my problem.

However, when I run this very simple code on Colab (with or without GPU):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

print("Tokenizer loading...")
tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
print("Model loading...")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")
sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer.encode(sentence, padding=True, truncation=True, max_length=400)
outputs = model(inputs)
decoded = tokenizer.decode(outputs)
print(decoded)

It ends either with a RAM issue that crashes my notebook, or with the following error:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-4-59d9ac7f77ce> in <module>()
      8 sentence = "J'aime beaucoup les courgettes"
      9 inputs = tokenizer.encode(sentence, padding=True, truncation=True, max_length=400)
---> 10 outputs = model(inputs)
     11 decoded = tokenizer.decode(outputs)
     12 print(decoded)

4 frames

/usr/local/lib/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py in shift_tokens_right(input_ids, pad_token_id)
     77     have a single `decoder_start_token_id` in contrast to other Bart-like models.
     78     """
---> 79     prev_output_tokens = input_ids.clone()
     80 
     81     if pad_token_id is None:

AttributeError: 'list' object has no attribute 'clone'

I would like to know what I'm doing wrong, since I'm having trouble understanding how the library works…

Thank you in advance

Hi,

The model you’re loading (AutoModelForSeq2SeqLM) is a PyTorch model. Hence, it expects PyTorch tensors as input.

However, you are providing lists to it.

Actually, the tokenizer.encode method shouldn't be used here; we recommend just calling the tokenizer directly and specifying the return_tensors argument (set it to "pt" for PyTorch tensors, "tf" for TensorFlow tensors, "jax" for JAX DeviceArrays, etc.):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")

sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer(sentence, padding=True, truncation=True, max_length=400, return_tensors="pt")

outputs = model(**inputs)

Note that calling the tokenizer returns a dictionary containing the necessary inputs (in this case, input_ids and attention_mask). Hence I use ** to unpack the dictionary in Python.
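
For illustration, here's a small sketch (reusing the barthez tokenizer from your snippet) showing what that dictionary contains and why ** works:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
inputs = tokenizer("J'aime beaucoup les courgettes", return_tensors="pt")

# The tokenizer returns a BatchEncoding, which behaves like a Python dict
print(list(inputs.keys()))  # ['input_ids', 'attention_mask']

# So these two calls are equivalent:
# model(**inputs)
# model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])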


Thank you for the answer. However, it's taking an unusually long time to run and it even returns the following error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-3-31d3aa40d5d6> in <module>()
      9 inputs = tokenizer(sentence, padding=True, truncation=True, max_length=400, return_tensors="pt")
     10 outputs = model(**inputs)
---> 11 decoded = tokenizer.decode(outputs)
     12 print(decoded)

1 frames

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    546         if isinstance(token_ids, int):
    547             token_ids = [token_ids]
--> 548         text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    549 
    550         if clean_up_tokenization_spaces:

TypeError: Can't convert {'logits': [[[40.902488708496094, -0.9500141739845276, 23.778488159179688, 2.97902250289917, 1.0444610118865967, 0.6736031770706177, 1.643608570098877, 7.16114616394043, 9.330695152282715, 2.9293270111083984, 4.922652244567871, -2.1415107250213623, 0.7898175716400146, 7.592775344848633, 11.821823120117188, 4.072318077087402, 6.06789493560791, -0.7886749505996704, -0.48601233959198, 0.9719571471214294, 1.0579102039337158, 4.57142448425293, -4.199000358581543, 3.8000001907348633, 3.0659127235412598, -2.7590901851654053, 4.784829139709473, 2.1427366733551025, -1.2063086032867432, 4.118697166442871, 10.903802871704102, -4.32050895690918, 6.891828536987305, 9.28072738647461, 4.505707263946533, 5.600512504577637, -0.4364454746246338, 3.474435329437256, 13.781015396118164, 2.1815648078918457, 2.7080931663513184, 8.260913848876953, 5.871652603149414, 10.788243293762207, -0.4696718454360962, 3.8598363399505615, 7.86379337310791, 10.398337364196777, 0.23873519897460938, 4.820779800415039, 4.705344200134277, -10.943077087402344, 11.51213264465332, 5.284116744995117, -4.4821977615356445, 0.13441193103790283, 5.286510944366455, 2.136887550354004, 4.943244934082031, 2.728332996368408, 0.04189950227737427, 5.49397087097168, 4.022393226623535, 11.157049179077148, 1.551186203956604, -1.5785623788833618, 0.356919527053833, 2.1878066062927246, 6.765768527984619, 0.19461774826049805, 2.3888068199157715, -0.45217859745025635, 6.269065856933594, -1.6246172189...


Hi,

The outputs of the model are also a dictionary-like object by default, so you can't call tokenizer.decode(outputs) directly: the tokenizer's decode method expects token ids, not a dictionary.
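
To see what I mean, you can inspect the object returned by the forward pass (just a sketch, continuing from the code above; the exact keys depend on the model and config):

# outputs is a Seq2SeqLMOutput, which behaves like a dictionary
print(list(outputs.keys()))   # e.g. ['logits', 'past_key_values', 'encoder_last_hidden_state']
print(outputs.logits.shape)   # (batch_size, sequence_length, vocab_size)

# outputs.logits are raw scores over the vocabulary, not token ids,
# which is why tokenizer.decode can't handle them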

What task do you want to use the model for? For a model like AutoModelForSeq2SeqLM, one should use the generate method to autoregressively generate text based on the prompt.

Here’s an example with a T5 model from the hub:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# inference
input_ids = tokenizer(
    "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# studies have shown that owning a dog is good for you.
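
The same pattern with the checkpoints from your snippet would look roughly like this (untested sketch; the generation parameters are just illustrative):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")

sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer(sentence, truncation=True, max_length=400, return_tensors="pt")

# generate returns token ids, which the tokenizer can decode back into text
summary_ids = model.generate(**inputs, max_length=100)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))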

Hey, thanks a lot for the answer.

I've marked your answer as the solution; however, I'm still quite confused about what calling model(**inputs) should actually be used for.

Also, I'm confused because the reason I decomposed the process in the first place is that I couldn't figure out how to correctly truncate my inputs when using a pipeline.

I guess I should make another post after re-re-reading the docs if I still don’t understand.

Thank you very much

I highly recommend checking out our free course.

It has an entire chapter on fine-tuning seq2seq models (like BART or T5) for summarization: Summarization - Hugging Face Course