Why can't I pass my encoded inputs directly to a model?

Hey,

I'm doing an internship and I've been using Hugging Face for 3 months now.
So far I've been writing quick-and-dirty scripts that got the job done, but now I would like to understand the library better.

From what I've read in the docs, I should be able to encode an input, pass it to a model (in my case mBART, specifically BARThez), and then decode the output, so that my input is better suited to my problem.

However, when I run this very simple code on Colab (with or without GPU):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

print("Tokenizer loading...")
tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
print("Model loading...")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")
sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer.encode(sentence, padding=True, truncation=True, max_length=400)
outputs = model(inputs)
decoded = tokenizer.decode(outputs)
print(decoded)

It ends either with a RAM issue that crashes my notebook, or with the following error:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-4-59d9ac7f77ce> in <module>()
      8 sentence = "J'aime beaucoup les courgettes"
      9 inputs = tokenizer.encode(sentence, padding=True, truncation=True, max_length=400)
---> 10 outputs = model(inputs)
     11 decoded = tokenizer.decode(outputs)
     12 print(decoded)

4 frames

/usr/local/lib/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py in shift_tokens_right(input_ids, pad_token_id)
     77     have a single `decoder_start_token_id` in contrast to other Bart-like models.
     78     """
---> 79     prev_output_tokens = input_ids.clone()
     80 
     81     if pad_token_id is None:

AttributeError: 'list' object has no attribute 'clone'

I would like to know what I'm doing wrong, since I'm having trouble understanding how the library works…

Thank you in advance

Hi,

The model you’re loading (AutoModelForSeq2SeqLM) is a PyTorch model. Hence, it expects PyTorch tensors as input.

However, you are providing lists to it.

Actually, the tokenizer.encode method shouldn't be used here; we recommend just calling the tokenizer directly and specifying the return_tensors argument (set it to "pt" for PyTorch tensors, "tf" for TensorFlow tensors, "jax" for JAX DeviceArrays, etc.):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")

sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer(sentence, padding=True, truncation=True, max_length=400, return_tensors="pt")

outputs = model(**inputs)

Note that calling the tokenizer returns a dictionary containing the necessary inputs (in this case, input_ids and attention_mask). Hence I use ** to unpack the dictionary in Python.
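
For illustration, here's a small sketch (reusing the barthez tokenizer from your snippet) showing what that dictionary contains and why ** works:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
inputs = tokenizer("J'aime beaucoup les courgettes", return_tensors="pt")

# The tokenizer returns a BatchEncoding, which behaves like a Python dict
print(list(inputs.keys()))  # ['input_ids', 'attention_mask']

# So these two calls are equivalent:
# model(**inputs)
# model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])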


Thank you for the answer. However, it's taking an unusually long time to run and it even returns the following error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-3-31d3aa40d5d6> in <module>()
      9 inputs = tokenizer(sentence, padding=True, truncation=True, max_length=400, return_tensors="pt")
     10 outputs = model(**inputs)
---> 11 decoded = tokenizer.decode(outputs)
     12 print(decoded)

1 frames

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    546         if isinstance(token_ids, int):
    547             token_ids = [token_ids]
--> 548         text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    549 
    550         if clean_up_tokenization_spaces:

TypeError: Can't convert {'logits': [[[40.902488708496094, -0.9500141739845276, 23.778488159179688, 2.97902250289917, 1.0444610118865967, 0.6736031770706177, 1.643608570098877, 7.16114616394043, 9.330695152282715, 2.9293270111083984, 4.922652244567871, -2.1415107250213623, 0.7898175716400146, 7.592775344848633, 11.821823120117188, 4.072318077087402, 6.06789493560791, -0.7886749505996704, -0.48601233959198, 0.9719571471214294, 1.0579102039337158, 4.57142448425293, -4.199000358581543, 3.8000001907348633, 3.0659127235412598, -2.7590901851654053, 4.784829139709473, 2.1427366733551025, -1.2063086032867432, 4.118697166442871, 10.903802871704102, -4.32050895690918, 6.891828536987305, 9.28072738647461, 4.505707263946533, 5.600512504577637, -0.4364454746246338, 3.474435329437256, 13.781015396118164, 2.1815648078918457, 2.7080931663513184, 8.260913848876953, 5.871652603149414, 10.788243293762207, -0.4696718454360962, 3.8598363399505615, 7.86379337310791, 10.398337364196777, 0.23873519897460938, 4.820779800415039, 4.705344200134277, -10.943077087402344, 11.51213264465332, 5.284116744995117, -4.4821977615356445, 0.13441193103790283, 5.286510944366455, 2.136887550354004, 4.943244934082031, 2.728332996368408, 0.04189950227737427, 5.49397087097168, 4.022393226623535, 11.157049179077148, 1.551186203956604, -1.5785623788833618, 0.356919527053833, 2.1878066062927246, 6.765768527984619, 0.19461774826049805, 2.3888068199157715, -0.45217859745025635, 6.269065856933594, -1.6246172189...


Hi,

The outputs of the model are also a dictionary-like object by default, so you can't call tokenizer.decode(outputs) directly: the tokenizer's decode method expects token ids, not a dictionary.
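
To see what I mean, you can inspect the object returned by the forward pass (just a sketch, continuing from the code above; the exact keys depend on the model and config):

# outputs is a Seq2SeqLMOutput, which behaves like a dictionary
print(list(outputs.keys()))   # e.g. ['logits', 'past_key_values', 'encoder_last_hidden_state']
print(outputs.logits.shape)   # (batch_size, sequence_length, vocab_size)

# outputs.logits are raw scores over the vocabulary, not token ids,
# which is why tokenizer.decode can't handle them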

What task do you want to use the model for? For a model like AutoModelForSeq2SeqLM, one should use the generate method to autoregressively generate text based on the prompt.

Here’s an example with a T5 model from the hub:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# inference
input_ids = tokenizer(
    "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# studies have shown that owning a dog is good for you.
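
The same pattern with the checkpoints from your snippet would look roughly like this (untested sketch; the generation parameters are just illustrative):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-abstract")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/Colab Notebooks/models/trained_on_datcha")

sentence = "J'aime beaucoup les courgettes"
inputs = tokenizer(sentence, truncation=True, max_length=400, return_tensors="pt")

# generate returns token ids, which the tokenizer can decode back into text
summary_ids = model.generate(**inputs, max_length=100)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))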

Hey, thanks a lot for the answer.

I've marked your answer as the solution; however, I'm still quite confused about what calling model(**inputs) should actually be used for.

Also, I'm confused because the reason I decomposed the process in the first place is that I couldn't figure out how to correctly truncate my inputs when using a pipeline.

I guess I should make another post after re-re-reading the docs if I still don’t understand.

Thank you very much

I highly recommend checking out our free course.

It has an entire chapter on fine-tuning seq2seq models (like BART or T5) for summarization: Summarization - Hugging Face Course