Chapter 2 questions

Use this topic for any question about Chapter 2 of the course.

  1. In the Handling multiple sequences page of Chapter 2, there is a bug in the code under the Attention masks section (see the sketch after this list).

    The PyTorch toggle is on, but the code uses TensorFlow’s tf.constant function.

  2. There is a typo on https://huggingface.co/course/chapter2/6?fw=pt

  3. Isn’t WordPiece a subword algorithm as well?
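
For item 1, presumably the fix is just swapping tf.constant for torch.tensor. A minimal sketch of what the corrected PyTorch snippet should look like (the batched_ids and attention_mask values mirror the course’s earlier padding example and are assumptions here):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Pad the shorter sequence and mask out the padding position
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

# torch.tensor, not tf.constant, when the PyTorch toggle is on
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)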

Thanks for flagging all of this, will push a fix in the morning!

  • In Chapter 2, on the Putting it all together page, the code snippet should include padding=True :slight_smile:

  • Under Wrapping up: From tokenizer to model, the last line of the code snippet should be changed to output = model(**tokens) (a sketch with both fixes applied follows below)
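
A minimal sketch of the From tokenizer to model snippet with both suggested fixes applied (the surrounding lines follow the course’s example and are partly assumptions here):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# padding=True is required because the two sequences have different lengths
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# Unpack the tokenizer output as keyword arguments
output = model(**tokens)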

@sgugger I’m absolutely loving this course! A great refresher on the library, with really intuitive videos and tutorials to work through and understand the Hugging Face library. I honestly wish I had had this resource when I started out. Can’t wait for the next part of the course! :slight_smile:

thanks for reporting the bugs and suggested fixes @harish3110! will push a fix this afternoon :slight_smile:

I found a small typo in the section Behind the pipeline - Postprocessing the output.

  • error: [0.9946, 0.0544]
  • correction: [0.9995, 0.0005]

The following block gives a little more context (see the section Postprocessing the output):

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9946, 0.0544] for the second one. These are recognizable probability scores.
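
The misreading is just the scientific notation: 9.9946e-01 is 0.9995, not 0.9946, and 5.4418e-04 is 0.0005, not 0.0544. Rounding the quoted values confirms the correction (plain Python, no assumptions):

probs = [[4.0195e-02, 9.5980e-01],
         [9.9946e-01, 5.4418e-04]]
# Round to four decimal places to match the course text's formatting
print([[round(p, 4) for p in row] for row in probs])
# [[0.0402, 0.9598], [0.9995, 0.0005]]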

Thanks for reporting, will fix this morning!

In the Handling multiple sequences notebook (for TensorFlow) of Chapter 2, there is a bug in the code under the Tokenization section.

No that is not a bug, the course explicitly says this doesn’t work and explains why.

Okay, sorry about that, I’ll have to go over it again. Thanks!

No worries, I understand why you’d be surprised. The notebooks are auto-generated, so I can’t add comments in Markdown cells, but I can add comments in the code!

In Handling multiple sequences - Attention masks - Try it out, there is a caveat that may be worth mentioning in case someone runs into the same question.

Here, by manually tokenizing the two sentences, creating attention masks, and passing them through the model, we should be able to reproduce the same logits as in section 2, which are:

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
    array([[-1.5606991,  1.6122842],
           [ 4.169231 , -3.3464472]], dtype=float32)>

However, if you do it this way:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]

batched_tokens = [tokenizer.tokenize(r) for r in raw_inputs]
batched_ids = [tokenizer.convert_tokens_to_ids(t) for t in batched_tokens]

# Pad all sequences to the length of the longest one
max_len = max(len(ids) for ids in batched_ids)
batched_ids = [ids + [tokenizer.pad_token_id] * (max_len - len(ids))
               for ids in batched_ids]
# Mask out the padding positions
attention_mask = [[0 if x == tokenizer.pad_token_id else 1 for x in ids]
                  for ids in batched_ids]

print(model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask)).logits)

The results will be different from section 2:

tf.Tensor(
[[-2.7276204  2.878937 ]
 [ 3.1930914 -2.668523 ]], shape=(2, 2), dtype=float32)

This is because tokenizer.tokenize does not add the special tokens [CLS] and [SEP] by default, whereas the high-level API tokenizer() does.

If we change this line in the code to:

batched_tokens = [tokenizer.tokenize(r, add_special_tokens=True) for r in raw_inputs]

the result is indeed the same:

tf.Tensor(
[[-1.5606964  1.6122813]
 [ 4.169231  -3.3464475]], shape=(2, 2), dtype=float32)
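
For comparison, the high-level call handles padding, attention masks, and special tokens in one step (reusing the tokenizer, model, and raw_inputs defined above):

# Padding, attention mask, and special tokens are all handled by tokenizer()
inputs = tokenizer(raw_inputs, padding=True, return_tensors="tf")
print(model(**inputs).logits)  # matches the section 2 logits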

Just following the code (for PyTorch) in the course leads to a small error:

outputs = model(**inputs)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...

TypeError: forward() got an unexpected keyword argument 'token_type_ids'

I removed the problematic token_type_ids key and it worked:

outputs = model(** { k: inputs[k] for k in ['input_ids', 'attention_mask'] })

But, then, the result is not what we expected:

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
# tokenizer as defined earlier in the notebook (not shown here)
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**{ k: inputs[k] for k in ['input_ids', 'attention_mask'] })
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
-----
tensor([[0.9618, 0.0382],
        [0.9350, 0.0650]], grad_fn=<SoftmaxBackward>)

Are you sure you are using the right tokenizer? It doesn’t seem so, since you have those token_type_ids added.
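
For context, this checkpoint’s own tokenizer produces no token_type_ids at all, whereas a BERT tokenizer does; that extra key is what triggers the TypeError above. A quick sketch illustrating the difference:

from transformers import AutoTokenizer

# The checkpoint's own tokenizer returns only input_ids and attention_mask,
# so model(**inputs) works with DistilBERT's forward()
distilbert_tok = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
print(distilbert_tok("I hate this so much!").keys())
# dict_keys(['input_ids', 'attention_mask'])

# A BERT tokenizer additionally returns token_type_ids, which DistilBERT's
# forward() does not accept
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok("I hate this so much!").keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])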


I want to use the Behind the pipeline approach to process batches of a dataset. Is there a way to do that? And is it possible to batch with pipeline itself, i.e. specify a batch size and give the model a dataset instead of individual examples?
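
A minimal sketch of the second part, assuming a recent transformers version in which pipeline calls accept a batch_size argument:

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

examples = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# A list of inputs is processed end to end; batch_size controls how many
# examples go through the model per forward pass
results = classifier(examples, batch_size=2)
print(results)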

from transformers import TFAutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

The outputs generated in this step are not used in the next step:

from transformers import TFAutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)
print(outputs.logits.shape)

Then why are we generating output from TFAutoModel?
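
Presumably the first snippet is only there to show what the base model outputs; the second snippet recomputes everything from inputs and does not consume it. A hedged sketch of the relationship, reusing model and inputs from the second snippet (and assuming this DistilBERT checkpoint, whose classification model exposes the base transformer as a distilbert attribute):

# The ForSequenceClassification model wraps the same base transformer that
# TFAutoModel loads and adds a classification head on top; its base
# submodule reproduces the hidden states from the first snippet
hidden = model.distilbert(inputs).last_hidden_state
print(hidden.shape)  # same shape as outputs.last_hidden_state above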

Hi, I regret to mention that I am finding it very difficult to follow Mr. Sylvain’s pronunciation. The subtitles seem to be only in French; if they are made available in English too, it will be easier to follow and understand.

Hi Sylvain,

There is a typo in the mask output in Preprocessing with a tokenizer, Behind the pipeline.

You should make a PR with your fix!