Chapter 6 questions

Use this topic for any question about Chapter 6 of the course.

Hi,

In the section “Fast tokenizers’ special powers” (TensorFlow tutorial), executing this part of the code triggers an error:

Hey @Lenn, thanks for reporting this! A quick fix would be to use outputs.logits:

import tensorflow as tf

probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
probabilities = probabilities.numpy().tolist()
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
predictions = predictions.numpy().tolist()
print(predictions)

We’ll update the notebooks later this week!


Hello

I’ve got a question about computing scores by hand in the WordPiece section. The example says:

The most frequent pair is ("##u", "##g") (present 20 times), but the individual frequency of "##u" is very high, so its score is not the highest (it’s 1 / 26). All pairs with a "##u" actually have that same score (1 / 26).

Isn’t the score 1 / 36 instead of 1 / 26, since freq("##u", "##g") = 20, freq("##u") = 36, and freq("##g") = 20?
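
Spelling out the arithmetic with the pair score formula from that section (just a sanity check of the numbers):

# WordPiece pair score = freq(pair) / (freq(first_element) * freq(second_element))
pair_freq = 20  # freq("##u", "##g")
u_freq = 36     # freq("##u")
g_freq = 20     # freq("##g")

score = pair_freq / (u_freq * g_freq)
print(score)  # 0.02777... = 1/36, not 1/26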

Hello @dipetkov,

That’s a good point! I agree with your calculations; we’ll try to make the necessary modifications to the course quickly! :blush:

This is more of a question about training token classification models, but it’s regarding this statement from Chapter 6:

Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.

As such, token classification models may be trained on lists of words that were split differently from how the given model’s tokenizer would have split them. Given this:

  1. Does this have any practical effect on performance?

  2. Would it be better to use architectures that split words apart the same way your training inputs were split? And if so, what is the best way to determine whether your training inputs (your lists of words) mesh with how your tokenizer would split them out? (See the sketch below.)
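
One way I can think of to check the second point — a minimal sketch, assuming a fast tokenizer; the checkpoint, sentence, and word list below are just placeholders — is to compare your word list against the tokenizer’s own pre-tokenization:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

sentence = "Hugging Face, in Brooklyn!"
dataset_words = ["Hugging", "Face,", "in", "Brooklyn!"]  # hypothetical whitespace-split labels

# Fast tokenizers expose the pre-tokenization step that defines their notion of "words"
pre_tokenized = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentence)
tokenizer_words = [word for word, offsets in pre_tokenized]

print(tokenizer_words)                   # ['Hugging', 'Face', ',', 'in', 'Brooklyn', '!']
print(tokenizer_words == dataset_words)  # False: the two word segmentations differ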

Hello,
I have more questions about tokenizers. I enjoyed this section a lot!

This question is about WordPiece tokenization: in the compute_pair_scores(splits) function, why add 1 and not freq on line 7?

6:    if len(split) == 1:
7:        letter_freqs[split[0]] += 1  # Shouldn't we add freq instead of 1?
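
For context, the function in the section looks roughly like this (reproduced from memory, so details may be slightly off); word_freqs maps each corpus word to its frequency, and splits maps each word to its current list of tokens:

from collections import defaultdict

def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += 1  # line 7: the line in question
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores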

The next question is about the “Fast tokenizers’ special powers” section, and in particular, the algorithm to group entities.

Say the model (incorrectly) decides to split “Hugging Face” into two entities, “Hugging” and “Face”. [This happened when I was trying out a smaller model.]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

Then “Face” is not included in the named entity results.

entity_group     score      word  start  end
         PER  0.998169   Sylvain     11   18
         ORG  0.975004   Hugging     33   40
         LOC  0.993211  Brooklyn     49   57

I have a reprex to reproduce this edge case and propose a solution.

Reprex:
import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

results = []

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]

    # Original solution: skips over the second of two consecutive named entities.
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

    # # Proposed solution: recognizes both of two consecutive named entities.
    # if label != "O":
    #     # Remove the B- or I- prefix
    #     label = label[2:]
    #     start, end = offsets[idx]
    #     all_scores = [probabilities[idx][pred]]
    #
    #     # Grab all subsequent tokens with the same I-label, if any
    #     while (
    #         idx + 1 < len(predictions)
    #         and model.config.id2label[predictions[idx + 1]] == f"I-{label}"
    #     ):
    #         idx += 1
    #         all_scores.append(probabilities[idx][pred])
    #         _, end = offsets[idx]

        # Take the mean of all token scores in the grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

pd.DataFrame(results)

A comment about counting subwords in the Unigram tokenization section: (“hug”, 15) should be (“hug”, 5), because “hug” appears as a strict substring only in “hugs”. Unfortunately, this affects all the subsequent computations. For example, the sum of all frequencies is 200, not 210.
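
To make the arithmetic concrete, here is a small check (the corpus and the initial vocabulary below are reproduced from that section as I remember them, so treat them as an assumption):

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab = ["h", "u", "p", "n", "b", "g", "s",
         "hu", "ug", "pu", "un", "bu", "hug", "gs", "ugs"]

def occurrences(token, word):
    # number of times `token` appears as a substring of `word`
    return sum(word[i:i + len(token)] == token for i in range(len(word) - len(token) + 1))

# Convention 1: count a subword inside every word, including the word itself
freqs_all = {t: sum(occurrences(t, w) * f for w, f in word_freqs.items()) for t in vocab}
# Convention 2 (the strict-substring reading above): exclude the word itself
freqs_strict = {t: sum(occurrences(t, w) * f for w, f in word_freqs.items() if w != t) for t in vocab}

print(freqs_all["hug"], sum(freqs_all.values()))        # 15 210
print(freqs_strict["hug"], sum(freqs_strict.values()))  # 5 200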

Hey @wgpubs, in general you’ll get gibberish in your decoded outputs if you use a different tokenizer from the one associated with a model checkpoint.

The main reason is that the vocabularies will typically be different, and the model’s embedding layer assumes that the mapping from token to input ID is consistent with the one defined during pretraining.
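
Here’s a small illustration (the checkpoints are just arbitrary examples): the same input IDs decode to completely different text under another tokenizer’s vocabulary.

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

ids = bert_tok("Tokenizers map text to IDs.", add_special_tokens=False)["input_ids"]

print(bert_tok.decode(ids))  # round-trips back to (roughly) the original text
print(gpt2_tok.decode(ids))  # gibberish: GPT-2 assigns these IDs to unrelated tokens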