Chapter 6 questions

sgugger · November 15, 2021, 2:13pm

Use this topic for any question about Chapter 6 of the course.

Lenn · November 16, 2021, 10:07am

Hi,

In the section “Fast tokenizers’ special powers” (TensorFlow tutorial), executing this part of the code triggers an error:

lewtun · November 16, 2021, 10:25am

Hey @Lenn thanks for reporting this! A quick fix would be to use outputs.logits:

import tensorflow as tf

probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
probabilities = probabilities.numpy().tolist()
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
predictions = predictions.numpy().tolist()
print(predictions)

We’ll update the notebooks later this week!

dipetkov · December 30, 2021, 2:42pm

Hello

I’ve got a question about computing scores by hand in the WordPiece section. The example says:

The most frequent pair is (“##u”, “##g”) (present 20 times), but the individual frequency of “##u” is very high, so its score is not the highest (it’s 1 / 26). All pairs with a “##u” actually have that same score (1 / 26).

Isn’t the score 1 / 36 instead of 1 / 26, since freq("##u", "##g") = 20, freq("##u") = 36, and freq("##g") = 20?
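
The arithmetic can be checked with a short sketch. It assumes the WordPiece pair score is freq(pair) / (freq(first) × freq(second)), which is how the later compute_pair_scores function in this thread computes it. The name pair_score is made up for illustration.

```python
from fractions import Fraction


def pair_score(pair_freq, first_freq, second_freq):
    # Assumed WordPiece pair score: freq(pair) / (freq(first) * freq(second))
    return Fraction(pair_freq, first_freq * second_freq)


# Frequencies quoted above:
# freq("##u", "##g") = 20, freq("##u") = 36, freq("##g") = 20
score = pair_score(20, 36, 20)
print(score)  # 1/36
```

Using Fraction keeps the result exact, so it can be compared directly with the 1 / 26 shown in the course text.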

SaulLu · January 5, 2022, 8:50pm

Hello @dipetkov ,

That’s a good point! I agree with your calculations. We’ll try to make the necessary modifications to the course quickly!

wgpubs · January 6, 2022, 4:00am

This is more of a question about training token classification models, but it’s regarding this statement from Chapter 6:

Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.

As such, token classification models may be trained using a list of words that have been split out differently than the given model’s tokenizer would have. Given this:

Does this have any practical effect on performance?
Would it be better to use architectures whose tokenizer splits words apart the same way the training inputs were split? And if so, what is the best way to determine whether your training inputs (your lists of words) mesh with how your tokenizer would split them?

dipetkov · January 7, 2022, 2:01pm

Hello
I have more questions about tokenizers. I enjoyed this section a lot!

This question is about WordPiece tokenization. In the compute_pair_scores(splits) function, why add 1 and not freq on line 7?

6:    if len(split) == 1:
7:        letter_freqs[split[0]] += 1  # Shouldn't we add freq instead of 1?

dipetkov · January 7, 2022, 2:15pm

The next question is about the “Fast tokenizers’ special powers” section, and in particular, the algorithm to group entities.

Say the model decides to split (incorrectly) “Hugging Face” into two entities “Hugging” and “Face”. [This happened when I was trying out a smaller model.]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

Then “Face” is not included in the named entity results.

entity_group  score     word      start  end
PER           0.998169  Sylvain   11     18
ORG           0.975004  Hugging   33     40
LOC           0.993211  Brooklyn  49     57

I have a reprex to reproduce this edge case and propose a solution.

reprex

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

results = []

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]

    # Original solution: skips over the second of two consecutive named entities.
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

    # # Proposed solution: recognizes both of two consecutive named entities.
    # if label != "O":
    #     # Remove the B- or I- prefix
    #     label = label[2:]
    #     start, end = offsets[idx]
    #     all_scores = [probabilities[idx][pred]]
    #
    #     # Grab all subsequent tokens with the same I-label, if any
    #     while (
    #         idx + 1 < len(predictions)
    #         and model.config.id2label[predictions[idx + 1]] == f"I-{label}"
    #     ):
    #         idx += 1
    #         all_scores.append(probabilities[idx][pred])
    #         _, end = offsets[idx]

        # Take the mean of all token scores in the grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

pd.DataFrame(results)

dipetkov · January 8, 2022, 3:16pm

A comment about counting subwords in the Unigram tokenization section: (“hug”, 15) should be (“hug”, 5), because “hug” appears as a strict substring only in “hugs”. Unfortunately, this affects all the subsequent computations. For example, the sum of all frequencies is 200, not 210.

lewtun · January 21, 2022, 2:42pm

Hey @wgpubs in general you’ll get gibberish in your decoded outputs if you use a different tokenizer to the one associated with a model checkpoint.

The main reason why is that the vocabularies will typically be different, and the model’s embedding layers assume that the mapping from token to input ID is consistent with the one defined during pretraining.
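
As a toy illustration of the point about mismatched vocabularies, here is a minimal sketch. The two vocabularies are invented for this example, and whitespace splitting stands in for real subword tokenization.

```python
# Two hypothetical vocabularies: same tokens, different token-to-ID mappings.
vocab_a = {"my": 0, "name": 1, "is": 2, "sylvain": 3}
vocab_b = {"is": 0, "sylvain": 1, "my": 2, "name": 3}


def encode(text, vocab):
    # Whitespace tokenization is a simplification of what real tokenizers do.
    return [vocab[token] for token in text.lower().split()]


def decode(ids, vocab):
    # Invert the vocabulary to map IDs back to tokens.
    id_to_token = {idx: token for token, idx in vocab.items()}
    return " ".join(id_to_token[idx] for idx in ids)


ids = encode("My name is Sylvain", vocab_a)
print(ids)                   # [0, 1, 2, 3]
print(decode(ids, vocab_a))  # my name is sylvain
print(decode(ids, vocab_b))  # is sylvain my name
```

The same IDs mean different tokens under the second vocabulary, which is the situation a model's embedding layer is in when it receives IDs produced by a tokenizer it was not pretrained with.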

SaulLu · January 28, 2022, 5:47pm

Thanks for sharing your discovery and for the nice words about the course. Indeed, I also think it should be freq instead of 1. We will change that in the course.

SaulLu · January 28, 2022, 5:53pm

In this section, what we want to do is create a large initial vocabulary for our example. We choose to put in our initial vocabulary all the strict substrings.

Then, to calculate the frequencies of these tokens in the training dataset, we have to take all of them into account, which is why the number of occurrences of “hug” is 15.
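
A minimal sketch of that counting, assuming each token is counted every time it occurs inside a word, including when the token is the whole word. The helper name count_occurrences is made up for illustration; the corpus and vocabulary are the ones from the Unigram section.

```python
from collections import Counter

# Toy corpus from the Unigram section: (word, frequency)
corpus = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]
# Initial vocabulary listed in the section
vocab = ["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu",
         "s", "hug", "gs", "ugs"]


def count_occurrences(word, token):
    # Count every position in `word` at which `token` starts.
    return sum(word.startswith(token, i) for i in range(len(word)))


freqs = Counter()
for word, freq in corpus:
    for token in vocab:
        freqs[token] += freq * count_occurrences(word, token)

print(freqs["hug"])         # 15
print(sum(freqs.values()))  # 210
```

Under this assumption, “hug” gets 10 from the word “hug” plus 5 from “hugs”, and the total matches the 210 quoted earlier in the thread.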

Does this answer your question?

dipetkov · March 8, 2022, 3:56pm

Now that we add freq on line 7 in compute_pair_scores, lines 6-8 are redundant altogether. We don’t need the if statement because when there is a single letter, the execution doesn’t enter the for loop and split[0] is the same as split[-1].

dipetkov · March 8, 2022, 4:01pm

I still think the count for “hug” is off if we count strict substrings. At least I got 5 for “hug” when I implemented the first exercise to “Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.” That’s because “hug” appears as a strict substring only in “hugs”, which occurs 5 times.

lewtun · March 9, 2022, 10:59am

Hi @dipetkov, I’m not sure about removing the if statement here because doing so will double-count the letter frequencies of the splits that contain more than one token. In other words, if we do this:

from collections import defaultdict


def compute_pair_scores(splits):
    # word_freqs is the global word-frequency dictionary built earlier in the course
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        letter_freqs[split[0]] += freq # This will count the frequency of single token and pairs of tokens
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq # Double counting occurs here for the first token in a pair
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

Perhaps you can share the version of compute_pair_scores() that you think is correct?

lewtun · March 9, 2022, 11:12am

I think this might depend on one’s definition of “strict substring”. What we had in mind was the conventional definition that a string is a substring of itself. Here’s an example from Wikipedia that illustrates the point:

The list of all substrings of the string “apple” would be “apple”, “appl”, “pple”, “app”, “ppl”, “ple”, “ap”, “pp”, “pl”, “le”, “a”, “p”, “l”, “e”, “” (note the empty string at the end).

By analogy, the list of all substrings of “hug” would be “hug”, “hu”, “ug”, “h”, “u”, “g”, “”.

Did you have a different definition in mind?

dipetkov · March 12, 2022, 11:01pm

@lewtun I admit that by “strict substring” I understood that a string is not its own substring. But with either definition there is an issue. Either that or I am confused about something obvious to everyone else.

So this is the data. It’s taken directly from the chapter on Unigram tokenization.

corpus = [
    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
]
substrings = [
    "h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu",
    "s", "hug", "gs", "ugs"
]
frequencies = [
    ("h", 15), ("u", 36), ("g", 20), ("hu", 15), ("ug", 20),
    ("p", 17), ("pu", 17), ("n", 16), ("un", 16), ("b", 4),
    ("bu", 4), ("s", 5), ("hug", 15), ("gs", 5), ("ugs", 5)
]

If a string is its own substring, why include “hug” but no “pug”, “pun” or “bun” in the frequency list?
If a string is not its own substring, then the frequency for “hug” is not correct.

And here is my code for counting subwords, with either definition of “strict”.

reprex

from collections import Counter
from itertools import combinations


def pretokenize(text):
    # for word, _ in tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text):
    for word in text.split():
        yield word


def get_strict_substrings(string, min_length=1, max_length=999, strict=True):
    n = len(string)
    return {
        string[i:j]
        # Generate all combinations of start and end positions
        for i, j in combinations(range(n + 1), r=2)
        if min_length <= j - i <= max_length
        and (not strict or (i, j) != (0, n))
    }


def count_subwords(corpus, min_length=1, max_length=999, strict=True):
    return Counter(
        subword
        for text in corpus
        for word in pretokenize(text)
        for subword in get_strict_substrings(word,
                                             min_length=min_length,
                                             max_length=max_length,
                                             strict=strict)
    )


corpus = ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
text = [word for word, reps in corpus for rep in range(reps)]

count_subwords(text, strict=True)
count_subwords(text, strict=False)

dipetkov · March 12, 2022, 11:30pm

@lewtun Here is my implementation. It just removes the condition checking if len(split) == 1.

Note however that what you’ve copied above is not what’s currently shown in the WordPiece tokenization section of the course. The line

letter_freqs[split[0]] += freq # This will count the frequency of single token and pairs of tokens

is missing from the course webpage altogether and the ordering of the rest is slightly different.

I’ve assumed that the implementation on the course website is correct.

reprex

from collections import defaultdict


# Original implementation. I've just added `word_freqs` as an input argument.
def compute_pair_scores(word_freqs, splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores


def compute_pair_scores2(word_freqs, splits):
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for (word, freq), tokens in zip(word_freqs.items(), splits):
        # Don't check if `len(split) == 1`
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            token_freqs[tokens[i]] += freq
            pair_freqs[pair] += freq
        # Since even if `len(split) == 1`,
        # we will skip the loop and will get to this line 
        # and we will add the frequency of the singleton.
        token_freqs[tokens[-1]] += freq

    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


word_freqs = defaultdict(int, {"a": 1, "b": 2, "ab": 4, "abc": 6})
splits = {"a": ["a"], "b": ["b"], "ab": ["a", "##b"], "abc": ["a", "##b", "##c"]}
compute_pair_scores(word_freqs, splits)

# My implementation assumes that `splits` is a list that
# contains one element for each word in `word_freqs`.
splits = splits.values()
compute_pair_scores2(word_freqs, splits)

ollibolli · March 13, 2022, 1:20pm

Hi,

I’ve implemented and used your code to make a Q&A pipeline.
It works quite nicely.
Now I’m trying to run inference over a large dataset, and iterating over it is too slow (it’s for the Kaggle student NLP competition, and there is a 9-hour limit).
So I tried to feed it a dataset as per the docs, i.e.:
pipe(test_set_val, batch_size = 8, total =len(test_set_val))
However, I get the following error:
KeyError: 'You need to provide a dictionary with keys {question:..., context:...}'
I could fix this by iterating over the dataset, but then it’s too slow again.

Is there some way to feed a dataset to the Q&A pipeline, as with the other pipelines?

lewtun · March 21, 2022, 5:35pm

Hi @ollibolli, I recommend checking out the section on pipeline chunk batching in the docs, as well as the preceding section on pipeline batching more generally. I think that should provide you with a way to speed up your QA pipeline.
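
If the KeyError comes from rows whose keys are not exactly question and context, one possible workaround is to reshape the rows before calling the pipeline. The sketch below is only an illustration under that assumption: the helper to_qa_inputs, the column names q and text, and the sample rows are all hypothetical.

```python
def to_qa_inputs(rows, question_key="question", context_key="context"):
    # Keep only the two fields named in the pipeline's error message.
    return [
        {"question": row[question_key], "context": row[context_key]}
        for row in rows
    ]


# Hypothetical rows standing in for the test set
rows = [
    {"id": 0, "q": "Where do I work?", "text": "I work at Hugging Face in Brooklyn."},
    {"id": 1, "q": "What is my name?", "text": "My name is Sylvain."},
]
qa_inputs = to_qa_inputs(rows, question_key="q", context_key="text")
print(qa_inputs[0])

# Assumed usage, not verified here: pass the whole list in a single call,
# e.g. results = pipe(qa_inputs, batch_size=8)
```

Whether this actually speeds things up depends on the pipeline version and hardware, so it is worth timing on a small slice of the data first.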
