# Chapter 6 questions

Use this topic for any question about Chapter 6 of the course.

Hi,

In the section “Fast tokenizers’ special powers” (TensorFlow tutorial), executing this part of the code triggers an error:

Hey @Lenn thanks for reporting this! A quick fix would be to use outputs.logits:

import tensorflow as tf

probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
probabilities = probabilities.numpy().tolist()
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
predictions = predictions.numpy().tolist()
print(predictions)

We’ll update the notebooks later this week!


Hello

I’ve got a question about computing scores by hand in the WordPiece section. The example says:

The most frequent pair is (“##u”, “##g”) (present 20 times), but the individual frequency of “##u” is very high, so its score is not the highest (it’s 1 / 26). All pairs with a “##u” actually have that same score (1 / 26).

Isn’t the score 1 / 36 instead of 1 / 26, since freq("##u", "##g") = 20, freq("##u") = 36 and freq("##g") = 20?
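For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the WordPiece score is the pair frequency divided by the product of the two individual frequencies, as in the course's compute_pair_scores; the helper name wordpiece_score is hypothetical and not part of the course code.

```python
from fractions import Fraction


def wordpiece_score(pair_freq, first_freq, second_freq):
    """Score = freq(pair) / (freq(first) * freq(second)), kept as an exact fraction."""
    return Fraction(pair_freq, first_freq * second_freq)


# Frequencies quoted in the question: freq("##u", "##g"), freq("##u"), freq("##g")
score = wordpiece_score(pair_freq=20, first_freq=36, second_freq=20)
print(score)  # 1/36
```

Using Fraction avoids floating-point rounding, so the result can be compared directly with the value printed in the course.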

Hello @dipetkov,

That’s a good point! I agree with your calculations. We’ll try to make the necessary modifications to the course quickly!

This is more a question about training token classification models, but it’s regarding this statement from Chapter 6:

Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so will consider it two words.

As such, token classification models may be trained using a list of words that have been split out differently than the given model’s tokenizer would have. Given this:

1. Does this have any practical effect on performance?

2. Would it be better to use architectures that split words apart the same way the training inputs were split? And if so, what is the best way to determine whether your training inputs (your lists of words) mesh with how your tokenizer would split them out?

Hello
I have more questions about tokenizers. I enjoyed this section a lot!

This question is about WordPiece tokenization. In the compute_pair_scores(splits) function, why add 1 and not freq on line 7?

6:    if len(split) == 1:
7:        letter_freqs[split[0]] += 1  # Shouldn't we add freq instead of 1?

The next question is about the “Fast tokenizers’ special powers” section, and in particular, the algorithm to group entities.

Say the model decides to split (incorrectly) “Hugging Face” into two entities, “Hugging” and “Face”. [This happened when I was trying out a smaller model.]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

Then “Face” is not included in the named entity results.

entity_group  score     word      start  end
PER           0.998169  Sylvain   11     18
ORG           0.975004  Hugging   33     40
LOC           0.993211  Brooklyn  49     57

I have a reprex to reproduce this edge case and propose a solution.

reprex
import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# # Hugging Face is one entity
# predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
# # Hugging Face is two entities
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]

results = []

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]

    # Original solution: skips over the second of two consecutive named entities.
    if label != "O":
        # Remove the B- or I- prefix
        label = label[2:]
        start, _ = offsets[idx]

        # Grab all the tokens labeled with I-label
        all_scores = []
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][pred])
            _, end = offsets[idx]
            idx += 1

    # # Proposed solution: recognizes both of two consecutive named entities.
    # if label != "O":
    #     # Remove the B- or I- prefix
    #     label = label[2:]
    #     start, end = offsets[idx]
    #     all_scores = [probabilities[idx][pred]]
    #
    #     # Grab all subsequent tokens with the same I-label, if any
    #     while (
    #         idx + 1 < len(predictions)
    #         and model.config.id2label[predictions[idx + 1]] == f"I-{label}"
    #     ):
    #         idx += 1
    #         all_scores.append(probabilities[idx][pred])
    #         _, end = offsets[idx]

        # Take the mean of all token scores in the grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    idx += 1

pd.DataFrame(results)
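
The look-ahead idea in the proposed solution can also be tried without downloading the model. The following is only a sketch under stated assumptions: the function name group_entities is hypothetical, the labels are written out by hand assuming the checkpoint maps 4 to I-PER, 5 to B-ORG, 6 to I-ORG and 8 to I-LOC, the character offsets are hand-computed for the example sentence assuming the tokens listed in the course, and scores are left out for brevity.

```python
def group_entities(labels, offsets, text):
    """Group BIO-style token labels into entity spans.

    Any token whose label is not "O" starts an entity; the entity is then
    extended only by the following tokens labeled I-<same type>.
    """
    results = []
    idx = 0
    while idx < len(labels):
        label = labels[idx]
        if label != "O":
            entity_type = label[2:]  # drop the "B-" or "I-" prefix
            start, end = offsets[idx]
            # Look ahead rather than consuming the current token inside the
            # loop, so idx always stops on the last token of the entity.
            while idx + 1 < len(labels) and labels[idx + 1] == f"I-{entity_type}":
                idx += 1
                _, end = offsets[idx]
            results.append(
                {
                    "entity_group": entity_type,
                    "word": text[start:end],
                    "start": start,
                    "end": end,
                }
            )
        idx += 1
    return results


example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
# Assumed label names for predictions [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 5, 0, 8, 0, 0]
labels = [
    "O", "O", "O", "O", "I-PER", "I-PER", "I-PER", "I-PER", "O", "O",
    "O", "O", "I-ORG", "I-ORG", "B-ORG", "O", "I-LOC", "O", "O",
]
# Hand-computed (start, end) offsets; special tokens get (0, 0)
offsets = [
    (0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
    (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
    (46, 48), (49, 57), (57, 58), (0, 0),
]

entities = group_entities(labels, offsets, example)
for entity in entities:
    print(entity)
```

With these inputs, “Face” is reported as its own ORG entity right after “Hugging” instead of being skipped.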

Comment about counting subwords in the Unigram tokenization section: (“hug”, 15) should be (“hug”, 5) because “hug” appears as a strict substring only in “hugs”. Unfortunately this affects all the subsequent computations. For example, the sum of all frequencies is 200, not 210.

Hey @wgpubs, in general you’ll get gibberish in your decoded outputs if you use a different tokenizer to the one associated with a model checkpoint.

The main reason is that the vocabularies will typically be different, and the model’s embedding layers assume that the mapping from token to input ID is consistent with the one defined during pretraining.
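
To make the vocabulary mismatch concrete, here is a toy sketch. The two vocabularies and the encode/decode helpers are made up for illustration and are not taken from any real tokenizer; they only show that the same IDs mean different tokens once the token-to-ID mapping changes.

```python
# Two hypothetical vocabularies that assign different IDs to the same tokens.
vocab_a = {"[UNK]": 0, "hello": 1, "world": 2, "tokenizer": 3}
vocab_b = {"[UNK]": 0, "world": 1, "tokenizer": 2, "hello": 3}


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the unknown token."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def decode(ids, vocab):
    """Map IDs back to tokens using the inverse of the vocabulary."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return [id_to_token[token_id] for token_id in ids]


tokens = ["hello", "world"]
ids_from_b = encode(tokens, vocab_b)

round_trip = decode(ids_from_b, vocab_b)  # same vocabulary on both sides
mismatched = decode(ids_from_b, vocab_a)  # IDs interpreted with the wrong vocabulary

print(round_trip)  # ['hello', 'world']
print(mismatched)  # ['tokenizer', 'hello']
```

A model’s embedding matrix plays the role of vocab_a here: it was trained against one fixed mapping, so IDs produced by another tokenizer select the wrong rows.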


Thanks for sharing your discovery and the nice words about the course. Indeed, I also think it should be freq instead of 1. We will change that in the course.


In this section, what we want to do is create a large initial vocabulary for our example. We choose to put in our initial vocabulary all the strict substrings.

Then, to calculate the frequencies of these tokens in the training dataset, we have to take every occurrence into account. That’s why the number of occurrences of “hug” is 15 (10 from the word “hug” itself and 5 from “hugs”).

Does this answer your question?


Now that we add freq on line 7 in compute_pair_scores, lines 6-8 are redundant altogether. We don’t need the if statement because when there is a single letter, the execution doesn’t enter the for loop and split[0] is the same as split[-1].

I still think the count for “hug” is off if we count strict substrings. At least I got 5 for “hug” when I implemented the first exercise to “Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.” That’s because “hug” appears as a strict substring only in “hugs”, which occurs 5 times.

Hi @dipetkov, I’m not sure about removing the if statement here because doing so will double-count the letter frequencies of the splits that contain more than one token. In other words, if we do this:

from collections import defaultdict

# `word_freqs` is the global word-frequency dictionary from the course section
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        letter_freqs[split[0]] += freq  # This will count the frequency of single token and pairs of tokens
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq  # Double counting occurs here for the first token in a pair
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

Perhaps you can share the version of compute_pair_scores() that you think is correct?

I think this might depend on one’s definition of “strict substring”. What we had in mind was the conventional definition that a string is a substring of itself. Here’s an example from Wikipedia that illustrates the point:

The list of all substrings of the string "apple" would be "apple", "appl", "pple", "app", "ppl", "ple", "ap", "pp", "pl", "le", "a", "p", "l", "e", "" (note the empty string at the end).

By analogy, the list of all substrings of “hug” would be “hug”, “hu”, “ug”, “h”, “u”, “g”, “”.

Did you have a different definition in mind?


@lewtun I admit that by “strict substring” I understood that a string is not its own substring. But with either definition there is an issue. Either that or I am confused about something obvious to everyone else.

So this is the data. It’s taken directly from the chapter on Unigram tokenization.

corpus = [
    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
]
substrings = [
    "h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu",
    "s", "hug", "gs", "ugs"
]
frequencies = [
    ("h", 15), ("u", 36), ("g", 20), ("hu", 15), ("ug", 20),
    ("p", 17), ("pu", 17), ("n", 16), ("un", 16), ("b", 4),
    ("bu", 4), ("s", 5), ("hug", 15), ("gs", 5), ("ugs", 5)
]
• If a string is its own substring, why include “hug” but not “pug”, “pun” or “bun” in the frequency list?
• If a string is not its own substring, then the frequency for “hug” is not correct.

And here is my code for counting subwords, with either definition of â€śstrictâ€ť.

reprex
from collections import Counter
from itertools import combinations

def pretokenize(text):
    # for word, _ in tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text):
    for word in text.split():
        yield word

def get_strict_substrings(string, min_length=1, max_length=999, strict=True):
    n = len(string)
    return {
        string[i:j]
        # Generate all combinations of start and end positions
        for i, j in combinations(range(n + 1), r=2)
        if min_length <= j - i <= max_length
        and (not strict or (i, j) != (0, n))
    }

def count_subwords(corpus, min_length=1, max_length=999, strict=True):
    return Counter(
        subword
        for text in corpus
        for word in pretokenize(text)
        for subword in get_strict_substrings(word,
                                             min_length=min_length,
                                             max_length=max_length,
                                             strict=strict)
    )

corpus = ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
text = [word for word, reps in corpus for rep in range(reps)]

count_subwords(text, strict=True)
count_subwords(text, strict=False)

@lewtun Here is my implementation. It just removes the condition checking if len(split) == 1.

Note, however, that what you’ve copied above is not what’s currently shown in the WordPiece tokenization section of the course. The line

letter_freqs[split[0]] += freq # This will count the frequency of single token and pairs of tokens

is missing from the course webpage altogether and the ordering of the rest is slightly different.

I’ve assumed that the implementation on the course website is correct.

reprex
from collections import defaultdict

# Original implementation. I've just added `word_freqs` as an input argument.
def compute_pair_scores(word_freqs, splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

def compute_pair_scores2(word_freqs, splits):
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for (word, freq), tokens in zip(word_freqs.items(), splits):
        # Don't check if `len(split) == 1`
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            token_freqs[tokens[i]] += freq
            pair_freqs[pair] += freq
        # Since even if `len(split) == 1`,
        # we will skip the loop and will get to this line
        # and we will add the frequency of the singleton.
        token_freqs[tokens[-1]] += freq

    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }

word_freqs = defaultdict(int, {"a": 1, "b": 2, "ab": 4, "abc": 6})
splits = {"a": ["a"], "b": ["b"], "ab": ["a", "##b"], "abc": ["a", "##b", "##c"]}
compute_pair_scores(word_freqs, splits)

# My implementation assumes that `splits` is a list that
# contains one element for each word in `word_freqs`.
splits = splits.values()
compute_pair_scores2(word_freqs, splits)
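
The claim that the singleton check is redundant can be tested directly. This is a condensed sketch rather than the course code: it folds both variants into one hypothetical function, pair_scores, with a flag that switches the len(split) == 1 branch on or off, and compares the results on the same toy data.

```python
from collections import defaultdict


def pair_scores(word_freqs, splits, special_case_singletons):
    """WordPiece pair scores, with or without the explicit singleton branch."""
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if special_case_singletons and len(split) == 1:
            token_freqs[split[0]] += freq
            continue
        # For a one-token split this loop body never runs
        for i in range(len(split) - 1):
            token_freqs[split[i]] += freq
            pair_freqs[(split[i], split[i + 1])] += freq
        # Counts the last token, which is also the only token of a singleton
        token_freqs[split[-1]] += freq
    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


word_freqs = {"a": 1, "b": 2, "ab": 4, "abc": 6}
splits = {"a": ["a"], "b": ["b"], "ab": ["a", "##b"], "abc": ["a", "##b", "##c"]}

with_check = pair_scores(word_freqs, splits, special_case_singletons=True)
without_check = pair_scores(word_freqs, splits, special_case_singletons=False)
print(with_check == without_check)  # True
```

On this data both variants give identical scores, since a one-token split skips the loop and is counted exactly once by the final line.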

Hi,

I’ve implemented and used your code to make a Q&A pipeline.
It works quite nicely.
Now I’m trying to do inference over a large dataset, and iterating over it is too slow (it’s for the Kaggle student NLP competition, which has a 9-hour limit).
So I tried to feed it a dataset as per the docs.
I.e.:
pipe(test_set_val, batch_size = 8, total =len(test_set_val))
However I get the following error:
KeyError: 'You need to provide a dictionary with keys {question:..., context:...}'
I could fix this by iterating over the dataset, but then it’s too slow again.

Is there some way to feed a dataset to the Q&A pipeline like for the other pipelines?

Hi @ollibolli, I recommend checking out the section on pipeline chunk batching in the docs, as well as the preceding section on pipeline batching more generally. I think that should provide you with a way to speed up your QA pipeline.
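
As a possible starting point, the KeyError message suggests each input needs the keys question and context. The sketch below only reshapes rows into that form. The helper to_qa_inputs, the sample row and its column names are hypothetical, and the commented-out call assumes pipe is a transformers question-answering pipeline that accepts a list of such dictionaries together with batch_size; whether this is fast enough for your case would need to be checked against the batching docs.

```python
def to_qa_inputs(rows, question_key="question", context_key="context"):
    """Reshape dataset rows into the {"question": ..., "context": ...}
    dictionaries named in the KeyError message."""
    return [
        {"question": row[question_key], "context": row[context_key]}
        for row in rows
    ]


# Hypothetical rows standing in for a real dataset with other column names
rows = [
    {"q": "Where does Sylvain work?", "text": "Sylvain works at Hugging Face in Brooklyn."},
]

qa_inputs = to_qa_inputs(rows, question_key="q", context_key="text")
print(qa_inputs[0]["question"])

# Assumption: `pipe` is a question-answering pipeline created elsewhere.
# answers = pipe(qa_inputs, batch_size=8)
```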