Chapter 6 questions

In our case, we consider “hug” because it is a strict substring of “hugs”. The notion of strict substring is only used here to select the initial tokens for this toy example (in a real use case, we would use a BPE algorithm, for example). Then we calculate their frequency of appearance, independently of whether they are a strict substring or not.


Hello!
How do I do the :pencil2: Try it out! exercise: “Compute the start and end indices for the five most likely answers”?

Hi @SaulLu,

I agree with @dipetkov on his comment about including “hug” in the frequencies.

I didn’t quite get the point of your comment: why do we include “hug” in the frequencies from both “hugs” and “hug”?

Thanks a lot

I am very sorry that this example is confusing.

To give some context, here we want to show with a very small example how the Unigram algorithm works.

This algorithm starts with an initial vocabulary, which is usually determined by a BPE algorithm. To avoid complicating the toy example, we wanted to use a simpler rule here: “take all strict substrings for the initial vocabulary”.

In concrete terms, we have listed all the strict substrings of the words in the corpus:

  • the strict substrings of "hug" are ['h', 'u', 'g', 'hu', 'ug']
  • the strict substrings of "pug" are ['p', 'u', 'g', 'pu', 'ug']
  • the strict substrings of "pun" are ['p', 'u', 'n', 'pu', 'un']
  • the strict substrings of "bun" are ['b', 'u', 'n', 'bu', 'un']
  • the strict substrings of "hugs" are ['h', 'u', 'g', 's', 'hu', 'ug', 'gs', 'hug', 'ugs']

By merging these lists of strict substrings and by deleting the duplicates, we end up with the initial vocabulary of ['n', 'b', 'g', 'u', 's', 'p', 'h', 'un', 'gs', 'hu', 'ug', 'bu', 'pu', 'ugs', 'hug'].
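To make this concrete, here is a minimal Python sketch (not from the course) that enumerates these strict substrings and builds the initial vocabulary:

corpus_words = ["hug", "pug", "pun", "bun", "hugs"]

def strict_substrings(word):
    # all contiguous substrings that are shorter than the word itself
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
        if j - i < len(word)
    }

initial_vocab = set()
for word in corpus_words:
    initial_vocab |= strict_substrings(word)

print(sorted(initial_vocab, key=lambda t: (len(t), t)))
# ['b', 'g', 'h', 'n', 'p', 's', 'u', 'bu', 'gs', 'hu', 'pu', 'ug', 'un', 'hug', 'ugs']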

Now that we have this list as our initial vocabulary, we can forget about the notion of strict substrings and move on to the second part of the Unigram algorithm, which starts with the calculation of frequencies.

Does this make more sense?


Hello everyone,

I was following along with Chapter 6 Part 2, trying to train a new tokenizer from an old one. I used AutoTokenizer.train_new_from_iterator() exactly as in the example, though with a different dataset. The dataset is actually bigger (2.2 GB). Unfortunately, training took too long (almost an hour and thirty minutes), and I realized it was not utilizing all the threads/CPUs that I have. Please see the screenshot below.
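For reference, the call I used follows the course example, roughly like this (a sketch; training_corpus stands in for the generator over batches of texts from my own dataset):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# training_corpus yields batches of texts from the (2.2 GB) dataset
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)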

I haven’t raised this as an issue on GitHub because I am not sure if there is something that I need to do first.

import torch

# scores is the (n_tokens x n_tokens) matrix of start x end probabilities from the course code
k = 5
top_k_scores, top_k_indices = torch.topk(scores.view(-1), k)
start_indices = top_k_indices // scores.size(1)
end_indices = top_k_indices % scores.size(1)

view(-1) is used to flatten the scores tensor into a 1D tensor of shape (67*67,)

Hi everyone.
First of all, I would like to thank everyone involved in this community. I am enjoying this course a lot; thank you very much to the great teachers. I would also like to thank iotengtr for the hint on converting the tensor into a 1D shape.
That said, I would like to contribute here as well. I was getting a different response at first, and now I understand why.

The first challenge of Chapter 6 asks us to find the top 5 answers.

Running now with transformers 4.27.0.dev0

import transformers
transformers.__version__

I ran the following script:

from transformers import pipeline

model_checkpoint = "distilbert-base-cased-distilled-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context, top_k=5)

I got the following answer:

[{'score': 0.9802603125572205,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247781544923782,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0013677021488547325,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.00038108558510430157,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'},
 {'score': 0.00021684422972612083,
  'start': 96,
  'end': 106,
  'answer': 'TensorFlow'}]

But when I iterate over the tensors and logits myself, one odd answer arises:

[{'score': 0.9802601933479309,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247780613601208,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0068414947018027306,
  'start': 33,
  'end': 106,
  'answer': 'three most popular deep learning libraries — Jax, PyTorch, and TensorFlow'},
 {'score': 0.0013677021488547325,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.0003810854977928102,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'}]

See that? “three most popular deep learning libraries…”
I was annoyed by that.
But when I looked at the source code of the question-answering pipeline, I realised what was going on.
Ref: transformers/question_answering.py at main · huggingface/transformers · GitHub
You can pass a max_answer_len to the pipeline (it defaults to 15, which is presumably why the longer span gets filtered out of the pipeline’s results).

Knowing that, I changed the first call accordingly and then got the same answers.

question_answerer(question=question, context=context, top_k=5, max_answer_len=150)

Returns:

[{'score': 0.9802603125572205,
  'start': 78,
  'end': 106,
  'answer': 'Jax, PyTorch, and TensorFlow'},
 {'score': 0.008247781544923782,
  'start': 78,
  'end': 108,
  'answer': 'Jax, PyTorch, and TensorFlow —'},
 {'score': 0.0068414960987865925,
  'start': 33,
  'end': 106,
  'answer': 'three most popular deep learning libraries — Jax, PyTorch, and TensorFlow'},
 {'score': 0.0013677021488547325,
  'start': 78,
  'end': 90,
  'answer': 'Jax, PyTorch'},
 {'score': 0.00038108558510430157,
  'start': 83,
  'end': 106,
  'answer': 'PyTorch, and TensorFlow'}]

Problem solved for me!
Thank you again.

And, by the way, my code with the changes requested by the exercise is the following:

(nothing changed here)

from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"


model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits


sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000
start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

scores = start_probabilities[:,None] * end_probabilities[None,:]

(my loop, thanks to iotengtr)

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

scores = torch.triu(scores)
k = 5
top_k_scores, top_k_indices = torch.topk(scores.view(-1), k)
start_indices = top_k_indices // scores.size(1)
end_indices = top_k_indices % scores.size(1)
exercise_data = []
for i in range(len(start_indices)):
    start_char, _ = offsets[start_indices[i].item()]
    _, end_char = offsets[end_indices[i].item()]
    answer = context[start_char:end_char]
    exercise_data.append({
        'score': top_k_scores[i].item(),
        'start': start_char,
        'end': end_char,
        'answer': answer,
    })

exercise_data

Is there a typo in the line

P([‘p’, ‘u’, ‘g’]) = P([‘p’]) * P([‘u’]) * P([‘g’]) = (5/210) * (36/210) * (20/210) = 0.000389

in the Unigram module at Unigram tokenization - Hugging Face Course?
Based on the frequencies given above that line, it looks like
("p", 17) ("pu", 17)
Shouldn’t P(“p”) = 17/210 instead of 5/210? The same would also be true in the next line for P(“pu”), where it says it’s 5.
Am I missing something here? Thank you


I have the same observation.

How is the tokenization score for “pug” equal to 0.007710?

In an earlier section the tokenization score for “pug” is given as 0.0022676. Shouldn’t both tokenization scores match?

Thank you

IMO, as you say, the scores in the two sections should match. The point, however, is that the correct tokenization score for "pug" is indeed 0.007710 rather than 0.0022676.

As pointed out by @Baruch, there are a couple of typos in the Tokenization section. The possible tokenizations of "pug" should be the following:

  • P([‘p’, ‘u’, ‘g’]) = P([‘p’]) * P([‘u’]) * P([‘g’]) = (17/210) * (36/210) * (20/210) = 0.001322
  • P([‘pu’, ‘g’]) = P([‘pu’]) * P([‘g’]) = (17/210) * (20/210) = 0.007710
  • P([‘p’, ‘ug’]) = P([‘p’]) * P([‘ug’]) = (17/210) * (20/210) = 0.007710
  • P([‘pug’]) = P([‘pug’]) = (0/210) = 0 (since “pug” itself is not in the initial vocabulary)

I believe that this should be flagged and corrected.
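For reference, a minimal Python sketch (not from the course) that reproduces these numbers from the frequency table in that section (total frequency 210):

total = 210
freqs = {"p": 17, "u": 36, "g": 20, "pu": 17, "ug": 20}  # from the course's frequency table

def segmentation_score(tokens):
    # Unigram score of a segmentation: product of the individual token probabilities
    score = 1.0
    for token in tokens:
        score *= freqs.get(token, 0) / total
    return score

print(round(segmentation_score(["p", "u", "g"]), 6))  # 0.001322
print(round(segmentation_score(["pu", "g"]), 6))      # 0.00771
print(round(segmentation_score(["p", "ug"]), 6))      # 0.00771
print(segmentation_score(["pug"]))                    # 0.0 ("pug" is not in the vocabulary)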

Addressed in #547.

Great. :grinning:
Hopefully the PR will be merged soon!

Hi, in the question-answering pipeline example Fast tokenizers in the QA pipeline - Hugging Face Course, it is shown how to reproduce the confidence score from the logits. The code shown may work for the context and question given (i.e. the computation equals the pipeline score after rounding), but if you change the question to, say,
question = "Who is the president of Hugging Face"
then the answers no longer correspond. This suggests that the computation is not exactly what is going on inside the pipeline. Can this be corrected?

Also, I know an independence assumption is used in the multiplication. However, why is this done? Why not use Bayes’ rule instead: convert the softmax of the end logits after each start index into valid probabilities by dividing them by their sum after that index (i.e. conditioning on the end coming after the start) before the multiplication? When I do this, it still doesn’t equal the pipeline result exactly in all cases.
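For concreteness, here is a rough sketch of the conditional normalization I have in mind (toy probabilities standing in for the masked-softmax outputs from the course code):

import torch

# toy stand-ins for the start/end probabilities over 5 token positions
start_probabilities = torch.tensor([0.05, 0.60, 0.15, 0.10, 0.10])
end_probabilities = torch.tensor([0.05, 0.10, 0.50, 0.25, 0.10])

# course version: independence assumption, keeping only pairs with end >= start
scores = torch.triu(start_probabilities[:, None] * end_probabilities[None, :])

# conditional version: for each start index i, renormalize the end probabilities
# by their total mass at positions >= i, so they are valid conditional probabilities
tail_sums = torch.flip(torch.cumsum(torch.flip(end_probabilities, [0]), 0), [0])
conditional_scores = scores / tail_sums[:, None]
print(conditional_scores)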

This raises a general question: I would like to know (e.g. see code) what is going on at the various stages of the pipeline, so I can reconstruct this calculation and understand it. I can get the attention head values and the hidden_states array after running the input manually through the tokenizer and model, but how do I know what is done after that, for instance, to get the logits? I understand pipelines are made to hide the details, but sometimes I want to look under the hood at intermediate products.

I ran the following code (the first code block from the chapter):

from datasets import load_dataset
# This can take a few minutes to load, so grab a coffee or tea while you wait!
raw_datasets = load_dataset("code_search_net", "python")

and got these errors. Any idea what’s wrong?

Downloading and preparing dataset code_search_net/python to /root/.cache/huggingface/datasets/code_search_net/python/1.0.0/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27…

Downloading data files: 100% 1/1 [00:00<00:00, 36.17it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 40.31it/s]


NotADirectoryError                        Traceback (most recent call last)
Cell In [7], line 4
      1 from datasets import load_dataset
      3 # This can take a few minutes to load, so grab a coffee or tea while you wait!
----> 4 raw_datasets = load_dataset("code_search_net", "python")

File /usr/local/lib/python3.8/dist-packages/datasets/load.py:1791, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1788 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1790 # Download and prepare data
---> 1791 builder_instance.download_and_prepare(
   1792     download_config=download_config,
   1793     download_mode=download_mode,
   1794     verification_mode=verification_mode,
   1795     try_from_hf_gcs=try_from_hf_gcs,
   1796     num_proc=num_proc,
   1797     storage_options=storage_options,
   1798 )

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:891, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    889 if num_proc is not None:
    890     prepare_split_kwargs["num_proc"] = num_proc
---> 891 self._download_and_prepare(
    892     dl_manager=dl_manager,
    893     verification_mode=verification_mode,
    894     **prepare_split_kwargs,
    895     **download_and_prepare_kwargs,
    896 )

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:1651, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1650 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
---> 1651     super()._download_and_prepare(
   1652         dl_manager,
   1653         verification_mode,
   1654         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1655         or verification_mode == VerificationMode.ALL_CHECKS,
   1656         **prepare_splits_kwargs,
   1657     )

File /usr/local/lib/python3.8/dist-packages/datasets/builder.py:964, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    962 split_dict = SplitDict(dataset_name=self.name)
    963 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
---> 964 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)

File ~/.cache/huggingface/modules/datasets_modules/datasets/code_search_net/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27/code_search_net.py:166, in CodeSearchNet._split_generators(self, dl_manager)
    155 data_dirs = [
    156     os.path.join(directory, lang, "final", "jsonl")
    157     for lang, directory in dl_manager.download_and_extract(data_urls).items()
    158 ]
    160 split2dirs = {
    161     split_name: [os.path.join(directory, split_name) for directory in data_dirs]
    162     for split_name in ["train", "test", "valid"]
    163 }
    165 split2paths = dl_manager.extract(
---> 166     {
    167         split_name: [
    168             os.path.join(directory, entry_name)
    169             for directory in split_dirs
    170             for entry_name in os.listdir(directory)
    171         ]
    172         for split_name, split_dirs in split2dirs.items()
    173     }
    174 )

File ~/.cache/huggingface/modules/datasets_modules/datasets/code_search_net/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27/code_search_net.py:167, in <dictcomp>(.0)
---> 167 split_name: [
    168     os.path.join(directory, entry_name)
    169     for directory in split_dirs
    170     for entry_name in os.listdir(directory)
    171 ]

File ~/.cache/huggingface/modules/datasets_modules/datasets/code_search_net/80a244ab541c6b2125350b764dc5c2b715f65f00de7a56107a28915fac173a27/code_search_net.py:170, in <listcomp>(.0)
---> 170 for entry_name in os.listdir(directory)

File /usr/local/lib/python3.8/dist-packages/datasets/streaming.py:71, in extend_module_for_streaming.<locals>.wrap_auth.<locals>.wrapper(*args, **kwargs)
     69 @wraps(function)
     70 def wrapper(*args, **kwargs):
---> 71     return function(*args, use_auth_token=use_auth_token, **kwargs)

File /usr/local/lib/python3.8/dist-packages/datasets/download/streaming_download_manager.py:524, in xlistdir(path, use_auth_token)
    522 main_hop, *rest_hops = _as_str(path).split("::")
    523 if is_local_path(main_hop):
---> 524     return os.listdir(path)
    525 else:
    526     # globbing inside a zip in a private repo requires authentication
    527     if not rest_hops and (main_hop.startswith("http://") or main_hop.startswith("https://")):

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/25ceeb4c25ab737d688bd56ea92bfbb1f199fe572470456cf2d675479f342ac7/python/final/jsonl/train'

Chapter 6.2, Training a New Tokenizer.
The code doesn’t run; it fails during the download of the dataset.

I had posted a question with the error trace, but it was flagged as spam.
I hope you check the spam folder as well.

Hello,
I created my own tokenizer, but it cannot encode some emoji characters like :grin:; they end up as the <unk> token. Please help me fix this, thank you.
P.S.: this is my code:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Initialize the tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

# Define the trainer
# ("emo" is my list of emoji tokens and "lines" is my training text, both defined earlier)
trainer = BpeTrainer(
    special_tokens=["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"] + emo,
    min_frequency=10,   # Minimum frequency of a token to be included
    vocab_size=15000,   # Maximum vocabulary size
    show_progress=True,
)

# Train the tokenizer from the text
tokenizer.train_from_iterator((line.strip() for line in lines), trainer=trainer)

tokenizer.add_special_tokens(["<pad>", "<cls>", "<sep>", "<mask>"])

# Set the pre-tokenization method to ByteLevel
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Save the tokenizer to a file
tokenizer.save("tokenizer.json")

Hi,
I am retraining the t5-base tokenizer on my custom domain dataset, but I am getting confused between tokenizer.train_new_from_iterator() and tokenizer.add_tokens().
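For context, the two options I am weighing look roughly like this (a sketch; corpus_iterator and the example tokens are placeholders for my own data):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Option 1: learn a brand-new vocabulary from my corpus, reusing the old tokenizer's
# algorithm and settings (corpus_iterator yields batches of texts from my dataset)
new_tokenizer = tokenizer.train_new_from_iterator(corpus_iterator, vocab_size=32000)

# Option 2: keep the existing vocabulary and just append extra domain-specific tokens
tokenizer.add_tokens(["domain_term_a", "domain_term_b"])
# (the model then needs model.resize_token_embeddings(len(tokenizer)) to use them)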

In Chapter 6 of the course, it says: “At this stage, we could take the argmax of the start and end probabilities — but we might end up with a start index that is greater than the end index, so we need to take a few more precautions.”

Can someone explain why?
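To make the quoted sentence concrete, here is a toy illustration (made-up probabilities, not from the course) of how the two independent argmaxes can give an invalid span:

import torch

# toy start/end probabilities over 4 token positions
start_probabilities = torch.tensor([0.1, 0.2, 0.1, 0.6])  # argmax = 3
end_probabilities = torch.tensor([0.1, 0.7, 0.1, 0.1])    # argmax = 1

# taking each argmax independently yields start=3, end=1: an invalid span (start > end)
print(start_probabilities.argmax().item(), end_probabilities.argmax().item())

# the course instead scores every (start, end) pair and masks out end < start with torch.triu
scores = torch.triu(start_probabilities[:, None] * end_probabilities[None, :])
best = scores.argmax().item()
print(divmod(best, scores.size(1)))  # (1, 1): the best valid span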