Chapter 2 questions

Hi @vsrinivas, we’re exploring the possibility of adding transcripts for all the videos (English PR here). If this works well on YouTube, we’ll open a call for the community to help translate them or use Whisper :slight_smile:

Hi @vsrinivas I’m happy to share that the English subtitles are now available for most of the course videos! Hope this helps :slight_smile:

Thanks a lot for letting me know. Really appreciate it.

Regards,
Srinivas


The “Models expect a batch of inputs” section might be outdated? It seems that tokenizers have been updated to account for single-dimensional input - I couldn’t reproduce the error in the first code snippet.

As of Dec 2022, the lines

input_ids = tf.constant(ids)
input_ids = tf.constant([ids])

produce the exact same output object.

The course content about model.save_pretrained() is slightly misleading. It gives the example:

model.save_pretrained("directory_on_my_computer")

and then says it will save two files to “your disk”.

Even though I am using Colab, my assumption was that this would save two files to my local disk, which meant a fair amount of fruitless time fiddling around trying to find the files.

It might be helpful for other newbies to specify that the files are saved to your Colab files (and that you may need to refresh your file view for the new directory to show up).
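
For other newbies, a minimal sketch to confirm where the files ended up, run in the same Colab notebook (assuming model is the model loaded earlier in the section):

import os

# save_pretrained() writes into the Colab VM's filesystem, not your local machine
model.save_pretrained("directory_on_my_computer")
print(os.listdir("directory_on_my_computer"))  # e.g. config.json plus the weights file

If you do want the files on your local disk, the Colab file browser (or google.colab.files.download) can pull them down.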

Thanks for a great course!

When I run the steps of the sentiment-analysis pipeline sequentially, I get almost the same result, but the precision is different. The pipelined run gives me 15 digits of precision (e.g. “0.993750274181366”), but running the steps sequentially only gives me 7 digits of precision (e.g. “0.9937503”).

Is there a way to set the desired precision?
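
As far as I can tell, both runs produce the same float32 scores; the pipeline just converts them to Python floats before printing, which is where the extra digits come from rather than extra precision. A small sketch of two ways to see more digits from the manual run (the value below is made up to stand in for one of the scores):

import torch

# a float32 value standing in for one of the softmax scores
predictions = torch.tensor([[0.9937503, 0.0062497]])

torch.set_printoptions(precision=15)  # print more digits of the same float32 storage
print(predictions)

print(predictions[0, 0].item())       # .item() returns a Python float, like the pipeline does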

It is said that BERT could be used directly for an inference task. I thought BERT was an encoder, and therefore unable to make predictions.

After a lot of research I couldn’t find an answer to this question:

I have a local model that was pre-trained on a text regression task,
and this model makes its prediction as a single output (a decimal number between 0 and 1).

So how can I load this model and test it? Using which class,
and without any modification to the architecture of the model?

thank you
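
Not an official answer, but if the model was saved with save_pretrained and its config already has a single label, a sketch like this should load it without touching the architecture (the local path is hypothetical):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "./my_local_regression_model"  # hypothetical directory with config.json + weights
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path)  # num_labels=1 -> one output

inputs = tokenizer("some text to score", return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze()  # the single decimal prediction
print(score)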

Multilingual Tokenizer

  • Please, how does a tokenizer work for a multilingual dataset?
  • Who gets to add the special tokens, me or the tokenizer?
  • Can I customise my own special tokens beyond what comes with AutoTokenizer? Say [Unq] instead of [CLS]. (See the sketch below.)

I’d appreciate some explanation of this, especially of the multilingual part.
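
On the custom-token question, a sketch of what I believe works with a multilingual checkpoint ([Unq] is just a made-up token here, and the checkpoint name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# register an extra special token; the tokenizer still adds [CLS]/[SEP] for you
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["[Unq]"]})
print(num_added)                                   # 1
print(tokenizer("Bonjour le monde")["input_ids"])  # [CLS] ... [SEP] added automatically

# if you fine-tune afterwards, remember model.resize_token_embeddings(len(tokenizer))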

How do I process long sequences if I do not want to truncate to a fixed length? Some sequences would lose information.
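
One option beyond what the chapter covers (so treat this as a sketch, not the course’s recommendation): split long texts into overlapping windows with the tokenizer, or switch to a model built for long inputs such as Longformer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
long_text = "a very long document " * 500  # far beyond the 512-token limit

chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # tokens of overlap between consecutive windows
    return_overflowing_tokens=True,  # keep the overflow as extra windows
)
print(len(chunks["input_ids"]))      # number of windows produced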

I just wanted to note that the concept of “checkpoints” is not introduced (or at least I did not see it introduced). Still, the concept is used everywhere.

The special tokens of gpt2 from the tokenizer

From this, the gpt2 tokenizer has bos_token and eos_token defined. However, when I tried:

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer("Using a Transformer network is simple")
print(tokenizer.decode(tokens["input_ids"]))

I do not see the beginning- and end-of-sentence tokens. Does this mean that the gpt2 model was trained without these tokens?

Many thanks !
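
For what it’s worth, the gpt2 tokenizer does define bos_token and eos_token (both are <|endoftext|>), but it does not insert them automatically when encoding, so you have to add them yourself if you want them. A small sketch:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.bos_token, tokenizer.eos_token)  # both are <|endoftext|> for GPT-2

# the tokenizer does not add them on its own, so prepend/append them by hand if needed
text = tokenizer.bos_token + "Using a Transformer network is simple" + tokenizer.eos_token
tokens = tokenizer(text)
print(tokenizer.decode(tokens["input_ids"]))

As far as I know, GPT-2 used <|endoftext|> as a separator between documents during pretraining rather than as a per-sentence marker, which would explain why nothing is added for a single sentence.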

From the API documentation of tokenizers I see the add_special_tokens parameter for the encode method of Tokenizer.

From Putting it all together - Hugging Face NLP Course of this course:
" The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well."
So, I was wondering, when should I set the add_special_tokens parameter to False?
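
To make the effect concrete, here is a quick sketch with a BERT checkpoint; a typical reason to pass add_special_tokens=False is when you assemble the special tokens yourself, for example when concatenating chunks of a long document and only wanting one [CLS]/[SEP] pair in the final sequence:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer("Hello!")["input_ids"])                            # [101, 7592, 999, 102]
print(tokenizer("Hello!", add_special_tokens=False)["input_ids"])  # [7592, 999]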

Hey!

I tried to feed individually tokenized sentence IDs to the model, and then a batch of different sentences (as asked in the challenge in the chapter). I am not sure why I get different logit values for the two methods; am I doing something wrong? My notebook is hosted here: https://github.com/ArindamRoy23/Huggin_Faces_Course/blob/main/HF_Using_Transformers.ipynb
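
Without seeing the notebook I can only guess, but the usual cause is padding without the attention mask: once the batch is padded, the padding tokens shift the logits of the shorter sentence unless attention_mask is passed to the model. A minimal sketch with the chapter’s checkpoint:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

single = tokenizer("I hate this so much!", return_tensors="pt")
batch = tokenizer(
    ["I hate this so much!", "I've been waiting for a HuggingFace course my whole life."],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    print(model(**single).logits[0])  # logits for the sentence on its own
    print(model(**batch).logits[0])   # same values, because attention_mask masks the padding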

I tried the step-by-step method given in the Chapter 2 → “Behind the pipeline” section and it gave me wrong results. I have a feeling that it is giving me the results from my last run and that the result is cached. Is this possible?

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "It’s not a good app.",
    "My experience was really bad.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)
print(outputs.logits)

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

result = model.config.id2label
print(result)

Results:

torch.Size([2, 10, 768])
torch.Size([2, 2])
tensor([[ 4.7721, -3.7753],
        [ 4.6483, -3.7990]], grad_fn=)
tensor([[9.9981e-01, 1.9402e-04],
        [9.9979e-01, 2.1442e-04]], grad_fn=)
{0: 'NEGATIVE', 1: 'POSITIVE'}
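
As far as I know, Transformers does not cache results between runs; every model(**inputs) call recomputes the forward pass. Both of your inputs are negative sentences, so two confident NEGATIVE predictions is the expected outcome. As a sanity check, here is a small continuation (reusing raw_inputs, predictions, and model from the snippet above) that pairs each score with its label:

import torch

# reuses raw_inputs, predictions, and model from the snippet above
for sentence, probs in zip(raw_inputs, predictions):
    label_id = int(torch.argmax(probs))
    print(sentence, "->", model.config.id2label[label_id], f"{float(probs[label_id]):.4f}")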

In the Tokenization section, do we load and save tokenizers in order to enrich their vocab and train them? Take the line tokenizer("Using a Transformer network is simple"), for example: what is the purpose of running that and then saving the tokenizer? Is this how we enrich our tokenizer?
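
My reading of that section: calling the tokenizer only encodes text and never changes its vocabulary; save_pretrained / from_pretrained are just for persisting and reloading the exact same tokenizer. Training a larger or different vocabulary is a separate step (train_new_from_iterator, covered later in the course). A small sketch (the directory name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer("Using a Transformer network is simple"))  # just encodes; the vocab is untouched

tokenizer.save_pretrained("my_tokenizer")                 # writes vocab + config to disk
reloaded = AutoTokenizer.from_pretrained("my_tokenizer")  # the identical tokenizer, reloaded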

Concerning the maximum length of tokens that a model can accept: is that limit there because we use a pre-trained model? Is it possible to initialize a model from scratch and train it in a way that makes it accept longer sequences?
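
The limit comes from the position embeddings the checkpoint was pretrained with, so yes, a model initialized from scratch can be configured for longer sequences (at the cost of pretraining it yourself). A sketch, assuming a BERT-style architecture:

from transformers import BertConfig, BertModel

# randomly initialized model with a longer maximum sequence length
config = BertConfig(max_position_embeddings=1024)
model = BertModel(config)
print(model.config.max_position_embeddings)  # 1024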

Let’s say we have a couple of sequences:

sequences = ["Hello!", "Cool.", "Nice!"]
The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

As a teacher, I would like to use print(encoded_sequences) to show the resulting output.

:pencil2: Try it out! Replicate the two last steps (tokenization and conversion to input IDs) on the input sentences we used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Check that you get the same input IDs we got earlier!

decoded_string = tokenizer.decode([101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932,
                                   2271, 7954, 1736, 1139, 2006, 1297, 119, 102])
print(decoded_string)
[CLS] I’ve been waiting for a HuggingFace course my whole life. [SEP]

decoded_string = tokenizer.decode([146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119])

print(decoded_string)

I’ve been waiting for a HuggingFace course my whole life.

What’s the difference?
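
The only difference between the two ID lists is the 101 and 102 at the ends; assuming the bert-base-cased checkpoint the exercise uses, those decode to the special tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.decode([101]), tokenizer.decode([102]))  # [CLS] [SEP]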

Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter:

sequence = sequence[:max_sequence_length]

I suggest giving an example here:
sequence = "I've been waiting for a HuggingFace course my whole life."

max_sequence_length = 3

sequence = sequence[:max_sequence_length]

print(sequence)