Chapter 2 questions

Hi @vsrinivas, we’re exploring the possibility of adding transcripts for all the videos (English PR here). If this works well on YouTube, we’ll open a call for the community to help translate them or use Whisper :slight_smile:

Hi @vsrinivas I’m happy to share that the English subtitles are now available for most of the course videos! Hope this helps :slight_smile:

Thanks a lot for letting me know. Really appreciate it.

Regards,
Srinivas


The “Models expect a batch of inputs” section might be outdated? It seems that tokenizers have been updated to account for single-dimensional input - I couldn’t reproduce the error in the first code snippet.

As of Dec 2022, the lines

input_ids = tf.constant(ids)
input_ids = tf.constant([ids])

produce the exact same output object.

The course content about model.save_pretrained() is slightly misleading. It gives the example:

model.save_pretrained("directory_on_my_computer")

and then says it will save two files to “your disk”.

Even though I am using Colab, my assumption was that this would save two files to my local disk, which meant a fair amount of fruitless time fiddling around trying to find the files.

It might be helpful for other newbies to specify that the files are saved to your Colab files (and that you may need to refresh your file view for the new directory to show up).
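
For other newbies, a minimal sketch to confirm where the files ended up, run in the same Colab notebook (assuming model is the model loaded earlier in the section):

import os

# save_pretrained() writes into the Colab VM's filesystem, not your local machine
model.save_pretrained("directory_on_my_computer")
print(os.listdir("directory_on_my_computer"))  # e.g. config.json plus the weights file

If you do want the files on your local disk, the Colab file browser (or google.colab.files.download) can pull them down.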

Thanks for a great course!

When I run the steps of the sentiment-analysis pipeline sequentially, I get almost the same result, but the precision is different. The pipelined run gives me 15 digits of precision (e.g. “0.993750274181366”), but running the steps sequentially only gives me 7 digits of precision (e.g. “0.9937503”).

Is there a way to set the desired precision?
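
As far as I can tell, both runs produce the same float32 scores; the pipeline just converts them to Python floats before printing, which is where the extra digits come from rather than extra precision. A small sketch of two ways to see more digits from the manual run (the value below is made up to stand in for one of the scores):

import torch

# a float32 value standing in for one of the softmax scores
predictions = torch.tensor([[0.9937503, 0.0062497]])

torch.set_printoptions(precision=15)  # print more digits of the same float32 storage
print(predictions)

print(predictions[0, 0].item())       # .item() returns a Python float, like the pipeline does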

It is said that BERT could be used directly for an inference task. I thought BERT was an encoder, and therefore unable to make predictions.

After a lot of research I couldn’t find an answer to this question:

I have a local model that was pre-trained on a text regression task,
and this model makes its prediction as a single output (a decimal number between 0 and 1).

So how can I load this model and test it? Using which class,
and without any modification to the architecture of the model?

thank you
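
Not an official answer, but if the model was saved with save_pretrained and its config already has a single label, a sketch like this should load it without touching the architecture (the local path is hypothetical):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "./my_local_regression_model"  # hypothetical directory with config.json + weights
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path)  # num_labels=1 -> one output

inputs = tokenizer("some text to score", return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze()  # the single decimal prediction
print(score)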

Multilingual Tokenizer

  • Please, how does a tokenizer work for a multilingual dataset?
  • Who gets to add the special tokens, me or the tokenizer?
  • Can I customise my own special tokens beyond what comes with AutoTokenizer? Say [Unq] instead of [CLS]. (See the sketch below.)

I’d appreciate some explanation of this, especially of the multilingual part.
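
On the custom-token question, a sketch of what I believe works with a multilingual checkpoint ([Unq] is just a made-up token here, and the checkpoint name is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# register an extra special token; the tokenizer still adds [CLS]/[SEP] for you
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["[Unq]"]})
print(num_added)                                   # 1
print(tokenizer("Bonjour le monde")["input_ids"])  # [CLS] ... [SEP] added automatically

# if you fine-tune afterwards, remember model.resize_token_embeddings(len(tokenizer))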

How do I process long sequences if I do not want to truncate to a fixed length? Some sequences would lose information.
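
One option beyond what the chapter covers (so treat this as a sketch, not the course’s recommendation): split long texts into overlapping windows with the tokenizer, or switch to a model built for long inputs such as Longformer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
long_text = "a very long document " * 500  # far beyond the 512-token limit

chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # tokens of overlap between consecutive windows
    return_overflowing_tokens=True,  # keep the overflow as extra windows
)
print(len(chunks["input_ids"]))      # number of windows produced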

I just wanted to note that the concept of “checkpoints” is not introduced (or at least I did not see it introduced). Still, the concept is used everywhere.

The special tokens of gpt2 from the tokenizer

From this, the gpt2 tokenizer has bos_token and eos_token defined. However, when I tried:

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer("Using a Transformer network is simple")
print(tokenizer.decode(tokens["input_ids"]))

I do not see the beginning- and end-of-sentence tokens. Does this mean that the gpt2 model was trained without these tokens?

Many thanks !
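
For what it’s worth, the gpt2 tokenizer does define bos_token and eos_token (both are <|endoftext|>), but it does not insert them automatically when encoding, so you have to add them yourself if you want them. A small sketch:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.bos_token, tokenizer.eos_token)  # both are <|endoftext|> for GPT-2

# the tokenizer does not add them on its own, so prepend/append them by hand if needed
text = tokenizer.bos_token + "Using a Transformer network is simple" + tokenizer.eos_token
tokens = tokenizer(text)
print(tokenizer.decode(tokens["input_ids"]))

As far as I know, GPT-2 used <|endoftext|> as a separator between documents during pretraining rather than as a per-sentence marker, which would explain why nothing is added for a single sentence.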

From the API documentation of tokenizers I see the add_special_tokens parameter for the encode method of Tokenizer.

From Putting it all together - Hugging Face NLP Course of this course:
" The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well."
So, I was wondering, when should I set the add_special_tokens parameter to False?
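
To make the effect concrete, here is a quick sketch with a BERT checkpoint; a typical reason to pass add_special_tokens=False is when you assemble the special tokens yourself, for example when concatenating chunks of a long document and only wanting one [CLS]/[SEP] pair in the final sequence:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer("Hello!")["input_ids"])                            # [101, 7592, 999, 102]
print(tokenizer("Hello!", add_special_tokens=False)["input_ids"])  # [7592, 999]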

Hey!

I tried to feed individually tokenized sentence IDs to the model, and then a batch of different sentences (as asked in the challenge in the chapter). I am not sure why I get different logit values for the two methods; am I doing something wrong? My notebook is hosted here: https://github.com/ArindamRoy23/Huggin_Faces_Course/blob/main/HF_Using_Transformers.ipynb
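
Without seeing the notebook I can only guess, but the usual cause is padding without the attention mask: once the batch is padded, the padding tokens shift the logits of the shorter sentence unless attention_mask is passed to the model. A minimal sketch with the chapter’s checkpoint:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

single = tokenizer("I hate this so much!", return_tensors="pt")
batch = tokenizer(
    ["I hate this so much!", "I've been waiting for a HuggingFace course my whole life."],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    print(model(**single).logits[0])  # logits for the sentence on its own
    print(model(**batch).logits[0])   # same values, because attention_mask masks the padding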

I tried the step-by-step method given in the Chapter 2 → “Behind the pipeline” section and it gave me wrong results. I have a feeling that it is giving me the results from my last run and that the result is cached. Is this possible?

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "It’s not a good app.",
    "My experience was really bad.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)
print(outputs.logits)

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

result = model.config.id2label
print(result)

Results:

torch.Size([2, 10, 768])
torch.Size([2, 2])
tensor([[ 4.7721, -3.7753],
        [ 4.6483, -3.7990]], grad_fn=)
tensor([[9.9981e-01, 1.9402e-04],
        [9.9979e-01, 2.1442e-04]], grad_fn=)
{0: 'NEGATIVE', 1: 'POSITIVE'}
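
As far as I know, Transformers does not cache results between runs; every model(**inputs) call recomputes the forward pass. Both of your inputs are negative sentences, so two confident NEGATIVE predictions is the expected outcome. As a sanity check, here is a small continuation (reusing raw_inputs, predictions, and model from the snippet above) that pairs each score with its label:

import torch

# reuses raw_inputs, predictions, and model from the snippet above
for sentence, probs in zip(raw_inputs, predictions):
    label_id = int(torch.argmax(probs))
    print(sentence, "->", model.config.id2label[label_id], f"{float(probs[label_id]):.4f}")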

In the Tokenization section, do we load and save tokenizers in order to enrich their vocab and train them? Take the line tokenizer("Using a Transformer network is simple"), for example: what is the purpose of running that and then saving the tokenizer? Is this how we enrich our tokenizer?
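
My reading of that section: calling the tokenizer only encodes text and never changes its vocabulary; save_pretrained / from_pretrained are just for persisting and reloading the exact same tokenizer. Training a larger or different vocabulary is a separate step (train_new_from_iterator, covered later in the course). A small sketch (the directory name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer("Using a Transformer network is simple"))  # just encodes; the vocab is untouched

tokenizer.save_pretrained("my_tokenizer")                 # writes vocab + config to disk
reloaded = AutoTokenizer.from_pretrained("my_tokenizer")  # the identical tokenizer, reloaded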

Concerning the maximum length of tokens that a model can accept: is that limit there because we use a pre-trained model? Is it possible to initialize a model from scratch and train it in a way that makes it accept longer sequences?
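
The limit comes from the position embeddings the checkpoint was pretrained with, so yes, a model initialized from scratch can be configured for longer sequences (at the cost of pretraining it yourself). A sketch, assuming a BERT-style architecture:

from transformers import BertConfig, BertModel

# randomly initialized model with a longer maximum sequence length
config = BertConfig(max_position_embeddings=1024)
model = BertModel(config)
print(model.config.max_position_embeddings)  # 1024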

Let’s say we have a couple of sequences:

sequences = ["Hello!", "Cool.", "Nice!"]
The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

As a teacher, I would like to use print(encoded_sequences) to show the resulting output.

:pencil2: Try it out! Replicate the two last steps (tokenization and conversion to input IDs) on the input sentences we used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Check that you get the same input IDs we got earlier!

decoded_string = tokenizer.decode([101, 146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932,
                                   2271, 7954, 1736, 1139, 2006, 1297, 119, 102])
print(decoded_string)
[CLS] I’ve been waiting for a HuggingFace course my whole life. [SEP]

decoded_string = tokenizer.decode([146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119])

print(decoded_string)

I’ve been waiting for a HuggingFace course my whole life.

What’s the difference?
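
The only difference between the two ID lists is the 101 and 102 at the ends; assuming the bert-base-cased checkpoint the exercise uses, those decode to the special tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.decode([101]), tokenizer.decode([102]))  # [CLS] [SEP]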

Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter:

sequence = sequence[:max_sequence_length]

I suggest giving an example here:
sequence = "I've been waiting for a HuggingFace course my whole life."

max_sequence_length = 3

sequence = sequence[:max_sequence_length]

print(sequence)