Chapter 2 questions

Hi @vsrinivas, we’re exploring the possibility of adding transcripts for all the videos (English PR here). If this works well on YouTube, we’ll open a call for the community to help translate them, or use Whisper :slight_smile:

Hi @vsrinivas I’m happy to share that the English subtitles are now available for most of the course videos! Hope this helps :slight_smile:

Thanks a lot for letting me know. Really appreciate it.


The “Models expect a batch of inputs” section might be outdated. It seems that tokenizers have been updated to account for single-dimensional input: I couldn’t reproduce the error in the first code snippet.

As of Dec 2022, the lines

input_ids = tf.constant(ids)
input_ids = tf.constant([ids])

produce the exact same output object.
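
One framework-free way to see what the extra brackets change is to compare shapes. The sketch below uses plain nested lists instead of tensors; the `shape` helper and the token IDs are made up for illustration and are not part of the course code. If the model output really is identical for both lines, the library is presumably adding the missing batch dimension itself.

```python
def shape(nested):
    """Return the shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


ids = [101, 2023, 2003, 102]  # hypothetical token IDs, for illustration only

single = ids      # one sequence, no batch dimension
batched = [ids]   # the same sequence wrapped in a batch of size 1

print(shape(single))   # (4,)
print(shape(batched))  # (1, 4)
```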

The course content about model.save_pretrained() is slightly misleading. It gives the example:


and then says it will save two files to “your disk”.

Even though I am using Colab, my assumption was that this would save two files to my local disk, which meant a fair amount of fruitless time fiddling around trying to find the files.

It might be helpful for other newbies to specify that the files are saved to your Colab files (and that you may need to refresh your file view for the new directory to show up).
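
A relative path passed to a save call is generally resolved against the working directory of the process running the code, which on Colab is the remote VM, not your own machine. Here is a minimal sketch for checking where a directory actually lives and what it contains. The helper name, the directory name, and the file names are placeholders, not the ones the course or the library uses.

```python
import os
import tempfile


def describe_save_dir(save_dir):
    """Return the absolute path of save_dir and the sorted file names inside it."""
    abs_path = os.path.abspath(save_dir)
    files = sorted(os.listdir(abs_path)) if os.path.isdir(abs_path) else []
    return abs_path, files


# Demo: a temporary directory stands in for the directory given to the save call.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = os.path.join(tmp, "my_saved_model")  # hypothetical directory name
    os.makedirs(save_dir)
    for name in ("config.json", "weights.bin"):  # placeholder file names
        open(os.path.join(save_dir, name), "w").close()

    abs_path, files = describe_save_dir(save_dir)
    print(abs_path)  # absolute location on the machine running the code
    print(files)
```

Calling `describe_save_dir` with the same string you passed to the save call, in the same notebook session, should show where to look in the file browser.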

Thanks for a great course!

When I run the steps of the sentiment-analysis pipeline sequentially, I get almost the same result, but the precision is different. The pipelined run gives me 15 digits of precision (e.g. “0.993750274181366”), but running the steps sequentially only gives me 7 digits of precision (e.g. “0.9937503”).

Is there a way to set the desired precision?
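
A likely explanation, offered as an assumption and not something confirmed in this thread, is that both numbers are the same 32-bit float: one is shown after conversion to a 64-bit Python float, the other is printed at float32 precision. The sketch below reproduces that effect with the standard library only. The `to_float32` helper is hypothetical.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


score = to_float32(0.9937503)  # the value as a float32 would store it

print(repr(score))      # full-precision repr of the float32 value as a Python float
print(f"{score:.7f}")   # same value formatted to 7 decimal places
print(round(score, 4))  # rounding to any desired number of decimals
```

If that assumption holds, the extra digits carry no additional information, and the displayed precision can be set at print time with a format string or `round`.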

It is said that BERT can be used directly for inference tasks. I thought BERT was an encoder, and therefore unable to make predictions.

After a lot of research, I couldn’t find an answer to this question:

I have a local model that was pretrained on a text regression task, and this model makes its prediction with a single output (a decimal number between 0 and 1).

So how can I load this model and test it, without any modification to the architecture of the model? Which class should I use?

Thank you

Multilingual Tokenizer

  • Please, how does a tokenizer work for a multilingual dataset?
  • Who gets to add the special tokens? Me or the tokenizer?
  • Can I customise my own special tokens other than what comes with AutoTokenizer? Say, [Unq] instead of [CLS].

I’d appreciate some explanation of this, especially the multilingual part.

How can I process long sequences if I do not want to truncate them to a fixed length? Truncated sequences will lose information.
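
One common approach, sketched here under assumptions and not taken from the course, is to split the token IDs into overlapping windows so that every token lands in at least one chunk. The function name and parameters below are hypothetical. The Transformers tokenizers expose a similar idea through the `stride` and `return_overflowing_tokens` arguments, but this plain-Python version does not depend on them.

```python
def chunk_with_overlap(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens, so context is not cut
    abruptly at a chunk boundary.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += step
    return chunks


ids = list(range(10))  # stand-in for real token IDs
chunks = chunk_with_overlap(ids, max_length=4, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk can then be passed to the model separately, and the per-chunk outputs combined in whatever way suits the task (averaging, taking the maximum, and so on).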

I just wanted to note that the concept of “checkpoints” is not introduced (or at least I did not see it). Still, this concept is used everywhere.

The special tokens of gpt2 from tokenizer

From this, the gpt2 tokenizer defines bos_token and eos_token. However, when I tried:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer("Using a Transformer network is simple")

I do not see the beginning-of-sentence and end-of-sentence tokens. Does this mean that the gpt2 model was trained without these tokens?

Many thanks!
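
For reference, the GPT-2 tokenizer uses the same string, `<|endoftext|>`, for both bos_token and eos_token, and it does not insert it automatically when encoding. If the markers are wanted, one option is to add them to the text before tokenizing. This is a minimal sketch; the helper name is made up, and whether adding the markers helps depends on the use case.

```python
# GPT-2 uses one string for both its bos_token and its eos_token.
BOS = EOS = "<|endoftext|>"


def wrap_with_special_tokens(text, bos=BOS, eos=EOS):
    """Return text with explicit begin/end markers added around it."""
    return f"{bos}{text}{eos}"


wrapped = wrap_with_special_tokens("Using a Transformer network is simple")
print(wrapped)
```

The wrapped string can then be given to the tokenizer in place of the raw text.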

It seems to me that, in the Tokenizers/Encoding section, the output given by the tokenize() method on the string “Using a Transformer network is simple” is actually the output of the tokenizer of the bert-base-uncased model, and not of bert-base-cased, as indicated in the code.
The actual output should be ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple'].
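
To illustrate why a cased and an uncased vocabulary can split the same word differently, here is a toy greedy longest-match-first WordPiece sketch. The two vocabularies are tiny and hypothetical, chosen only to reproduce the splits discussed above; the real BERT vocabularies are far larger, and the real tokenizers apply extra normalization steps.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab, lowercase):
    """Split on whitespace, optionally lowercase, then apply wordpiece per word."""
    if lowercase:
        text = text.lower()
    out = []
    for word in text.split():
        out.extend(wordpiece(word, vocab))
    return out


# Hypothetical miniature vocabularies, for illustration only.
cased_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}
uncased_vocab = {"using", "a", "transform", "##er", "network", "is", "simple"}

sentence = "Using a Transformer network is simple"
print(tokenize(sentence, cased_vocab, lowercase=False))
print(tokenize(sentence, uncased_vocab, lowercase=True))
```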