Chapter 3 questions

When I do:
from transformers import AutoTokenizer, AutoModel
I would have expected to find an AutoTokenizer.py file and an AutoModel.py file, but they aren’t there.

(Typically, from module_a import b, c means that b.py and c.py exist as Python files in the module_a directory, right?)

  1. What is wrong with my thinking?
  2. Where can I find the code for AutoTokenizer and AutoModel please?

Thanks

Hey @iamholmes, in general we use the __init__.py file under src/transformers to define all the class imports. For example, here is the import for AutoModel.

Above that line you can see we import from models.auto and indeed here are all the Python files associated with the auto-classes for models and tokenizers.
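
For intuition, here is a minimal sketch of the same pattern with a made-up package (the names below are hypothetical, not the real transformers layout):

# my_package/models/auto/modeling_auto.py
class AutoModel:
    ...

# my_package/__init__.py  -- re-exports the class at the package top level
from .models.auto.modeling_auto import AutoModel

# user code: this works even though there is no AutoModel.py at the top level
from my_package import AutoModel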

Hope that helps!


Hi guys, a quick question about the Accelerate API.

I saw in the course that you explain how to use it in a PyTorch training loop. I was wondering if there’s a way to integrate it in a Trainer API-based loop, or if there is a way to exploit multiple GPUs in the Trainer API itself.

Many thanks!

Hi @Neuroinformatica! You can exploit multiple GPUs with the Trainer (see the docs). By default it will use all available GPUs for training, but you can configure that by setting:

import os

# Restrict training to a single device (set this before torch initializes CUDA)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Hi @sgugger,

First of all, thanks a lot for the Hugging Face course - it’s a great resource to get started with Transformer models!

While I was working my way through chapter 3, I noticed a potentially misleading comment. You write:

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs

But in the calculation afterwards, you only multiply the length of the TensorFlow dataset by the number of epochs.

num_train_steps = len(tf_train_dataset) * num_epochs

I see where you’re coming from since len(tf_train_dataset) corresponds to tokenized_datasets["train"] divided by the batch size. Nevertheless, you could either elaborate a bit more on that in the comment or just leave it out so it’s less misleading for beginners like me. However, this is simply a suggestion and may be obvious or irrelevant to many people!

Hey @bonschorno thanks for the feedback! I think you’re right that we should actually have:

num_train_steps = len(tf_train_dataset) * num_epochs // batch_size

I’ll fix this on the website and notebook - thank you!


Hi @lewtun,
Excellent, thanks for the swift reply. And again: the Hugging Face team did a great job with the whole course!


Thanks @bonschorno - I’m glad you’re enjoying the course :slight_smile:

Btw after chatting with @Rocketknight1 (the TensorFlow maintainer of transformers), he pointed out that:

The tf.data.Dataset objects here already have a batch() operation applied - they’re more like torch DataLoader objects. After a batch operation, their len() is num_samples // batch_size already, so we shouldn’t need to do that division twice.

In any case, we’ll improve the comment because it confused me as well :slight_smile:
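
To make that concrete, here is a tiny sketch (with made-up sizes) of what len() returns once batch() has been applied:

import tensorflow as tf

# Toy dataset: 1000 samples, batch size 8 (made-up numbers)
samples = tf.data.Dataset.range(1000)
batched = samples.batch(8)

print(len(batched))  # 125, i.e. 1000 // 8 -- the dataset already counts batches, not samples

num_epochs = 3
num_train_steps = len(batched) * num_epochs  # no extra division by the batch size needed
print(num_train_steps)  # 375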

I want to save training checkpoints on my local computer. How can I pass the location, similar to the path in TrainingArguments?
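
(For reference, the checkpoint location is controlled by the output_dir argument of TrainingArguments; a minimal sketch, with an example path:)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./my-local-checkpoints",  # checkpoints are written under this local folder
    save_strategy="epoch",                # save a checkpoint at the end of each epoch
)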

Hi,

In Chapter 3, in the section “Processing the data”, this is given:

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works well, but it has the disadvantage of returning a dictionary (with our keys, input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization (whereas the datasets from the :hugs: Datasets library are Apache Arrow files stored on the disk, so you only keep the samples you ask for loaded in memory).

Can somebody explain why returning a dictionary is a disadvantage?

Thanks.
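
(For context, the alternative the course contrasts this with is Dataset.map with a tokenization function, which keeps the data as Arrow files on disk instead of one big in-memory dictionary; a minimal sketch, reusing the tokenizer and raw_datasets from above:)

def tokenize_function(example):
    # No padding here; padding is applied per batch later by a data collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Processes the dataset in batches and stores the result on disk as Arrow files
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)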

Hi, please include a TensorFlow version of the “A full training” section of Chapter 3 as well, as I am following the complete course in TensorFlow.


Hi @sgugger, in the notebook (Google Colab), the code below is the PyTorch version:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(
    tokenized_datasets["train"], batch_size=16, shuffle=True, collate_fn=data_collator
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

Can you please help me convert it into the TensorFlow version?

+1 … it’s not clear what folks who are using TF should do with this section. Thanks!
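
In the meantime, a sketch of the usual TensorFlow equivalent (assuming the tokenizer and the tokenized_datasets from the PyTorch snippet above) is to use to_tf_dataset with a collator that returns TF tensors:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=16,
)

# Mirror the PyTorch loop: each element is a (features, labels) pair
for step, (batch, labels) in enumerate(tf_train_dataset):
    print(batch["input_ids"].shape)
    if step > 5:
        break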

Note that the “evaluate” package mentioned in the example requires a package called “sklearn”, and if you try to run metric.compute() locally you will get a message about needing to run “pip install sklearn”.

However, the “sklearn” package is going through a “brownout” and is only a stub. To get the proper package, you need to install “scikit-learn” instead:

pip install scikit-learn
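
For completeness, after installing scikit-learn the metric from the chapter works as usual; a small sketch with dummy values just to show the call:

import evaluate

metric = evaluate.load("glue", "mrpc")

# Dummy predictions/references to illustrate; real values come from the model
print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))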

Hi,

In this section of Chapter 3, you mention the following:

The Trainer will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use fp16 = True in your training arguments). We will go over everything it supports in Chapter 10.

Do we have a Chapter 10 in the course?


I support you.

I faced such an error and already opened an issue on GitHub, here.

Fine-tuning a model with the Trainer API
Hi @lewtun, I trust you are well.
predictions = trainer.predict(tokenized_datasets["validation"])
Please, how do I make predictions in inference mode?
More like: trainer.predict(["The man is sick"])
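
(Not an official answer, but one common pattern is to wrap the raw text in a datasets.Dataset, tokenize it exactly the way the training data was tokenized, and pass that to trainer.predict; a sketch with made-up sentences, assuming the sentence-pair setup from the chapter:)

from datasets import Dataset

raw = Dataset.from_dict(
    {"sentence1": ["The man is sick"], "sentence2": ["The man is unwell"]}
)
tokenized = raw.map(
    lambda x: tokenizer(x["sentence1"], x["sentence2"], truncation=True), batched=True
)

predictions = trainer.predict(tokenized)
print(predictions.predictions.argmax(axis=-1))  # predicted class per example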

Hi @lewtun
I wrote the training + evaluation loop, but the script never gets to the evaluation part. Please, do you know why?

from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps)) 
count =0
for epoch in range(num_epochs):
  model.train()
  for data in train_dl:
    data = {k:v.to(device) for k,v in data.items()}
    output = model(**data)
    loss = output.loss
    loss.backward()

    optimizer.step()
    x = optimizer
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)


    count+=1
    
  if count % 100 ==0:
    print(count)
    model.eval()
    for data in validation_dl:
      data = {k:v.to(device) for k,v in data.items()}
      with torch.no_grad():
        outputs = model(**batch)
        logits = outputs.logits
        preds = torch.argmax(logits, axis=-1)
        metric.add_batch(predictions=preds, references=data['labels'] )

    metric.compute()
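
One possible fix, sketched below with a couple of guesses about the intent: the if count % 100 == 0: check is indented at the epoch level, so it only runs once per epoch, and the evaluation loop passes **batch while the variable is named data. Moving the check inside the batch loop and printing the metric would look roughly like this:

for epoch in range(num_epochs):
  model.train()
  for data in train_dl:
    data = {k: v.to(device) for k, v in data.items()}
    output = model(**data)
    output.loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)
    count += 1

    if count % 100 == 0:  # now checked after every step, not once per epoch
      model.eval()
      for data in validation_dl:
        data = {k: v.to(device) for k, v in data.items()}
        with torch.no_grad():
          outputs = model(**data)  # the original used **batch, which is not defined here
        preds = torch.argmax(outputs.logits, axis=-1)
        metric.add_batch(predictions=preds, references=data["labels"])
      print(count, metric.compute())  # compute() returns the scores; print or log them
      model.train()  # switch back to training mode before continuing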




At the end of “Processing the data” you suggest a harder challenge of building a processing function that works for all GLUE tasks. I took a swing and had some questions:

  1. It looks like “ax” only has “test” – we don’t do any tokenization of the “test” set that I saw – I imagine we should, though?
  2. “ax” also has “premise” and “hypothesis” – I’m guessing these just become “sentence1” and “sentence2”?

Given these differences, do we basically write conditional code for train/test/validation and “sentence” vs “sentence1/2” vs “hypothesis/premise” or is there a better way to do this? I don’t imagine the “AutoTokenizer” handles this for us in a convenient way?
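
One way to avoid per-task conditionals is to keep a mapping from GLUE task name to its text columns and look it up in the tokenization function. The task_to_keys table below is my own sketch (modelled on the patterns in the transformers GLUE examples), assuming task holds the configuration name passed to load_dataset("glue", task):

task_to_keys = {
    "cola": ("sentence", None),
    "sst2": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
    "stsb": ("sentence1", "sentence2"),
    "rte": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
    "qqp": ("question1", "question2"),
    "qnli": ("question", "sentence"),
    "mnli": ("premise", "hypothesis"),
    "ax": ("premise", "hypothesis"),
}

def tokenize_function(example):
    key1, key2 = task_to_keys[task]
    if key2 is None:
        return tokenizer(example[key1], truncation=True)
    return tokenizer(example[key1], example[key2], truncation=True)

# Works for every split the task provides (train/validation/test, or just test for "ax")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)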

Regarding this statement made in Chapter 3 under the “Fine-tuning a model with the Trainer API” section:

You will notice that unlike in Chapter 2, you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead

I wanted to ask if this is the case with any kind of model. Here, since BERT is not pretrained on classifying pairs of sentences but we are using it for that purpose, the head has been replaced. If it were some other model, maybe GPT, T5, or something else, would the same scenario apply (the head getting replaced to overcome the lack of the specific pretraining objective)?