Chapter 7 questions

Use this topic for any question about Chapter 7 of the course.

1 Like

Hi - this is a relatively simple question, but Iā€™m totally new to Hugging Face, so apologies in advance. In section 3 you discuss domain adaptation.

Iā€™m just experimenting with the task at the end of the section i.e. ā€œTo quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned MiniLM checkpointsā€¦ā€

Can you use the ā€˜Fill-Maskā€™ domain-adapted checkpoint you generated in the course (huggingface-course/distilbert-base-uncased-finetuned-imdb) for a classification task? Or do you have to adapt the original distilbert-base-uncased to the domain specifically for classification?

1 Like

No, you would need to fine-tune it on the classification task next. Itā€™s just that the fine-tuned masked language model might do better, since itā€™s more specialized on your corpus.

Hope that makes sense!

Thank you!! So I should just follow the standard fine-tuning methodology for classification as per Ch. 3 but use the ā€˜fine-tuned masked modelā€™ as the starting checkpoint? i.e.

from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

# Start from the domain-adapted (masked-LM fine-tuned) checkpoint
checkpoint = 'huggingface-course/distilbert-base-uncased-finetuned-imdb'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the body weights and add a fresh sequence-classification head with two labels
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

...

Thatā€™s correct indeed!
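
For readers following along, here is a minimal sketch of what the full recipe could look like in TensorFlow, following the Chapter 3 approach. It is not the official course solution; the variable names, function names, and hyperparameters below are just placeholders:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    DataCollatorWithPadding,
)
import tensorflow as tf

# Domain-adapted checkpoint; swap in "distilbert-base-uncased" to compare
checkpoint = "huggingface-course/distilbert-base-uncased-finetuned-imdb"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

raw_datasets = load_dataset("imdb")

def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(tf_train_dataset, epochs=1)

Running the same recipe from the original distilbert-base-uncased checkpoint and comparing accuracies is one way to quantify the benefit of the domain adaptation.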

Thank you very much!

Perhaps this is a typo? This sentence in the Question answering section:

The answers field is a bit trickier as it comports a dictionary with two fields that are both lists.

I guess here it should be ā€œcomprisesā€.

1 Like

This is with regards to the translation section.

I donā€™t understand the purpose of adding a padding token at the start of the decoder_input_ids.
I understand that decoder_input_ids is the labels shifted by one.

Do the labels hold the ground truth, and when the decoder predicts the next token from decoder_input_ids, is that prediction then compared to the labels as part of training?
i.e.

batch['labels'] = tensor([[   83,  7471,    23, ...]])
batch['decoder_input_ids'] = tensor([[59513,    83,  7471,    23, ...]])
where 59513 is the pad token.

Many thanks

The decoder generates an output by predicting each token one after the other, using:

  • the encoder hidden states from the inputs
  • the previously predicted tokens of the outputs.

But for the very first token, there are no previously predicted tokens yet, so we feed it a special token, which might be the pad token or a special ā€œbeginning of sequenceā€ (BOS) token. This part depends on the exact model.

Thatā€™s why the decoder inputs are the labels shifted by one with this special token at the start.
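
To make the shift concrete, here is a small sketch in plain PyTorch, using the pad/decoder-start id 59513 from the example above (the fourth label token 5283 is just a made-up continuation token for illustration):

import torch

# Labels: the ground-truth target token ids (what the decoder should produce)
labels = torch.tensor([[83, 7471, 23, 5283]])

decoder_start_token_id = 59513  # for this model, the pad token doubles as the start token

# Shift right: prepend the start token and drop the last label, so that at
# position i the decoder sees tokens 0..i-1 and is trained to predict label i
decoder_input_ids = torch.cat(
    [
        torch.full((labels.shape[0], 1), decoder_start_token_id),
        labels[:, :-1],
    ],
    dim=-1,
)

print(decoder_input_ids)  # tensor([[59513,    83,  7471,    23]])

In practice you donā€™t do this by hand: when you pass the model to DataCollatorForSeq2Seq, it builds decoder_input_ids for you by calling the modelā€™s prepare_decoder_input_ids_from_labels method.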

2 Likes

In this code snippet, what is eval_preds? I can see it is the argument to the compute_metrics function, but I donā€™t know what it is, and hence why we know we can unpack it as a tuple.

def compute_metrics(eval_preds):
    preds, labels = eval_preds

I can see it is one of the arguments for Seq2SeqTrainer:

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

But where is it getting its argument, eval_preds, from? The purpose of the compute_metrics function is to compare the predicted values with the actual values.
Are the labels (the actual values) data_collator["labels"]?
Where do we get the predicted values from?

Many thanks.
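
A note for other readers, not an official answer: the Trainer calls compute_metrics itself at evaluation time, passing it an EvalPrediction object that bundles the modelā€™s predictions and the label ids gathered from the eval dataset, which is why it unpacks like a tuple. A rough illustration with made-up toy arrays:

import numpy as np
from transformers import EvalPrediction

# Toy stand-ins for what the Trainer gathers during evaluation (illustrative only)
fake_predictions = np.array([[59513, 83, 7471, 23]])
fake_label_ids = np.array([[83, 7471, 23, -100]])

eval_preds = EvalPrediction(predictions=fake_predictions, label_ids=fake_label_ids)

# EvalPrediction behaves like a (predictions, label_ids) pair, so it can be unpacked
preds, labels = eval_preds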

I have seen ā€œlabelsā€ used in translation

model_inputs["labels"] = labels["input_ids"]

and in token classification

tokenized_inputs["labels"] = new_labels

Is ā€œlabelsā€ always used to hold the ground-truth, please?

Many thanks
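
A hedged illustration rather than an official answer: Transformers models compute their training loss from whatever is passed under the labels argument, so putting the ground truth in a ā€œlabelsā€ column is what lets the Trainer compute the loss automatically. A minimal sketch with a made-up sentiment example (the checkpoint and label value are just placeholders):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(["I loved this film!"], return_tensors="pt")
batch["labels"] = torch.tensor([1])  # ground-truth class id

outputs = model(**batch)
print(outputs.loss)  # loss is computed against the "labels" we supplied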

In the translation section, what is the difference between AutoModelForSeq2SeqLM and AutoModelForCausalLM, please?
Is it:

  • AutoModelForSeq2SeqLM is used for language translation tasks
  • AutoModelForCausalLM is only for text generation (e.g. GPT+)

Many thanks
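
Roughly, yes: AutoModelForSeq2SeqLM loads encoder-decoder (sequence-to-sequence) models such as the Marian translation models, while AutoModelForCausalLM loads decoder-only models such as GPT-2 that generate text left to right. A quick illustration (the checkpoints are just examples):

from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder model with a language-modeling head on the decoder
# (translation, summarization, ...)
translation_model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Decoder-only causal language model (left-to-right text generation)
generation_model = AutoModelForCausalLM.from_pretrained("gpt2")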

In Chapter 7, Question answering task, when running this code:

tf_train_dataset = train_dataset.to_tf_dataset(
    columns=[
        "input_ids",
        "start_positions",
        "end_positions",
        "attention_mask",
        "token_type_ids",
    ],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
)
tf_eval_dataset = validation_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    shuffle=False,
    batch_size=16,
)

I get this error:
TypeError: to_tf_dataset() missing 1 required positional argument: ā€˜collate_fnā€™
How should I handle this, especially as the validation and train datasets already include padding to the max length?

1 Like

Hi @Abirate, thank you for this bug report! This is our fault - we recently changed the to_tf_dataset method to always require a collate_fn. Iā€™m working on updating the course materials right now, and Iā€™ll let you know as soon as a fixed version is available.
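
In the meantime, one way to satisfy the new requirement is to pass a simple collator explicitly. A sketch of what that could look like here, since the datasets are already padded so DefaultDataCollator only needs to stack them into batches (note this sketch also drops dummy_labels, which newer versions of to_tf_dataset may no longer accept):

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = train_dataset.to_tf_dataset(
    columns=[
        "input_ids",
        "start_positions",
        "end_positions",
        "attention_mask",
        "token_type_ids",
    ],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)
tf_eval_dataset = validation_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)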

2 Likes

@Rocketknight1 Ok, thanks

Hi,

I had this problem running the evaluation on Colab. Any ideas?

***** Running Evaluation *****
  Num examples = 21018
  Batch size = 64
[329/329 21:25]

TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 trainer.evaluate(max_length=max_target_length)

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   2407
   2408         if all_losses is not None:
-> 2409             metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
   2410
   2411         # Prefix all keys with metric_key_prefix + '_'

TypeError: 'NoneType' object does not support item assignment

1 Like

Hey @mbateman in which section in chapter 7 do you find this error? Iā€™d like to run the relevant Colab notebook myself to see if I can reproduce the error :slight_smile:

Hi @lewtun, thanks for getting back to me. This was in the fine-tuning subsection of the translation section:

trainer.evaluate(max_length=max_target_length)
trainer.train()
trainer.evaluate(max_length=max_target_length)

Doesnā€™t happen when run locally.

Hope that helps.

Michael

I encountered the same problem, and it seems the issue is that compute_metrics does not return anything - metric.compute is never called inside the function. Since the return value is missing, metrics ends up as None, which then triggers the NoneType item assignment error. Adding return metric.compute(predictions=decoded_preds, references=decoded_labels) to compute_metrics solved the problem for me.

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    
    return metric.compute(predictions=decoded_preds, references=decoded_labels)
1 Like

Thanks for catching this bug @PyaePK ! Iā€™ll post a fix to the website and notebooks :hugs: