Use this topic for any question about Chapter 7 of the course.
Hi - this is a relatively simple question, but I'm totally new to Hugging Face, so apologies in advance. In section 3 you discuss domain adaptation.
I'm just experimenting with the task at the end of the section, i.e. "To quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned MiniLM checkpoints…"
Can you use the "Fill-Mask" domain-adapted checkpoint you generated in the course (huggingface-course/distilbert-base-uncased-finetuned-imdb) for a classification task? Or do you have to adapt the original distilbert-base-uncased to the domain specifically for classification?
No, you would need to fine-tune it on the classification task next. It's just that the fine-tuned masked language model might do better, since it's more specialized on your corpus.
Hope that makes sense!
Thank you!! So I should just follow the standard fine-tuning methodology for classification as per Ch. 3 but use the "fine-tuned masked model" as the starting checkpoint? i.e.
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

# Start from the domain-adapted (masked-LM fine-tuned) checkpoint
checkpoint = 'huggingface-course/distilbert-base-uncased-finetuned-imdb'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load it with a fresh two-label classification head
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
...
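…and then, if I've understood Chapter 3 correctly, compile and train as usual? Roughly something like this, where tf_train_dataset stands in for whatever tokenized IMDb split I prepare (just my sketch, not the course code):

import tensorflow as tf

# Sketch of the Ch. 3 recipe; tf_train_dataset is a placeholder for the
# tokenized IMDb training split converted to a tf.data.Dataset.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(tf_train_dataset, epochs=3)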
That's correct indeed!
Thank you very much!
Perhaps this is a typo? This sentence in the Question answering section:
"The answers field is a bit trickier as it comports a dictionary with two fields that are both lists."
I guess here it means comprises.
This is with regards to the translation section.
I don't understand what the purpose is of adding a padding token at the start of the decoder_input_ids.
I understand that the decoder_input_ids are the labels shifted by one.
Do the labels hold the ground truth, so that when the model predicts the next token from the decoder_input_ids, we then compare it to the labels as part of the training?
i.e.
batch['labels'] = tensor([[ 83, 7471, 23, ...]])
batch['decoder_input_ids'] = tensor([[59513, 83, 7471, 23, ...]])
where 59513 is the pad token.
Many thanks
The decoder generates an output by predicting each token one after the other, using:
- the encoder hidden state from the inputs
- the previously predicted tokens of the outputs.
But for the very first token there are no previously predicted tokens, so we feed it a special token, which might be the pad token or a special "beginning of stream" (bos) token. This part depends on the exact model.
That's why the decoder inputs are the labels shifted by one, with this special token at the start.
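To make that concrete, here is a rough sketch of the shift (not the exact model code - the real implementation also takes care of replacing any -100 padding in the labels):

import torch

def shift_tokens_right(labels, decoder_start_token_id):
    # Prepend the start token (the pad token, 59513, for this Marian model)
    # and drop the last label, so each decoder position only sees the tokens
    # that come before the one it has to predict.
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    return decoder_input_ids

labels = torch.tensor([[83, 7471, 23]])
print(shift_tokens_right(labels, decoder_start_token_id=59513))
# tensor([[59513,    83,  7471]])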
In this code snippet, what is eval_preds? I can see it is the argument to the compute_metrics function, but I don't know what it is, and hence why we know we can unpack it as a tuple.
def compute_metrics(eval_preds):
    preds, labels = eval_preds
I can see it is one of the arguments for Seq2SeqTrainer:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
But where is it getting its argument, eval_preds, from? The purpose of the compute_metrics function is to compare the predicted values with the actual values.
Are the labels (actual values) data_collator["labels"]?
Where do we get the predicted values from?
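From skimming the Trainer docs, my guess is that during evaluation it gathers the predictions and labels over the whole eval_dataset and wraps them in an EvalPrediction before calling compute_metrics, something like this (just my understanding, not the actual Trainer internals):

import numpy as np
from transformers import EvalPrediction

# Dummy stand-ins for what the Trainer would gather over the eval set
all_preds = np.zeros((4, 12), dtype=int)      # e.g. generated token ids
all_label_ids = np.zeros((4, 12), dtype=int)  # ground-truth label ids

eval_preds = EvalPrediction(predictions=all_preds, label_ids=all_label_ids)
preds, labels = eval_preds  # which is why the tuple unpacking works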
Many thanks.
I have seen "labels" used in translation
model_inputs["labels"] = labels["input_ids"]
and in token classification
tokenized_inputs["labels"] = new_labels
Is "labels" always used to hold the ground truth, please?
Many thanks
In the translation section, what is the difference between
AutoModelForSeq2SeqLM and AutoModelForCausalLM please?
Is it:
AutoModelForSeq2SeqLM is used for language translation tasks, and
AutoModelForCausalLM is only for text generation (e.g. GPT-style models)?
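i.e. I would expect something like this, using the translation checkpoint from the course and gpt2 purely as examples (not sure if this is the right mental model):

from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder checkpoint (what the translation section uses)
translation_model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Decoder-only checkpoint (autoregressive text generation, GPT-2 here)
generation_model = AutoModelForCausalLM.from_pretrained("gpt2")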
Many thanks
In Chapter 7:
Task: Question answering
When running this code:
tf_train_dataset = train_dataset.to_tf_dataset(
    columns=[
        "input_ids",
        "start_positions",
        "end_positions",
        "attention_mask",
        "token_type_ids",
    ],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
)
tf_eval_dataset = validation_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    shuffle=False,
    batch_size=16,
)
I get this error:
TypeError: to_tf_dataset() missing 1 required positional argument: 'collate_fn'
How should I do this, especially as the validation and train datasets already include padding to the max length?
Hi @Abirate, thank you for this bug report! This is our fault - we recently changed the to_tf_dataset method to always require a collate_fn. I'm working on updating the course materials right now, and I'll let you know as soon as a fixed version is available.
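In the meantime, something like this should work as a stopgap: since your features are already padded to the max length, the default data collator just needs to stack them into TensorFlow tensors (a quick sketch, not the final course code):

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = train_dataset.to_tf_dataset(
    columns=[
        "input_ids",
        "start_positions",
        "end_positions",
        "attention_mask",
        "token_type_ids",
    ],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)

# The same collate_fn can be passed when building tf_eval_dataset as well.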
@Rocketknight1 Ok, thanks
Hi,
I had this problem running the evaluation on Colab. Any ideas?
***** Running Evaluation *****
Num examples = 21018
Batch size = 64
[329/329 21:25]
TypeError                                 Traceback (most recent call last)
in ()
----> 1 trainer.evaluate(max_length=max_target_length)

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   2407
   2408         if all_losses is not None:
-> 2409             metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
   2410
   2411         # Prefix all keys with metric_key_prefix + '_'

TypeError: 'NoneType' object does not support item assignment
Hey @mbateman in which section in chapter 7 do you find this error? I'd like to run the relevant Colab notebook myself to see if I can reproduce the error
Hi @lewtun, thanks for getting back to me. This was in the fine-tuning subsection of the translation section:
trainer.evaluate(max_length=max_target_length)
trainer.train()
trainer.evaluate(max_length=max_target_length)
Doesn't happen when run locally.
Hope that helps.
Michael
I encountered the same problem, and it seems the issue is that compute_metrics does not return anything: metric.compute is never used inside that function. Since the return value is missing, metrics ends up as None, which then leads to the NoneType item assignment error. Adding return metric.compute(predictions=decoded_preds, references=decoded_labels) to compute_metrics solved the problem for me:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    return metric.compute(predictions=decoded_preds, references=decoded_labels)
Thanks for catching this bug @PyaePK! I'll post a fix to the website and notebooks