Chapter 7 questions

Hi, I have a question regarding the token classification notebook, referring to this code snippet. How can I print the results for the O labels, I- labels, and B- labels?


import numpy as np

# `trainer`, `label_list` and `metric` are defined earlier in the notebook
predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

Hey @ghadeermobasher, I think with a few tweaks you should be able to get the entities per token by adapting the example in the docs here: Summary of the tasks
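
For example, something along these lines, adapted from that page (just a sketch; "my-ner-checkpoint" is a placeholder for your fine-tuned model, and ignore_labels=[] keeps the tokens predicted as O, which the pipeline filters out by default):

from transformers import pipeline

# Run the fine-tuned checkpoint through the token-classification pipeline to
# print the predicted tag (O, B-*, I-*) and score for every token.
token_classifier = pipeline(
    "token-classification", model="my-ner-checkpoint", ignore_labels=[]
)
for token in token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."):
    print(token["word"], token["entity"], round(token["score"], 3))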

Hi all!

In Chapter 7, ‘Fine-tuning a masked language model’, there is a default data collator for subwords and a custom data collator called whole_word_masking_data_collator.

When I follow the tutorial, the default data collator runs fine, but when I use whole_word_masking_data_collator in the Trainer I get the error KeyError: 'word_ids'.

Does anyone know how to fix this?

Greetings!

Thanks for reporting this bug @Louisiana! I’ll post a fix, but for now you can set remove_unused_columns=False inside TrainingArguments (docs)
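
In other words, something like this (a minimal sketch; the output directory is just a placeholder for whatever you were already using):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-finetuned-imdb",  # placeholder: use your own output dir
    remove_unused_columns=False,  # keep the "word_ids" column for whole_word_masking_data_collator
)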

h/t @sgugger for the suggested fix :slight_smile:

@sgugger @lewtun Thanks for the fix!

Hi :hugs:

I am experiencing a quite weird and unexplained issue/bug: using Google Colab, I am doing the course section on summarization.
However, when fitting the model, I am using the PushToHubCallback to save and upload the checkpoints to the Hub. Unfortunately, even though the training is done, the model files are created and stored locally in Colab (/content/mt5-small-finetuned-amazon-en-es) but never uploaded to my HF Hub… The fit command keeps running, without any additional output. I have been trying to fix this for hours now but haven’t yet found the reason or a solution; I would appreciate your help please.
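
For reference, my setup follows the course notebook and looks roughly like this (the tokenizer and the tf_* datasets are built earlier in the section):

from transformers.keras_callbacks import PushToHubCallback

# Save checkpoints locally and push them to the Hub during training
callback = PushToHubCallback(
    output_dir="mt5-small-finetuned-amazon-en-es", tokenizer=tokenizer
)

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=8,
)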

Thanks so much !

In the question answering section, I think there is a bug in this line:

if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
    start_positions.append(0)
    end_positions.append(0)

It should be:

if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
    start_positions.append(0)
    end_positions.append(0)

Am I correct? This is because earlier in the section, it’s written that:

“We will also set those labels (0, 0) in the unfortunate case where the answer has been truncated so that we only have the start (or end) of it.”

To illustrate: Context 1 fully contains the answer, Context 2 starts after the answer starts, and Context 3 ends before the answer ends.

Hi @Amba thanks for reporting this issue - I was able to reproduce it on my end, even when I restrict training to 1 epoch and just 10 examples for the train and validation datasets. I can also confirm that running model.fit() without the PushToHubCallback runs as expected, e.g. this works:

model.fit(
    tf_train_dataset, validation_data=tf_eval_dataset, epochs=1
)

@Rocketknight1 would you mind taking a look at this as it seems to be a deeper problem in the PushToHubCallback?

Hi @Sadhaklal, I think you’re right about the inequalities in the offsets - your fix is similar to what is also present in the question-answering scripts in transformers (there it’s just the negation of your fix).
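
Roughly, the check there is written in its negated form, something like this sketch (not the exact script code):

# Only compute token start/end positions when the answer is fully inside the
# context span; otherwise label the example with (0, 0).
if offset[context_start][0] <= start_char and offset[context_end][1] >= end_char:
    # ... map start_char / end_char to token indices ...
    pass
else:
    start_positions.append(0)
    end_positions.append(0)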

I’ll post a fix to the website and the notebooks - thanks!

Thanks Lewis for confirming. Looking forward to reading your book after finishing the course. :slightly_smiling_face:

Will try to investigate this one today!

Hi @Amba and @lewtun, I investigated this but I wasn’t able to reproduce the problem! I ran the code locally with PushToHubCallback and the model was successfully uploaded.

There are a few possibilities I can think of:

  1. Low upload bandwidth - the model is very large, and it’s normal for the PushToHubCallback to upload the model without additional console printing after the end of training. This can take quite a long time for a large model, even if you only train it for a few samples!
  2. Outages at hf.co - we’ve had a few of those recently, which caused a lot of test failures. This might have caused the upload to hang.
  3. Some issue with Colab - I’m going to try to rerun all the code on Colab instead of my local environment and see if I can reproduce the problem there.

Update: I’m seeing the bug in Colab! It’s not just a bandwidth issue - the mid-training uploads are hanging for some reason. We’re investigating now.

Quick update for @Amba and @lewtun - we pushed a fix for this issue. It’ll be in the next release of Transformers, but in the meantime you can install directly from GitHub by replacing the line

!pip install transformers

in your Colab notebook with

!pip install git+https://github.com/huggingface/transformers.git

Let us know if you encounter any other problems with it!

Hi!

In the question answering section, I’m trying to execute the compute_metrics function for the validation dataset

compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets["validation"])

but I get the following error:

Input In [22], in compute_metrics(start_logits, end_logits, features, examples)
     30             if (
     31                 end_index < start_index
     32                 or end_index - start_index + 1 > max_answer_length
     33             ):
     34                 continue
     36             answer = {
---> 37                 "text": context[offsets[start_index][0] : offsets[end_index][1]],
     38                 "logit_score": start_logit[start_index] + end_logit[end_index],
     39             }
     40             answers.append(answer)
     42 # Select the answer with the best score

IndexError: list index out of range

I had to replace

predictions, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions

with

predictions = trainer.predict(validation_dataset)
start_logits, end_logits = predictions.predictions

before executing compute_metrics.
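
If it helps, my understanding is that trainer.predict returns a PredictionOutput namedtuple, which is why reading its .predictions field explicitly works; a quick way to check what comes back is:

outputs = trainer.predict(validation_dataset)
print(type(outputs))    # a PredictionOutput namedtuple
print(outputs._fields)  # ('predictions', 'label_ids', 'metrics')
start_logits, end_logits = outputs.predictions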

Was anyone able to run this section of the Colab without errors?

Hi everyone,
In Chapter 7, “Fine-tuning a masked language model”, when I run the whole_word_masking_data_collator in Colab I get this error:

ImportError: cannot import name 'tf_default_data_collator' from 'transformers.data' (/usr/local/lib/python3.7/dist-packages/transformers/data/__init__.py)

Does anyone know how to fix this?

Thank you!

Hi
In the summarization tutorial, when we generate predictions for evaluation, the batch of data is passed to the model with two stars:

predictions = model.generate(**batch)

like that. I wonder what these two stars mean?
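
My guess is that they unpack the batch dictionary into keyword arguments, so the call would be roughly equivalent to spelling the keys out explicitly:

# `**batch` spreads the dict's keys as keyword arguments, so this is roughly:
predictions = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    # ...plus any other keys present in the batch
)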

Thanks in advance!

Hey everyone,
After completing the translation tutorial, I am receiving this error.

InvalidArgumentError: Exception encountered when calling layer "encoder" (type TFMarianEncoder).

cannot compute Mul as input #1(zero-based) was expected to be a half tensor but is a float tensor [Op:Mul]

Call arguments received:
  • input_ids=tf.Tensor(shape=(3, 5), dtype=int32)
  • inputs_embeds=None
  • attention_mask=tf.Tensor(shape=(3, 5), dtype=bool)
  • head_mask=None
  • output_attentions=False
  • output_hidden_states=False
  • return_dict=True
  • training=False

Does anybody know how to solve this?

Thank you very much

An error with parallelism when tokenizing using dataset.map

I encountered an error running this code. I ran it without problems a few weeks ago. Apparently, if batched=True, the function requires all inputs to have the same length. But I don’t want to pad my inputs at this step. Does anyone have any idea how to solve it?

Hi, everyone.
Thanks for reading this.
I did the summarization task to understand fine-tuning with the mT5 model.
My machine has two GPUs: one is an A5000, the other is an A4000.
In the “Fine-tuning mT5 with the Trainer API” section, when I run trainer.train(), I get the errors below.

/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py:30: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
***** Running training *****
  Num examples = 9672
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4840
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 [ 606/4840 03:42 < 25:56, 2.72 it/s, Epoch 1/8]
Epoch	Training Loss	Validation Loss
Saving model checkpoint to mt5-small-finetuned-amazon-en-es/checkpoint-500
Configuration saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/config.json
Model weights saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/pytorch_model.bin
tokenizer config file saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/tokenizer_config.json
Special tokens file saved in mt5-small-finetuned-amazon-en-es/checkpoint-500/special_tokens_map.json
Copy vocab file to mt5-small-finetuned-amazon-en-es/checkpoint-500/spiece.model
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py:30: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
***** Running Evaluation *****
  Num examples = 238
  Batch size = 16
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-32-3435b262f1ae> in <module>
----> 1 trainer.train()

/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1346 
   1347             self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
-> 1348             self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1349 
   1350             if DebugOption.TPU_METRICS_DEBUG in self.args.debug:

/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1443         metrics = None
   1444         if self.control.should_evaluate:
-> 1445             metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
   1446             self._report_to_hp_search(trial, epoch, metrics)
   1447 

/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, max_length, num_beams)
     73         self._max_length = max_length if max_length is not None else self.args.generation_max_length
     74         self._num_beams = num_beams if num_beams is not None else self.args.generation_num_beams
---> 75         return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
     76 
     77     def predict(
...
-> 2183     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2184 
   2185 

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

I also tried running with CUDA_VISIBLE_DEVICES=0 python train.py, but the error still happens.
When I run this code on a machine with one GPU, it is OK.
I want to run this code on the two-GPU machine.

In the Accelerate section, the tokenized_datasets.set_format("torch") command causes an error when I run the training code.
The error message is below:

Traceback (most recent call last):
  File "summary_train.py", line 211, in <module>
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.8/dist-packages/accelerate/data_loader.py", line 301, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py", line 531, in __call__
    feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'

When I remove

tokenized_datasets.set_format("torch")

the code runs OK.
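
For reference, the working setup without the set_format call looks roughly like this (tokenizer, model and tokenized_datasets come from earlier in the section, with the raw text columns already removed); the data collator pads the plain Python lists and converts them to tensors itself:

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

# Pads inputs and labels per batch and returns PyTorch tensors
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)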