Chapter 7 questions

I’m trying this exercise:

Try it out! Getting rid of all the chunks that are smaller than the context size wasn’t a big issue here because we’re using small context windows. As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an eos_token_id token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the tokenize() function to make use of that approach. Note that you’ll want to set truncation=False and remove the other arguments from the tokenizer to get the full sequence of token IDs.

I subsetted the dataset to be able to experiment with the processing rapidly.

from datasets import DatasetDict

# Keep only 100 samples per split so the preprocessing runs quickly
raw_datasets_short = DatasetDict(
    {
        "train": ds_train.select(range(100)),
        "valid": ds_valid.select(range(100)),
    }
)

I modified the function so that the sequences aren’t truncated. Note also that I’m using a BERT-based checkpoint, because I noticed that the “huggingface-course/code-search-net-tokenizer” checkpoint from the lesson doesn’t automatically add special tokens. Side note: where in the model card/API would we check whether a given checkpoint automatically adds special tokens?
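(One quick way to check, rather than the model card: tokenize a short string with and without add_special_tokens and compare. A minimal sketch:)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")
with_special = tok("def foo(): pass")["input_ids"]
without_special = tok("def foo(): pass", add_special_tokens=False)["input_ids"]
# If the two lists are identical, this tokenizer does not add special tokens automatically
print(with_special == without_special)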

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def tokenize_long(element):
    outputs = tokenizer(
        element["content"],
        truncation=False,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        input_batch.append(input_ids)
    return {"input_ids": input_batch}

alt_tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 100
    })
    valid: Dataset({
        features: ['input_ids'],
        num_rows: 100
    })
})

My questions are:

  1. Is the objective to concatenate all samples from one batch into one long sequence and then break it up into chunks? My issue is how to break it up into chunks; it seems like I need to do this in the for loop? (See the sketch after these questions.)
  2. What should the shape of the resultant dataset look like (i.e. how many records in train and test)?
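
For what it’s worth, here is a minimal sketch of the concatenate-and-chunk approach described in the exercise (context_length and the GPT-2 tokenizer are assumptions; adapt them to the lesson’s setup):

from transformers import AutoTokenizer

context_length = 128  # assumed context window size
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer that defines eos_token_id

def tokenize_concat(element):
    outputs = tokenizer(element["content"], truncation=False)
    # Join all tokenized samples in the batch into one long sequence,
    # separated by the EOS token
    concatenated = []
    for input_ids in outputs["input_ids"]:
        concatenated.extend(input_ids + [tokenizer.eos_token_id])
    # Slice the long sequence into fixed-size chunks; only the final,
    # shorter-than-context chunk is dropped
    input_batch = [
        concatenated[i : i + context_length]
        for i in range(0, len(concatenated) - context_length + 1, context_length)
    ]
    return {"input_ids": input_batch}

On question 2: since each raw sample yields a variable number of chunks, the mapped dataset will generally not have the same number of rows as the raw one (Dataset.map with batched=True and remove_columns lets the row count change), so the final shape depends on the total number of tokens rather than on a fixed record count.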

I am getting the exact same error and could not find a solution. I am also getting an error for the mid score. Can anyone help? @lewtun

Hi, I’m relatively new to this platform. I have been reading the Hugging Face docs, and it is one of the best-designed sites, with all the information in one place.
Question: I have a similar requirement in my project, so I’m trying a dry run before playing around with my proprietary data. Unfortunately, I’m running into this error.
TypeError: DistilBertForQuestionAnswering.forward() got an unexpected keyword argument 'token_type_ids'

while executing this code.

model = AutoModelForQuestionAnswering.from_pretrained("…/my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)

I figured it out. It’s because of the different tokenizer. Please ignore.
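
For anyone hitting the same TypeError: DistilBERT models don’t accept token_type_ids, so it usually means the tokenizer comes from a different (BERT-style) checkpoint than the model. A minimal sketch of the workaround, with the checkpoint name below as a placeholder:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "distilbert-base-uncased"  # placeholder; load the tokenizer from the same checkpoint family as the model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

inputs = tokenizer("Who maintains the course?", "The course is maintained by Hugging Face.", return_tensors="pt")
inputs.pop("token_type_ids", None)  # harmless if absent; removes the key if a BERT tokenizer added it
with torch.no_grad():
    outputs = model(**inputs)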

Hi, I have been following the course. Right now I’m in Chapter 7 (Main NLP tasks), section “Token classification”. I ran the notebook correctly, but I got an error in the part where you propose taking a look at the full training loop. In “Preparing everything for training”, you wrote:
“Lastly, to push our model to the Hub, we will need to create a Repository object in a working folder. First log in to Hugging Face, if you’re not logged in already. We’ll determine the repository name from the model ID we want to give our model (feel free to replace the repo_name with your own choice; it just needs to contain your username, which is what the function get_full_repo_name() does):”

It is here that the error arose:

Effectively, the repository doesn’t exist in my account. Is this correct? I thought Repository() would be in charge of cloning it. Shouldn’t the repository be created before cloning it? Or should I have cloned it manually before running this cell?
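
(Not certain this is the intended flow, but Repository(clone_from=...) expects the repo to already exist on the Hub; you can create it first with create_repo. A minimal sketch, where the model name is a placeholder:)

from huggingface_hub import Repository, create_repo, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"          # placeholder; use your own repo name
repo_name = get_full_repo_name(model_name)            # prefixes your username
create_repo(repo_name, exist_ok=True)                 # make sure the repo exists on the Hub
output_dir = "bert-finetuned-ner-accelerate"
repo = Repository(output_dir, clone_from=repo_name)   # now cloning should succeed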

Thank you in advance for any help.

I ran the “Training a causal language model from scratch” notebook several times, both on my local computer and on Colab, the only modification being that I use resume_from_checkpoint=True, and the result is the same: the model trains but gives bad results like

# create scatter plot with x, y
fig, ax = plt.subplots()
ax.scatter(

I would not advise taking too much time on training it.

In this chapter there is a sub-chapter dealing with training a causal language model from scratch. Here is a quote from it:

:pencil2: Try it out! Getting rid of all the chunks that are smaller than the context size wasn’t a big issue here because we’re using small context windows. As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an eos_token_id token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the tokenize() function to make use of that approach. Note that you’ll want to set truncation=False and remove the other arguments from the tokenizer to get the full sequence of token IDs.

I have two questions:

  1. I have exactly this use case. I can’t throw away the final chunks that are too short, since my documents are comparable in size to the context window. I am a beginner with NLP models. Is there any notebook or blog post where I can check my solution to this question?
  2. Why does DataCollatorForLanguageModeling(tokenizer, mlm=False) also return an attention_mask, which is no longer relevant once all chunks have a constant size (we drop the last chunk if it is shorter)? Since there is no padding, the mask seems pointless to me. (See the sketch after these questions.)
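
A minimal sketch of what the collator returns for fixed-length chunks (assuming a GPT-2 tokenizer with its pad token set to EOS, as in the lesson): the attention_mask is indeed all ones here, so it carries no information; the collator emits it anyway because it pads through tokenizer.pad, which returns a mask by default, and an all-ones mask just means "attend to everything".

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Two chunks that already have the same length: no padding is applied
batch = data_collator([{"input_ids": [10, 11, 12, 13]}, {"input_ids": [20, 21, 22, 23]}])
print(batch.keys())              # dict_keys(['input_ids', 'attention_mask', 'labels'])
print(batch["attention_mask"])   # all ones, shape (2, 4)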

Issue with DefunctDatasetError in Chapter 7 Colab Example

Hello,

I encountered a DefunctDatasetError while executing the Colab example linked in Chapter 7. According to the error message, the “amazon_reviews_multi” dataset is no longer available, which seems to be causing this issue. Could you please confirm if this dataset has been replaced with another one? If so, could you provide the name of the alternative dataset?

Thank you for your assistance.


In compute_metrics of the Translation tutorial, we convert the labels that have -100 to the pad_token_id

labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

However, why don’t we do the same for the predictions? I understand that the DataCollatorForSeq2Seq converts the padded tokens to -100 as well. I ran into this because in my actual use case (running a facebook/bart-large model), I was getting errors prior to doing the same conversion on the predictions.

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... (the metric computation on the decoded strings continues as in the lesson)

I’m facing the same issue.

“DefunctDatasetError: Dataset ‘amazon_reviews_multi’ is defunct and no longer accessible due to the decision of data providers”

Does someone have the dataset downloaded? If so, can you please share a link (e.g., Google Drive / Dropbox)?

@sgugger Mentioning you to bring this issue to your attention.

I chose another dataset because I didn’t need a multilingual model. I just did a simple practice run of Chapter 7. Please refer to the example code below.

# install libraries
!pip install transformers datasets rouge-score

# 2. load model & tokenizer: facebook/bart-large-cnn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# 3. load dataset: cnn_dailymail
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# dataset structure
dataset

# use the first example of the train split
example_text = dataset["train"][0]["article"]

# print the example text
print("Original Text:")
print(example_text)

# tokenize and preprocess
input_ids = tokenizer.encode(example_text, return_tensors="pt", max_length=1024, truncation=True)

# generate a summary
summary_ids = model.generate(input_ids, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)

# decode the generated summary
generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# evaluate the example with ROUGE
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_scores = scorer.score(example_text, generated_summary)

print("Original Text: ")
print(example_text)

print("Summary Text: ")
print(generated_summary)

print("ROUGE scores: ")
print(rouge_scores)


Nice! Thanks for sharing!


Hi, I followed the code from the lesson and have the following error:

```
RuntimeError: Could not infer dtype of NoneType

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[21], line 3
      1 import math
----> 3 eval_results = trainer.evaluate()
      4 print(f">>> Perplexity before training: {math.exp(eval_results['eval_loss']):.2f}")

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:3011, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3008 start_time = time.time()
   3010 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3011 output = eval_loop(
   3012     eval_dataloader,
   3013     description="Evaluation",
   3014     # No point gathering the predictions if there are no metrics, otherwise we defer to
   3015     # self.args.prediction_loss_only
   3016     prediction_loss_only=True if self.compute_metrics is None else None,
   3017     ignore_keys=ignore_keys,
   3018     metric_key_prefix=metric_key_prefix,
   3019 )
   3021 total_batch_size = self.args.eval_batch_size * self.args.world_size
   3022 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:3190, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3188 observed_num_examples = 0
   3189 # Main evaluation loop
-> 3190 for step, inputs in enumerate(dataloader):
   3191     # Update the observed num examples
   3192     observed_batch_size = find_batch_size(inputs)
   3193     if observed_batch_size is not None:

File /opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py:384, in DataLoaderShard.__iter__(self)
    382 # We iterate one batch ahead to check when we are at the end
    383 try:
--> 384     current_batch = next(dataloader_iter)
    385 except StopIteration:
    386     yield

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:633, in _BaseDataLoaderIter.__next__(self)
    630 if self._sampler_iter is None:
    631     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    632     self._reset()  # type: ignore[call-arg]
--> 633 data = self._next_data()
    634 self._num_yielded += 1
    635 if self._dataset_kind == _DatasetKind.Iterable and \
    636         self._IterableDataset_len_called is not None and \
    637         self._num_yielded > self._IterableDataset_len_called:

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py:677, in _SingleProcessDataLoaderIter._next_data(self)
    675 def _next_data(self):
    676     index = self._next_index()  # may raise StopIteration
--> 677     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    678     if self._pin_memory:
    679         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File /opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     52 else:
     53     data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)

File /opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py:45, in DataCollatorMixin.__call__(self, features, return_tensors)
     43     return self.tf_call(features)
     44 elif return_tensors == "pt":
---> 45     return self.torch_call(features)
     46 elif return_tensors == "np":
     47     return self.numpy_call(features)

File /opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py:732, in DataCollatorForLanguageModeling.torch_call(self, examples)
    729 def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
    730     # Handle dict or lists with proper padding and conversion to tensor.
    731     if isinstance(examples[0], Mapping):
--> 732         batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
    733     else:
    734         batch = {
    735             "input_ids": _torch_collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)
    736         }

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3295, in PreTrainedTokenizerBase.pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   3292             batch_outputs[key] = []
   3293         batch_outputs[key].append(value)
-> 3295 return BatchEncoding(batch_outputs, tensor_type=return_tensors)

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:223, in BatchEncoding.__init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    219     n_sequences = encoding[0].n_sequences
    221 self._n_sequences = n_sequences
--> 223 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)

File /opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:764, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    759         if key == "overflowing_tokens":
    760             raise ValueError(
    761                 "Unable to create tensor returning overflowing tokens of different lengths. "
    762                 "Please see if a fast version of this tokenizer is available to have this feature available."
    763             ) from e
--> 764         raise ValueError(
    765             "Unable to create tensor, you should probably activate truncation and/or padding with"
    766             " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"
    767             f" features (`{key}` in this case) have excessive nesting (inputs type `list` where type `int` is"
    768             " expected)."
    769         ) from e
    771 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`word_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```

If I comment out these lines in the tokenize_function:

```
if tokenizer.is_fast:
    result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
```
then it works. However, I cannot use whole_word_masking_data_collator if I do so.
Do you know how to fix the error?
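
In case it helps, my understanding (not a confirmed fix): the word_ids lists contain None for special tokens, and the plain DataCollatorForLanguageModeling tries to pad and tensorize every column it receives, hence the "Could not infer dtype of NoneType" error. The lesson's whole_word_masking_data_collator pops word_ids itself, so one workaround is to keep the column only where that collator is used and drop it otherwise. A sketch, assuming the downsampled_dataset and collator names from the lesson:

```
# keep `word_ids` for the whole-word-masking collator, which pops it itself
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

# or, when using the plain DataCollatorForLanguageModeling, drop the column first
plain_dataset = downsampled_dataset.remove_columns(["word_ids"])
```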

The following error occurred, but I don’t know why.

The error:

     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)
     17
     18 decorate_autocast.__script_unsupported = "@autocast() decorator is not supported in script mode"  # type: ignore[attr-defined]

TypeError: BertForMaskedLM.forward() got an unexpected keyword argument 'masked_token_type_ids'

Where it occurred: the evaluation section.

with torch.no_grad():
    outputs = model(**batch)  # here

loss = outputs.loss
losses.append(accelerator.gather(loss.repeat(batch_size)))

What we tried:

Specifying the inputs ourselves:

inputs = {
    "input_ids": batch["input_ids"],
    "attention_mask": batch["attention_mask"],
    "token_type_ids": batch["token_type_ids"],
}
outputs = model(**inputs)

Using the concrete classes instead of AutoModelForMaskedLM:

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
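
A guess at the cause rather than a confirmed fix: the lesson's insert_random_mask prefixes every key with "masked_", and the rename_columns call afterwards only covers input_ids, attention_mask, and labels. A Japanese BERT tokenizer also emits token_type_ids, so the eval batches end up with a masked_token_type_ids key that BertForMaskedLM.forward() does not accept. A minimal sketch of the extra rename, assuming the eval_dataset variable from the lesson:

eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
        # extra mapping for tokenizers that produce token_type_ids
        "masked_token_type_ids": "token_type_ids",
    }
)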

DefunctDatasetError: Dataset ‘amazon_reviews_multi’ is defunct and no longer accessible due to the decision of data providers

from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
print(english_dataset)

Can I use PyTorch to fine-tune the translation model? I tested it with “Helsinki-NLP/opus-mt-ja-en” and used the sacreBLEU metric to evaluate it, and the score was just 0.08. The original model card says it got 41.7. I glanced at the source code of the model (MarianMT) and found that it uses the TF framework.

Therefore, can I use PyTorch instead of TF to fine-tune the model?
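
For what it's worth, MarianMT checkpoints load in PyTorch through AutoModelForSeq2SeqLM, so a PyTorch fine-tuning setup along the lines of the lesson should work. One common cause of a near-zero BLEU is evaluating without predict_with_generate=True, i.e. scoring raw logits instead of generated translations. A minimal sketch (the output directory and hyperparameters are placeholders):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

checkpoint = "Helsinki-NLP/opus-mt-ja-en"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)  # loads the PyTorch weights

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-ja-en-finetuned",  # placeholder
    predict_with_generate=True,            # evaluate on generated translations, not logits
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)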

DefunctDatasetError: Dataset ‘amazon_reviews_multi’ is defunct and no longer accessible due to the decision of data providers

Are there any plans to update the documentation for the Chapter 7 summarization section?

I will work with another dataset in the meantime but it would be nice to change it for other people taking the course in the future.

I have a question regarding fine-tuning a causal LM (Llama2) with the Trainer. I set the seed to 42, but the train and evaluation losses differ across runs when fine-tuning Llama2 on my dataset. This behavior is only observable with Llama2 (I tried it with Mistral and the loss was always the same). Could this problem be related to the model.generation_config file? Is there any non-deterministic behavior within the training loop?
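
Not a confirmed answer, but seed-setting alone does not guarantee deterministic CUDA kernels, so some variation across runs can come from the GPU ops themselves. transformers has a full-determinism switch you could try. A minimal sketch (output_dir is a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-finetune",  # placeholder
    seed=42,
    data_seed=42,            # also fixes the data-sampling order
    full_determinism=True,   # calls enable_full_determinism(): deterministic CUDA/cuDNN algorithms
)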

My program reports the same error.

Traceback (most recent call last):
  File "/home/ssfc/huggingface-tutorial/7.5.py", line 3, in <module>
    spanish_dataset = load_dataset("amazon_reviews_multi", "es")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ssfc/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2548, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ssfc/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2257, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/home/ssfc/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 382, in __init__
    info.update(self._info())
                ^^^^^^^^^^^^
  File "/home/ssfc/.cache/huggingface/modules/datasets_modules/datasets/amazon_reviews_multi/30d298e0990e8a7e143dc917665441883d7e0b6a64516ef7eea2e89c1af1755c/amazon_reviews_multi.py", line 91, in _info
    raise DefunctDatasetError(
datasets.exceptions.DefunctDatasetError: Dataset 'amazon_reviews_multi' is defunct and no longer accessible due to the decision of data providers

I want to ask if any updates have been made to the text summarization section of the course with regard to the now-defunct dataset it uses.