Chapter 7 questions

Hi.
Project-1:
As part of domain adaptation, I fine-tuned the distilbert-base-uncased model on IMDB, using the tokenizer that ships with distilbert-base-uncased. After 3 epochs of training on a downsampled training set of 10k records, I got an evaluation perplexity of 10.93.
Project-2:

  1. Train a WordPiece tokenizer just from the IMDB reviews (training + unsupervised data only; let's leave out the test set and see how it works). A sketch of this step follows the list.
  2. Develop a Masked Language Model using this tokenizer
  3. Evaluate performance using Perplexity
    After 5 epochs of training on 40k records, I got a perplexity score of 103.52
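A minimal sketch of step 1, assuming the train_new_from_iterator approach from Chapter 6 (the batch size and vocab size below are assumptions, not tuned values):

from datasets import load_dataset
from transformers import AutoTokenizer

imdb = load_dataset("imdb")
old_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def get_training_corpus():
    # Only the train and unsupervised splits; the test split is left untouched
    for split in ("train", "unsupervised"):
        dataset = imdb[split]
        for start in range(0, len(dataset), 1000):
            yield dataset[start : start + 1000]["text"]

new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=30522)
new_tokenizer.save_pretrained("imdb-wordpiece-tokenizer")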

Questions:

  1. Why did I get a higher perplexity score when I started from a custom tokenizer?
  2. Do I need to pre-process the texts in any way that might improve results?
  3. When we have a custom dataset that is different from the original pretraining data, what is the recommendation? For example, terms like "deductible" and "coinsurance" are unlikely to appear in a movie review dataset but very likely in an insurance corpus. Do you suggest we still use the original tokenizer and just focus on fine-tuning?

How large can the text be for extractive question answering?
How can I speed up finding the answer when the text is large?
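For reference, a minimal sketch of how a long text is usually handled: the question-answering pipeline splits the context into overlapping chunks and keeps the best-scoring span, and running it on a GPU is the easiest speed-up (the checkpoint and parameter values below are just examples):

from transformers import pipeline

question = "To whom did the Virgin Mary allegedly appear in 1858?"
long_context = "..."  # your long document here

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    device=0,  # GPU if available; use -1 for CPU
)
result = qa(question=question, context=long_context, doc_stride=128, max_answer_len=50)
print(result["answer"], result["score"])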

Thanks in advance!

Hi, @lewtun, when running this code on colab:

model_checkpoint = "/content/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint,)

I get this error:

---------------------------------------------------------------------------

InvalidArgumentError                      Traceback (most recent call last)

<ipython-input-61-0228496e37cb> in <module>
      3 # Replace this with your own checkpoint
      4 model_checkpoint = "/content/marian-finetuned-kde4-en-to-fr"
----> 5 translator = pipeline("translation", model=model_checkpoint,)
      6 # print(f"original language: {en_enc[100]}")
      7 # print(f"original translation: {fr_enc[100]}")

10 frames

/usr/local/lib/python3.7/dist-packages/transformers/models/marian/modeling_tf_marian.py in call(self, input_ids, inputs_embeds, attention_mask, head_mask, output_attentions, output_hidden_states, return_dict, training)
    785 
    786         if inputs_embeds is None:
--> 787             inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
    788 
    789         embed_pos = self.embed_positions(input_shape)

InvalidArgumentError: Exception encountered when calling layer "encoder" (type TFMarianEncoder).

cannot compute Mul as input #1(zero-based) was expected to be a half tensor but is a float tensor [Op:Mul]

Call arguments received:
  • input_ids=tf.Tensor(shape=(3, 5), dtype=int32)
  • inputs_embeds=None
  • attention_mask=tf.Tensor(shape=(3, 5), dtype=bool)
  • head_mask=None
  • output_attentions=False
  • output_hidden_states=False
  • return_dict=True
  • training=False

can you please help me fix this?

Hi, I am learning Chapter 7 (token classification). The content returned by running the following code is inconsistent with the tutorial, and the context is not decoded. Could you please tell me the reason? I run it in IDEA.

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("squad")
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]
inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

my output:
'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] [UNK] [SEP]'

output from tutorial:
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, '
'the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin '
'Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms '
'upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred '
'Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette '
'Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues '
'and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

Hi! I’ve found what appears to be a contradiction between Chapter 7 of the course and the documentation. When explaining how to build the id2label and label2id dicts, the course states:

id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

That is, both id2label and label2id would be Dict[str, str]. However, according to the documentation, in the section "Parameters for fine-tuning tasks", the same dictionaries are defined as follows:

id2label (Dict[int, str], optional) — A map from index (for instance prediction index, or target index) to label.
label2id (Dict[str, int], optional) — A map from label to index for the model.
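For comparison, the documentation's types would correspond to something like this (a minimal sketch using the same label_names):

id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}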

Am I just missing something, or is there something wrong here?

Thanks for spotting this @clanofsol ! In practice, it doesn’t have any effect whether the IDs are str or int, but I agree we should change the course description for consistency :slight_smile:

I’ll patch a fix for this.

Edit: PR with the fix here


Hi. Thank you for the Translation tutorial. It’s really clear and easy to follow.

Could you give some advice on how to adapt the code in the Colab notebook associated with the tutorial (i.e. Google Colab) for training ByT5 for translation?

Can I simply set the model_checkpoint to "google/byt5-small", change the model name to "byt5-finetuned-kde4-en-to-fr" in the args section, and run the code as is? Or do I also need to adjust the preprocess_function?
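For what it's worth, one quick check before reusing the same max_length values is how much longer byte-level sequences are than subword sequences; a small sketch (the sample string is arbitrary):

from transformers import AutoTokenizer

byt5_tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
marian_tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

sample = "Default to expanded threads"
print(len(byt5_tokenizer(sample)["input_ids"]))    # byte-level: roughly one token per byte
print(len(marian_tokenizer(sample)["input_ids"]))  # subword-level: far fewer tokens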

Thanks in advance for your help!

Hello,

Thanks for the course materials!

In Section 2, you mention that Token Classification is a generic task and Entity Recognition or POS Tagging are examples of a token classification problem.

However, I don’t know if this is the right approach when classifying a group of tokens with fuzzy boundaries (i.e. tokens that aren’t entities but rather some other kind of span within a sequence).

Suppose that I have a dataset with the label question where different human annotators have produced the following data:

I need to know the [ time travel from Earth to Mars ] please
I want to know [ the time travel from Earth to Mars ]
How long does it take to go [ from Earth to Mars ] please give me the answer

Tokens between brackets are labeled question. Is token classification the way to go? If so, is it advisable to change any parameters when training the model?
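For illustration, a hedged sketch of how such spans could be cast as token classification with a BIO scheme (the label names are assumptions, not from the course):

label_names = ["O", "B-QUESTION", "I-QUESTION"]
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in enumerate(label_names)}

# "I want to know [ the time travel from Earth to Mars ]"
tokens = ["I", "want", "to", "know", "the", "time", "travel", "from", "Earth", "to", "Mars"]
ner_tags = [0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2]  # O O O O B-QUESTION I-QUESTION ...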

Thanks!

I'm getting a "name 'targets' is not defined" error with the following line in the "Summarization (PyTorch)" Colab:

labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
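For context, in the notebook that line normally sits inside the preprocessing function, where targets is assigned from the summary column first; running the line on its own raises the NameError. A minimal sketch, assuming tokenizer and raw_datasets are already defined as in the notebook (the column names follow the course's Amazon reviews example and may differ in yours):

max_input_length = 512
max_target_length = 30

def preprocess_function(examples):
    inputs = examples["review_body"]    # source texts
    targets = examples["review_title"]  # reference summaries
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)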

Hi,
I am new to Hugging Face, so I am trying to learn by running the examples given in the Colab notebooks.

I would like to understand some parts of the code better; any help would be greatly appreciated.

The Colab provided in "Fine-tune a pretrained model" produces the error below:


AttributeError                            Traceback (most recent call last)

in <module>
      4 model.eval()
      5 for batch in eval_dataloader:
----> 6     batch = {k: v.to(device) for k, v in batch.items()}
      7     with torch.no_grad():
      8         outputs = model(**batch)

in <dictcomp>(.0)
      4 model.eval()
      5 for batch in eval_dataloader:
----> 6     batch = {k: v.to(device) for k, v in batch.items()}
      7     with torch.no_grad():
      8         outputs = model(**batch)

AttributeError: 'list' object has no attribute 'to'
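In that notebook this error usually means the batches still contain plain Python lists (for example, leftover string columns or labels that were never converted to tensors). A hedged sketch of the preprocessing steps the course applies before building the DataLoader (the column names assume the MRPC example and may differ for your dataset):

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")  # return PyTorch tensors instead of lists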

How does the example script manage to train in a couple of minutes? I have run the script on Google Colab, but the estimated training time shown is more than 2 hours.

I have tried moving the model to the GPU, but I am not sure how to pass the inputs to the device since they are Dataset objects. The example script did not do anything with the GPU, though. What can I do to speed up training?
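For reference, the pattern the course's training loop uses: the Dataset objects are wrapped in a DataLoader, and each batch (a dict of tensors) is moved to the device inside the loop. A minimal sketch, assuming model, tokenized_datasets, data_collator, and optimizer are defined as in the notebook:

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)

model.train()
for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}  # move every tensor to the GPU
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()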

Hello @Rocketknight1,
I believe I have the same issue here: I waited for hours for PushToHubCallback to finish running,
but got no result.

Hi @DenysBarry, can you please clarify which section of this chapter you’re referring to (preferably with a link)? Thanks!

It is the token classification notebook using DistilBERT. The trainer there completed 3 epochs of training in 1 min 45 sec:
[2634/2634 01:45, Epoch 3/3]

It is under the section "Fine-tuning the model".

Hello! I’m working to fine-tune BERT on a sensitive and private dataset. Before fine-tuning, this tutorial asks me to log into the Hugging Face Hub. Why is this?

Also, while fine-tuning with PushToHubCallback and setting up my datasets, how can I make sure that my data and model remain secure and private?

In the "Token classification" part of Chapter 7, when we define the tokenize_and_align_labels function, we assume the parameter will be a batch of examples, whereas in the "Processing the data" section of Chapter 3, when we define the tokenize_function, we assume we will get only one example as the parameter.
Yet we apply these two functions the same way to raw_datasets through map with batched=True.

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

Hello Hugging Face :hugs:!
Thanks to @lewtun and @sgugger for the great conversations in this thread and the provided course.

The SQuADv2.0 dataset introduces impossible questions and tests models on how well they can decide whether a given question has an answer or not.
Chapter 7 focuses only on QA with a SQuAD v1.1-like dataset, without emphasizing the special case that an answer is always present in the context.

Now my question is whether there is a standardized way, using Hugging Face's libraries, to handle this task. Many approaches use, for example, a confidence threshold for the null response (predicting the span [0, 0]) that has to be surpassed in order to actually predict that there is no answer, or they weaken the logit scores by a given factor.
It's hard to find resources on this using the Hugging Face suite of tools.
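For illustration, a hedged sketch of the null-threshold heuristic described above (this is not an official Hugging Face API; it assumes, as in the course's post-processing, that offsets for non-context tokens have been set to None):

import numpy as np

def pick_answer(start_logits, end_logits, offsets, context,
                null_threshold=0.0, n_best=20, max_answer_len=30):
    # Score of predicting "no answer": the span (0, 0) at the [CLS] token
    null_score = start_logits[0] + end_logits[0]
    best_score, best_span = None, None
    for s in np.argsort(start_logits)[-n_best:][::-1]:
        for e in np.argsort(end_logits)[-n_best:][::-1]:
            # Skip spans that fall outside the context or have an invalid shape
            if offsets[s] is None or offsets[e] is None:
                continue
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best_score is None or score > best_score:
                best_score, best_span = score, (s, e)
    # Predict "no answer" unless the best span beats the null score by the threshold
    if best_span is None or best_score - null_score < null_threshold:
        return ""
    s, e = best_span
    return context[offsets[s][0] : offsets[e][1]]

As far as I know, the question-answering pipeline also exposes a handle_impossible_answer flag that applies a similar null-span comparison.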

An additional or updated course chapter on this would also be helpful!
Thank you!

Hello,

I want to use my own dataset for translation; it uses a fictitious language and English. I saw the format for the English-French dataset, with "id" and "translation" as column headers and entries like: "0" { "en": "Lauri Watts", "fr": "Lauri Watts" }

Can I simply follow this format with my own CSV? The first line of the CSV would be "id" and "translation", followed by my data, for example: "0" { "en": "hold still.", "fict": "hagwa yatuka." }
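A minimal sketch, assuming the data is saved as JSON Lines (one object per line) rather than CSV, since nested dicts are awkward to keep in CSV columns; the file name is hypothetical:

# each line: {"id": "0", "translation": {"en": "hold still.", "fict": "hagwa yatuka."}}
from datasets import load_dataset

raw_datasets = load_dataset("json", data_files={"train": "en-fict-train.jsonl"})
print(raw_datasets["train"][0]["translation"]["fict"])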

I got it working, more or less, using another model and have its checkpoint saved on my Google Drive, but I am not sure how I can use it now for inference. This tutorial pushes the model to the Hugging Face Hub and then uses the pipeline for inference, so I was going to do the same, but the dataset issue is holding me back.

Thank you for your help :slight_smile:

If Decoder-only architectures (e.g. GPT-3) are designed for Text Generation tasks, then how are they used for Classification/Translation tasks in these papers (https://arxiv.org/pdf/2102.09690.pdf, https://arxiv.org/pdf/2005.14165.pdf), without even fine-tuning any parameters (e.g. the head)? I mean how can the same model (e.g. GPT-3) be used for various tasks (e.g. text classification or text generation or translation) without any fine-tuning?
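For illustration, a hedged sketch of the in-context (few-shot) prompting those papers rely on: the task is written into the prompt and the model simply continues the text, so no parameters are updated (gpt2 here is only a small stand-in for GPT-3):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Review: The film was a delight.\nSentiment: positive\n"
    "Review: I walked out after twenty minutes.\nSentiment: negative\n"
    "Review: A stunning, heartfelt performance.\nSentiment:"
)
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])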

Hi, I'm getting an error while fine-tuning with Accelerate. I'm following the tutorial code as-is. I'm able to push to the Hub with the Trainer API but not with Accelerate.

Getting error in this code:

from huggingface_hub import Repository
output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

Error:
'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into '.'...
remote: Repository not found
fatal: repository 'https://huggingface.co/sarthakc44/distilbert-base-uncased-finetuned-imdb-accelerate/' not found
Error(s) during clone:
git clone failed: exit status 128
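In case it helps narrow things down, a hedged sketch following the course's Accelerate sections: the repository has to exist on the Hub before Repository can clone it, so it is created first (the model name below is the one from the error message):

from huggingface_hub import Repository, create_repo, get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)  # e.g. "sarthakc44/distilbert-base-uncased-finetuned-imdb-accelerate"
create_repo(repo_name, exist_ok=True)       # no-op if the repo already exists on the Hub
repo = Repository(model_name, clone_from=repo_name)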