Hi.
Project-1:
As part of domain adaptation, I fine-tuned the distilbert-base-uncased model on IMDB, using the tokenizer that ships with distilbert-base-uncased as is. After 3 epochs of training on a downsampled training set of 10k records, the perplexity score on the evaluation set was 10.93.
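(For reference, the perplexity reported here is presumably just the exponential of the evaluation cross-entropy loss, as in the course's masked-language-modeling section; a self-contained arithmetic sketch, with the loss value picked to reproduce the reported number:)

import math

# Perplexity is exp(cross-entropy loss); an eval loss of about 2.3915
# corresponds to the perplexity reported above.
eval_loss = 2.3915  # hypothetical value, chosen to match the number above
print(f"Perplexity: {math.exp(eval_loss):.2f}")  # -> roughly 10.93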
Project-2:
Train a WordPiece tokenizer just from IMDB reviews (training + unsupervised data only; let's leave out the test split and see how it works)
Develop a Masked Language Model using this tokenizer
Evaluate performance using Perplexity
After 5 epochs of training on 40k records, I got a perplexity score of 103.52
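For reference, a minimal sketch of how such a setup can be wired together with the datasets/transformers APIs; the batch iterator, vocabulary size, and output path are assumptions, not the exact code used:

from datasets import load_dataset
from transformers import AutoTokenizer

# IMDB train + unsupervised splits only; the test split is held out as described above.
imdb = load_dataset("imdb")
corpus = imdb["train"]["text"] + imdb["unsupervised"]["text"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Reuse DistilBERT's WordPiece algorithm and settings, but learn a new vocabulary from IMDB.
old_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=30522)
new_tokenizer.save_pretrained("imdb-wordpiece-tokenizer")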
Questions:
Why did I get a higher perplexity score when I started from a custom tokenizer?
Do I need to pre-process the texts in any way that might improve results?
When we have a custom dataset that is different from the original dataset, what is the recommendation? For example, you are unlikely to find terms like "deductible" or "coinsurance" in a movie review dataset, but very likely in an insurance corpus. Do you suggest we still use the original tokenizer and just focus on fine-tuning?
Hi, I am learning Chapter 7 (Token classification). The output returned by running the following code is inconsistent with the tutorial, and the content of the context is not decoded. Could you please tell me the reason? I run the code in IDEA.
from datasets import load_dataset
from transformers import AutoTokenizer
My output:
'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] [UNK] [SEP]'
Output from the tutorial:
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, '
'the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin '
'Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms '
'upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred '
'Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette '
'Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues '
'and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'
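The code above looks truncated to its two imports; presumably the rest of the snippet follows the course's question-answering preprocessing, something like the reconstruction below (continuing from the imports shown; the checkpoint name is an assumption):

raw_datasets = load_dataset("squad")
# The tutorial output above comes from a cased checkpoint; an uncased one
# would lowercase everything, as in "My output".
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]
inputs = tokenizer(question, context)
print(tokenizer.decode(inputs["input_ids"]))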
Hi! I've found what appears to be a contradiction between Chapter 7 of the course and the documentation. When explaining how to build both the id2label and label2id dicts, the course states:
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
That is, id2label would be Dict[str, str] and label2id would be Dict[str, str]. However, according to the documentation, in the "Parameters for fine-tuning tasks" section, the same dictionaries are defined as follows:
id2label (Dict[int, str], optional) — A map from index (for instance prediction index, or target index) to label.
label2id (Dict[str, int], optional) — A map from label to index for the model.
Am I just missing something, or is there something wrong here?
Thanks for spotting this @clanofsol! In practice, it doesn't have any effect whether the IDs are str or int, but I agree we should change the course description for consistency.
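For reference, a version that matches the documented types would look like this (the label list here is a made-up example; the course builds it from the dataset features):

label_names = ["O", "B-PER", "I-PER"]  # example only

id2label = {i: label for i, label in enumerate(label_names)}   # Dict[int, str]
label2id = {label: i for i, label in enumerate(label_names)}   # Dict[str, int]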
Hi. Thank you for the Translation tutorial. It’s really clear and easy to follow.
Could you give some advice on how to adapt the code in the Colab notebook associated with the tutorial (i.e. Google Colab) for training ByT5 for translation?
Can I simply set the "model_checkpoint" to "google/byt5-small", change the model name to "byt5-finetuned-kde4-en-to-fr" in the args section, and run the code as is? Or do I also need to adjust the "preprocess_function"?
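For what it's worth, a sketch of the minimal changes, assuming the notebook's general structure; the larger max_length is an assumption (ByT5 works on bytes, so sequences are much longer than with subword tokenizers), and the tokenizer call uses the current text_target API rather than the older as_target_tokenizer pattern:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = "google/byt5-small"
# AutoTokenizer returns ByT5's byte-level tokenizer, so the preprocess_function
# can keep the same shape; only the length budget likely needs to grow.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

max_length = 512  # assumed; counted in bytes, not subwords

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    return tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True)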
In Section 2, you mention that Token Classification is a generic task and Entity Recognition or POS Tagging are examples of a token classification problem.
However, I don't know whether this is the right approach when classifying a group of tokens with fuzzy boundaries (i.e. spans that aren't entities but some other kind of segment within a sequence).
Suppose that I have a dataset with the label question where different human annotators have produced the following data:
I need to know the [ time travel from Earth to Mars ] please
I want to know [ the time travel from Earth to Mars ]
How long does it take to go [ from Earth to Mars ] please give me the answer
Tokens between brackets are labeled using question. Is Token Classification the way to go? If so, is it advisable to change any parameter when training the model?
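For illustration, one way to cast this as token classification is a B-/I-/O scheme over the bracketed spans; the label names below are made up:

# Hypothetical BIO encoding of the first annotated example above.
tokens = ["I", "need", "to", "know", "the", "time", "travel", "from", "Earth", "to", "Mars", "please"]
labels = ["O", "O", "O", "O", "O", "B-QUESTION", "I-QUESTION", "I-QUESTION",
          "I-QUESTION", "I-QUESTION", "I-QUESTION", "O"]

label_names = ["O", "B-QUESTION", "I-QUESTION"]
label2id = {label: i for i, label in enumerate(label_names)}
label_ids = [label2id[label] for label in labels]
print(label_ids)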
Hi,
I am new to Hugging Face, so I am trying to learn by running the examples given in the Colab notebooks.
I would like to understand some parts of the code better; could you please help me? Your help will be greatly appreciated.
The Colab provided in "Fine-tune a pretrained model" produces the error below:
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
      4 model.eval()
      5 for batch in eval_dataloader:
----> 6     batch = {k: v.to(device) for k, v in batch.items()}
      7     with torch.no_grad():
      8         outputs = model(**batch)

<ipython-input> in <dictcomp>(.0)
----> 6     batch = {k: v.to(device) for k, v in batch.items()}

AttributeError: 'list' object has no attribute 'to'
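This error usually means the batches still contain Python lists (for instance the raw text column) instead of tensors. A sketch of the usual fix, assuming the tutorial's tokenized_datasets and a raw "text" column (the column names depend on the dataset used):

# Drop the raw string columns and switch the dataset format to PyTorch tensors
# before building the DataLoader.
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
# After this, every value in a batch is a tensor, so batch = {k: v.to(device) ...} works.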
How does the example script manage to train in a couple of minutes? I have run the script on Google Colab, but the estimated training time shown is more than 2 hours.
I have tried moving the model to the GPU, but I am not sure how to move the inputs to the device, since they are Dataset objects. The example script did not do anything with the GPU, though. What should I do to speed up training?
It is the token classification notebook using DistilBERT. The trainer there completed 3 epochs of training in 1 min 45 sec:
[2634/2634 01:45, Epoch 3/3]
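The Trainer moves the model and each batch to the GPU automatically when one is available, so a multi-hour estimate usually just means the Colab runtime has no GPU attached; a quick check:

import torch

# If this prints False, switch the runtime type in Colab
# (Runtime -> Change runtime type -> Hardware accelerator: GPU) and rerun the notebook.
print(torch.cuda.is_available())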
Hello! I'm working to fine-tune BERT on a sensitive and private dataset. Before fine-tuning, this tutorial asks me to log in to the Hugging Face Hub. Why is this?
Also, while fine-tuning with PushToHubCallback and setting up my datasets, how can I make sure that my data and model remain secure and private?
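If pushing the model is wanted at all, one option (the repo name below is a placeholder) is to create the target repository as private before training, so anything PushToHubCallback uploads stays visible only to you; as far as I know, the callback uploads the model and tokenizer files, not the dataset, and skipping the callback keeps everything local.

from huggingface_hub import create_repo

# Create the model repo as private up front; subsequent pushes from the callback
# land in this private repo.
create_repo("your-username/bert-finetuned-private", private=True, exist_ok=True)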
In the "Token classification" part of Chapter 7, when we define the tokenize_and_align_labels function, we assume that the parameter will be a batch of examples, whereas in the "Processing the data" section of Chapter 3, when we define the tokenize_function, we assume we will get only one example as the parameter.
Yet we apply these two functions in the same way to raw_datasets through the map function with batched=True.
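For what it's worth, the shape of the argument is decided by the batched flag on map, not by the function itself: with batched=True each field holds a list covering several examples, with the default batched=False it holds a single value. The Chapter 3 function presumably works in both modes because the tokenizer accepts either one string or a list of strings. A small self-contained sketch:

from datasets import Dataset

ds = Dataset.from_dict({"text": ["first example", "second example"]})

def per_example(example):
    # batched=False (default): example["text"] is a single string
    return {"n_chars": len(example["text"])}

def per_batch(batch):
    # batched=True: batch["text"] is a list of strings
    return {"n_chars": [len(t) for t in batch["text"]]}

print(ds.map(per_example)["n_chars"])              # [13, 14]
print(ds.map(per_batch, batched=True)["n_chars"])  # [13, 14]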
Hello Hugging Face !
Thanks to @lewtun and @sgugger for the great conversations in this thread and the provided course.
The SQuAD v2.0 dataset introduces impossible questions and tests models on how well they can decide whether a given question has an answer or not.
Chapter 7 only focuses on QA with a SQuAD v1.1-like dataset, without emphasizing the special case of there always being an answer present in the context.
Now my question is whether there is a standardized way, using Hugging Face's libraries, to handle this task. Many approaches use, for example, a confidence threshold for the null response (predicting the span [0, 0]) that has to be surpassed in order to actually predict that there is no answer, or they weaken the logit scores by a given factor.
It's hard to find resources on this using the Hugging Face suite of tools.
Also, another or an updated course chapter on this would be helpful!
Thank you!
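For what it's worth, there is no single built-in switch in the course's post-processing, but the common pattern is the one described above: compare the null score (the start and end logits of the [CLS] position) against the best non-null span score, with a tunable threshold. A rough, self-contained sketch over made-up logits for one feature:

import numpy as np

# Hypothetical logits for a single feature; index 0 is the [CLS] token.
start_logits = np.array([3.1, 0.2, 5.0, 1.4])
end_logits = np.array([2.8, 0.1, 1.2, 4.9])

null_score = start_logits[0] + end_logits[0]  # score of the "no answer" prediction (span [0, 0])
best_span_score = max(
    start_logits[i] + end_logits[j]
    for i in range(1, len(start_logits))
    for j in range(i, len(end_logits))
)

null_threshold = 0.0  # tuned on a dev set in practice
if null_score - best_span_score > null_threshold:
    print("no answer")
else:
    print("predict the best non-null span")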
I want to use my own dataset for translation; it uses a fictitious language and English. I saw the format of the English-French dataset, which has "id" and "translation" as column headers and rows like: "0" { "en": "Lauri Watts", "fr": "Lauri Watts" }
Can I simply follow this format with my own CSV? The first line in the CSV would be "id" and "translation", followed by my data, for example: "0" { "en": "hold still.", "fict": "hagwa yatuka." }
I got it working using another model, more or less, and have its checkpoint saved on my Google Drive, but I am not sure how I can use it now for inference. This tutorial pushes the model to the Hugging Face Hub and then uses the pipeline for inference, so I was going to do that instead, but the dataset issue is holding me back.
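On the dataset side, one option is to skip CSV entirely and write JSON lines with a nested translation field, which the json loader reads directly; the file name and the language code "fict" are placeholders:

from datasets import load_dataset

# data.jsonl contains one object per line, e.g.:
# {"id": "0", "translation": {"en": "hold still.", "fict": "hagwa yatuka."}}
raw_datasets = load_dataset("json", data_files={"train": "data.jsonl"})
print(raw_datasets["train"][0]["translation"]["en"])

# For inference from the checkpoint saved on Drive (path is a placeholder,
# assuming a seq2seq checkpoint like the one in the tutorial):
# from transformers import pipeline
# translator = pipeline("translation", model="/content/drive/MyDrive/my-checkpoint")
# print(translator("hold still."))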
If decoder-only architectures (e.g. GPT-3) are designed for text generation tasks, then how are they used for classification/translation tasks in these papers (https://arxiv.org/pdf/2102.09690.pdf, https://arxiv.org/pdf/2005.14165.pdf) without even fine-tuning any parameters (e.g. the head)? I mean, how can the same model (e.g. GPT-3) be used for various tasks (text classification, text generation, or translation) without any fine-tuning?
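A toy sketch of the mechanism those papers rely on (in-context learning: the task is expressed entirely in the prompt and no weights are updated), using GPT-2 as a small stand-in for GPT-3:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot classification phrased as a text-completion problem; the same model
# would do translation just by changing the examples in the prompt.
prompt = (
    "Review: the film was wonderful. Sentiment: positive\n"
    "Review: a complete waste of time. Sentiment: negative\n"
    "Review: I loved every minute of it. Sentiment:"
)
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])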
Hi, I'm getting an error while fine-tuning with Accelerate. I'm following the tutorial code as is. I'm able to push to the Hub with the Trainer API, but not with Accelerate.