Hi.
Project-1:
As part of domain adaptation, I fine-tuned the distilbert-base-uncased model on IMDB, using the tokenizer as it comes with distilbert-base-uncased. After 3 epochs of training on a downsampled training set of 10k records, I got an evaluation perplexity of 10.93.
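For reference, the perplexity reported here is just the exponential of the mean evaluation cross-entropy loss. A tiny sketch, where the loss value is a hypothetical number chosen only to land near the reported score:

```python
import math

eval_loss = 2.39  # hypothetical mean cross-entropy loss from the evaluation loop
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # roughly 10.91, in the same range as the reported 10.93
```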
Project-2:
Train a WordPiece tokenizer from IMDB reviews only (the training + unsupervised splits; let's hold out the test split and see how it works)
Develop a Masked Language Model using this tokenizer
Evaluate performance using Perplexity
After 5 epochs of training on 40k records, I got a perplexity score of 103.52
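A minimal sketch of the tokenizer-training step above, assuming the `tokenizers` library and using a toy in-memory corpus in place of the real IMDB train + unsupervised splits:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus standing in for the IMDB train + unsupervised reviews
corpus = [
    "This movie was a complete waste of time.",
    "A beautiful, moving film with terrific acting.",
]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=30522,  # same size as the BERT vocab, for comparability
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.encode("terrific film").tokens)
```

The trained tokenizer can then be wrapped in a PreTrainedTokenizerFast and plugged into the usual MLM training loop.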
Questions:
Why did I get a higher perplexity score when I started from a custom tokenizer?
Do I need to pre-process the texts in any way that might improve results?
When we have a custom dataset that differs from the original pre-training corpus, what is the recommendation? For example, terms like deductible and coinsurance are unlikely to appear in a movie-review dataset but very likely in an insurance corpus. Do you suggest we still use the original tokenizer and just focus on fine-tuning?
Hi, I am learning Chapter 7, Token Classification. The output I get from running the following code is inconsistent with the tutorial, and the context is not decoded. Could you please tell me why? I am running it in IDEA.
from datasets import load_dataset
from transformers import AutoTokenizer
my output:
'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] [UNK] [SEP]'
output from tutorial:
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, '
'the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin '
'Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms '
'upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred '
'Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette '
'Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues '
'and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'
Hi! I've found what appears to be a contradiction between Chapter 7 of the course and the documentation. When explaining how to build the id2label and label2id dicts, the course states:
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
That is, id2label would be Dict[str, str] and label2id would also be Dict[str, str]. However, according to the documentation, in the section on parameters for fine-tuning tasks, the same dictionaries are defined as follows:
id2label (Dict[int, str], optional): A map from index (for instance prediction index, or target index) to label.
label2id (Dict[str, int], optional): A map from label to index for the model.
Am I just missing something, or is something wrong here?
Thanks for spotting this @clanofsol! In practice, it doesn't have any effect whether the IDs are str or int, but I agree we should change the course description for consistency.
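For what it's worth, the int-keyed variant that matches the documentation's signatures would look like this (the label_names list is just an example):

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

id2label = {i: label for i, label in enumerate(label_names)}  # Dict[int, str]
label2id = {label: i for i, label in id2label.items()}        # Dict[str, int]

print(id2label[1], label2id["B-ORG"])
```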
Hi. Thank you for the Translation tutorial. It's really clear and easy to follow.
Could you give some advice on how to adapt the code in the Colab notebook associated with the tutorial (i.e. Google Colab) for training ByT5 for translation?
Can I simply set the model_checkpoint to "google/byt5-small", change the model name to "byt5-finetuned-kde4-en-to-fr" in the args section, and run the code as is? Or do I also need to adjust the preprocess_function?
In Section 2, you mention that Token Classification is a generic task and Entity Recognition or POS Tagging are examples of a token classification problem.
However, I don't know if this is the right approach when classifying a group of tokens with fuzzy boundaries (i.e. tokens which aren't entities but could be anything else within a sequence).
Suppose that I have a dataset with the label question where different human annotators have produced the following data:
- I need to know the [ time travel from Earth to Mars ] please
- I want to know [ the time travel from Earth to Mars ]
- How long does it take to go [ from Earth to Mars ] please give me the answer
Tokens between brackets are labeled question. Is token classification the way to go? If so, is it advisable to change any parameters when training the model?
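Token classification can represent such spans; a sketch of how one of the annotated utterances above could be encoded with a BIO scheme, where question replaces the usual entity types (the token split and span indices are illustrative):

```python
tokens = ["I", "need", "to", "know", "the", "time", "travel",
          "from", "Earth", "to", "Mars", "please"]
# The bracketed span "time travel from Earth to Mars" covers indices 5..10
span_start, span_end = 5, 10

labels = ["O"] * len(tokens)
labels[span_start] = "B-question"          # first token of the span
for i in range(span_start + 1, span_end + 1):
    labels[i] = "I-question"               # remaining tokens inside the span

print(list(zip(tokens, labels)))
```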
Hi,
I am new to Hugging Face, so I am trying to learn by running the examples given in the Colab notebooks.
I would like to understand some parts of the code; could you please help me understand it better? Your help will be greatly appreciated.
The Colab provided in "Fine-tune a pretrained model" produces the errors below:
AttributeError                            Traceback (most recent call last)
      4 model.eval()
      5 for batch in eval_dataloader:
----> 6     batch = {k: v.to(device) for k, v in batch.items()}
      7     with torch.no_grad():
      8         outputs = model(**batch)

AttributeError: 'list' object has no attribute 'to'
How does the example script manage to train in a couple of minutes? I have run the script on google colab but the estimated training time is shown to be more than 2 hours.
I have tried moving the model to the GPU, but I am not sure how to move the inputs to the device, as they are Dataset objects. The example script did not do anything with the GPU, though. What should I do to speed up training?
It is the token classification notebook using DistilBERT. The Trainer there completed 3 epochs of training in 1 min 45 s:
[2634/2634 01:45, Epoch 3/3]
It is under the section "Fine-tuning the model".
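On the device question: the Trainer used in that notebook moves the model and each batch to the GPU automatically, which is why the example script never touches the device explicitly. If you write the loop yourself, the pattern looks like this (the model and dataloader below are tiny stand-ins for the notebook's):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the real model and tokenized dataset
model = nn.Linear(4, 2).to(device)
loader = DataLoader(TensorDataset(torch.randn(8, 4)), batch_size=4)

for (inputs,) in loader:
    inputs = inputs.to(device)  # each batch must be moved to the device too
    logits = model(inputs)

print(logits.shape)
```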
Hello! I'm working to fine-tune BERT on a sensitive and private dataset. Before fine-tuning, this tutorial asks me to log into the Hugging Face Hub. Why is this?
Also, while fine-tuning with PushToHubCallback and setting up my datasets, how can I make sure that my data and model remain secure and private?
In the "Token classification" part of Chapter 7, when we define the tokenize_and_align_labels function, we assume the parameter is a batch of examples, whereas in the "Processing the data" section of Chapter 3, the tokenize function we define assumes it receives only one example as a parameter.
Yet we apply both functions to raw_datasets in the same way, via map with batched=True.
Hello Hugging Face !
Thanks to @lewtun and @sgugger for the great conversations in this thread and the provided course.
The SQuADv2.0 dataset introduces impossible questions and tests models on how well they can decide whether a given question has an answer or not.
Chapter 7 only covers QA with a SQuAD-v1.1-like dataset, without addressing this special case: there, an answer is always present in the context.
Now my question is whether there is a standardized way, using Hugging Face's libraries, to handle this task. Many approaches use, for example, a confidence threshold that the null response (predicting the span [0, 0]) has to surpass in order to actually predict that there is no answer, or they weaken the logit scores by a given factor.
It's hard to find resources on this within the Hugging Face suite of tools.
Another course chapter, or an updated one, covering this would also be helpful!
Thank you!
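On the thresholding idea mentioned above, here is a self-contained sketch of the comparison (the logits and the chosen best non-null span are made-up numbers, and the threshold would be tuned on a dev set in practice):

```python
start_logits = [1.2, 0.1, 3.0, 0.5]  # index 0 is the [CLS] token
end_logits   = [0.8, 0.2, 0.4, 2.5]

# Score of the null answer: the span (0, 0) on the [CLS] token
null_score = start_logits[0] + end_logits[0]

# Assume the best *valid* non-null span found by the usual n-best search is (2, 3)
best_non_null_score = start_logits[2] + end_logits[3]

threshold = 0.0  # tuned on a validation set in practice
if null_score - best_non_null_score > threshold:
    prediction = ""  # SQuAD v2 convention: the empty string means "no answer"
else:
    prediction = "span (2, 3)"

print(prediction)
```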
I want to use my own dataset for translation; it pairs a fictitious language with English. I saw the format for the English-French dataset, with "id" and "translation" as column headers and rows in the following format: "0" { "en": "Lauri Watts", "fr": "Lauri Watts" }
Can I simply follow this format with my own CSV, the first line being "id" and "translation", followed by my data, for example: "0" { "en": "hold still.", "fict": "hagwa yatuka." }
I got it working, more or less, with another model, and have its checkpoint saved on my Google Drive, but I am not sure how to use it for inference now. This tutorial pushes the model to the Hugging Face Hub and then uses a pipeline for inference, so I was going to do the same, but the dataset issue is holding me back.
If decoder-only architectures (e.g. GPT-3) are designed for text generation tasks, how are they used for classification and translation tasks in these papers (https://arxiv.org/pdf/2102.09690.pdf, https://arxiv.org/pdf/2005.14165.pdf) without fine-tuning any parameters (e.g. the head)? That is, how can the same model (e.g. GPT-3) be used for various tasks (text classification, text generation, translation) without any fine-tuning?
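The trick in those papers is to cast every task as text generation via prompting: the task (optionally with a few solved examples) is written into the prompt, and the model's continuation is read off as the answer, with no weight updates. A schematic few-shot classification prompt (the reviews are made up):

```python
prompt = (
    "Review: A wonderful, moving film.\n"
    "Sentiment: positive\n"
    "Review: Dull, predictable, and far too long.\n"
    "Sentiment: negative\n"
    "Review: I loved every minute of it.\n"
    "Sentiment:"
)
# A decoder-only model simply continues the text; whichever label it
# generates next (" positive" or " negative") is taken as the classification.
print(prompt)
```

Translation works the same way, with source/target sentence pairs in place of review/label pairs.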
Hi, I'm getting an error while fine-tuning with Accelerate. I'm following the tutorial code as is. I'm able to push to the Hub with the Trainer API, but not with Accelerate.