Tutorial: Fine-tuning with custom datasets – sentiment, NER, and question answering

joeddav · August 17, 2020, 2:01pm

Interested in fine-tuning on your own custom datasets but unsure how to get going? I just added a tutorial to the docs with several examples that each walk you through downloading a dataset, preprocessing & tokenizing, and training with either Trainer, native PyTorch, or native TensorFlow 2.

Examples include:

Sequence classification (sentiment) – IMDb
Token classification (NER) – W-NUT Emerging and Rare entities
Question answering (span selection) – SQuAD 2.0

Click the Open in Colab button at the top to open a Colab notebook in either TF or PT. This tutorial demonstrates one workflow for working with custom datasets, but there are many valid ways to accomplish the same thing. The intention is to be demonstrative rather than definitive. Also, we highly recommend you check out and contribute to our NLP datasets & metrics library for easy access to 150+ datasets.

Tutorial: https://huggingface.co/transformers/master/custom_datasets.html

Feedback and questions welcome!

rbint · August 17, 2020, 8:17pm

I spotted a minor typo. “…which we can use for for evaluation and tuning without taining our test set results.” I believe you meant to say tainting.

Otherwise, great tutorial. I’m looking forward to digging in more.

smalltoken · August 17, 2020, 9:07pm

Thanks for contributing!
May I ask: if I don’t have any labels or relationships between sentences, could I fine-tune a BERT model with the masked language modeling task?

joeddav · August 17, 2020, 10:01pm

Of course! At the bottom of the tutorial we actually link to a blog post that shows you how to do just that https://huggingface.co/blog/how-to-train
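
For readers wondering what the masked language modeling objective needs from unlabeled text, here is a minimal pure-Python sketch of BERT-style masking. It is an illustration only, not the code from the linked blog post: the function name `mask_tokens`, the toy token ids, and the `mask_id` and `vocab_size` values are all assumptions made for the demo. In practice the library's `DataCollatorForLanguageModeling` takes care of this step.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) for one sequence of token ids.

    Roughly 15% of positions are selected; of those, about 80% become the
    mask token, 10% a random token, and 10% are left unchanged.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, original in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected: no loss is computed here
        labels[i] = original  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)
        # otherwise the original token stays in place
    return inputs, labels


# Toy ids; 103 is a stand-in mask id and 30522 a stand-in vocabulary size.
toy_ids = list(range(1000, 1040))
masked, labels = mask_tokens(toy_ids, mask_id=103, vocab_size=30522, seed=0)
```

The point of the sketch is that the labels come from the text itself, so no annotation is needed: every selected position is labeled with its original token and every other position is ignored by the loss.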

stefan-jo · August 18, 2020, 3:19pm

Thanks a lot for creating the tutorial @joeddav!

I ran into an issue with the tokenizer. It seems like I cannot just pass my list of texts to the tokenizer like in the tutorial. Am I doing something wrong?

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(test_texts, truncation=True, padding=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-77-d270a8ea6e50> in <module>
----> 1 train_encodings = tokenizer(train_texts, truncation=True, padding=True)
  2 val_encodings = tokenizer(test_texts, truncation=True, padding=True)

TypeError: 'DistilBertTokenizerFast' object is not callable

rgwatwormhill · August 18, 2020, 3:38pm

Hi stefan-jo

What version of transformers are you using? You might need version 3. See this issue: https://github.com/huggingface/transformers/issues/5931, which says that transformers 2.3.0 does not have callable tokenizers.
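
A quick way to check for this before running the tutorial code is to look at the major version number. The sketch below assumes, following the reply above and the linked issue, that calling the tokenizer object directly needs a 3.x or later release; the helper name `has_callable_tokenizers` is made up for illustration.

```python
def has_callable_tokenizers(version: str) -> bool:
    """Return True if the given transformers version string is 3.x or later."""
    major = int(version.split(".")[0])
    return major >= 3


# Version strings mentioned in this thread.
old_ok = has_callable_tokenizers("2.3.0")  # the release named in the linked issue
new_ok = has_callable_tokenizers("4.5.1")  # a later release used further down the thread
```

In a notebook, the string to pass in would be `transformers.__version__`. After upgrading, the kernel has to be restarted for the new version to be picked up.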

stefan-jo · August 18, 2020, 3:50pm

Yes, that was it. I installed from source, restarted the kernel and now it’s working

Thanks @rgwatwormhill

stefan-jo · August 19, 2020, 9:53am

I have two (very basic) questions:

1. I suppose in the tutorial the entire model is being fine-tuned at once. Is there an easy way to first train only the classification head and only then unfreeze the entire model?
2. Is the classification head in BertForSequenceClassification pre-trained or initialized randomly on top of BertModel? If pre-trained, which task/dataset has been used for pre-training?

Note: I’ve been using BERT instead of DistilBERT, but I guess the same applies to both.

sgugger · August 19, 2020, 12:37pm

For 1, you can look in the training tutorial where there is an example in PyTorch.
For 2, the head is initialized randomly, since we are using a checkpoint of the base model. It would be pretrained if we used a checkpoint that has been fine-tuned for sequence classification, like distilbert-base-uncased-finetuned-sst-2-english.
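
The two-stage idea from question 1 can be sketched in plain PyTorch by toggling `requires_grad` on the encoder parameters. The `TinyClassifier` below is a hypothetical stand-in so the snippet runs without downloading a checkpoint, and `set_base_trainable` is an illustrative helper rather than a library function. Transformers PyTorch models expose their encoder as `model.base_model`, so the same loop should carry over to them.

```python
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in with the same two parts: an encoder body and a head."""

    def __init__(self, hidden=8, num_labels=2):
        super().__init__()
        self.base_model = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.base_model(x))


def set_base_trainable(model, trainable):
    """Freeze (False) or unfreeze (True) every parameter of the encoder body."""
    for param in model.base_model.parameters():
        param.requires_grad = trainable


model = TinyClassifier()

# Stage 1: only the randomly initialized head receives gradient updates.
set_base_trainable(model, False)
head_only = [name for name, p in model.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the body and continue fine-tuning the whole model.
set_base_trainable(model, True)
all_params = [name for name, p in model.named_parameters() if p.requires_grad]
```

Between the two stages you would run a training loop (or a `Trainer`) as usual; only the set of trainable parameters changes.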

stefan-jo · August 19, 2020, 12:58pm

Thank you for your answer! I’ll check out the tutorial.

abdallah197 · November 19, 2020, 3:50pm

Hi @sgugger, @joeddav. I am using the Trainer class for the NER task together with a compute_metrics function like this:

# def compute_metrics(pred):
#     labels = pred.label_ids
#     preds = pred.predictions.argmax(-1)

#     print(labels.shape == preds.shape)

#     precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
#     # acc = accuracy_score(labels, preds)
#     return {
#         'f1': f1,
#         'precision': precision,
#         'recall': recall
#     }

I always run into this error:
ValueError: Classification metrics can't handle a mix of multiclass-multioutput and multilabel-indicator targets
Solutions online said that the labels and predictions might not be the same shape, but this is not the case here. The rest of the code is exactly like the tutorial.

sgugger · November 19, 2020, 4:00pm

Hi there. You should use metrics designed for NER. For instance the package seqeval has some that will work directly. Check the run_ner script to see how it’s used in compute_metrics.
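
As a rough illustration of what that involves: in token classification the labels and predictions are 2-D (one row per sentence), which scikit-learn treats as multi-output targets, and the `-100` positions have to be dropped before scoring. The sketch below is not the `run_ner` code; the helper name `align_for_seqeval`, the tag set, and the toy batch are assumptions made for the demo.

```python
IGNORE_INDEX = -100  # label id given to special tokens and sub-word pieces


def align_for_seqeval(pred_ids, label_ids, id2tag):
    """Drop ignored positions and map ids to tag strings, one list per sentence."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        pred_tags.append([id2tag[p] for p, _ in kept])
        true_tags.append([id2tag[l] for _, l in kept])
    return true_tags, pred_tags


# Hypothetical tag set and toy batch (predictions already argmax-ed).
id2tag = {0: "O", 1: "B-person", 2: "I-person"}
label_ids = [[-100, 1, 2, -100, 0, -100], [-100, 0, 0, -100, -100, -100]]
pred_ids = [[0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

true_tags, pred_tags = align_for_seqeval(pred_ids, label_ids, id2tag)

try:
    # seqeval is an optional third-party package (pip install seqeval).
    from seqeval.metrics import f1_score, precision_score, recall_score

    print({
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
    })
except ImportError:
    print("seqeval not installed; aligned tags:", true_tags, pred_tags)
```

Inside `compute_metrics`, the inputs would be `pred.predictions.argmax(-1).tolist()` and `pred.label_ids.tolist()`.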

g3casey · April 8, 2021, 2:44am

Thanks for this tutorial @joeddav. I have reviewed your W-NUT example a few times.
I was wondering if you could point me to a similar example that demonstrates how to add new labels to the classification. For example, I would like to classify address information.
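
This question received no reply in the thread. One possible starting point, sketched under the assumption that the labels are looked up through a `tag2id` mapping (as in the `encode_tags` function quoted later in this thread), is to derive that mapping from your own annotations. The address tags below are invented for illustration, and `build_tag_maps` is a hypothetical helper.

```python
def build_tag_maps(tags):
    """Build tag<->id lookups from per-token tag lists, one list per document."""
    unique_tags = sorted({tag for doc in tags for tag in doc})
    tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
    id2tag = {idx: tag for tag, idx in tag2id.items()}
    return tag2id, id2tag


# Hypothetical address annotations in BIO format (not from any real dataset).
train_tags = [
    ["O", "B-street", "I-street", "O", "B-city"],
    ["B-city", "O", "B-postcode"],
]
tag2id, id2tag = build_tag_maps(train_tags)
num_labels = len(tag2id)
```

The resulting `num_labels` would then be passed as `num_labels=num_labels` when loading a token classification model with `from_pretrained`, so that the randomly initialized head has one output per tag.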

abercher · April 17, 2021, 12:54pm

Hello everyone,
I’m trying to reproduce the IMDb sentiment analysis model of the tutorial. I already had the data on my personal machine in a slightly different form, but after transforming it into a list of strings, each one containing a review, it should be the same. But I get an error. I don’t know if I did something wrong or if the library changed since the tutorial was created.
My code looks like this (the script isn’t finished as it doesn’t contain the evaluation part but the training is already failing):

import os
import pandas as pd
import pickle
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
from torch import cuda
from torch.utils.data import Dataset
import time
from transformers.integrations import TensorBoardCallback

from train_and_eval_lstm import print_evaluation_scores

device = 'cuda' if cuda.is_available() else 'cpu'


def main():
    clean_text_train_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_train.csv")
    df_clean_text_train = pd.read_csv(clean_text_train_fn)
    clean_text_train = df_clean_text_train["clean_text"].tolist()
    clean_text_valid_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_valid.csv")
    df_clean_text_valid = pd.read_csv(clean_text_valid_fn)
    clean_text_valid = df_clean_text_valid["clean_text"].tolist()
    clean_text_test_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_test.csv")
    df_clean_text_test = pd.read_csv(clean_text_test_fn)
    clean_text_test = df_clean_text_test["clean_text"].tolist()


    ## Load binary labels
    y_binary_train_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_train.pkl')
    with open(y_binary_train_fn, mode='rb') as f:
        y_binary_train = pickle.load(f)
    y_binary_valid_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_valid.pkl')
    with open(y_binary_valid_fn, mode='rb') as f:
        y_binary_valid = pickle.load(f)
    y_binary_test_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_test.pkl')
    with open(y_binary_test_fn, mode='rb') as f:
        y_binary_test = pickle.load(f)

    ## Using pretrained Tokenizer
    model_name = 'distilbert-base-uncased'
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

    start = time.time()
    train_encodings = tokenizer(clean_text_train, truncation=True, padding=True)
    stop = time.time()
    print(f"Time to tokenize training set: {stop - start}")
    val_encodings = tokenizer(clean_text_valid, truncation=True, padding=True)
    test_encodings = tokenizer(clean_text_test, truncation=True, padding=True)

    class IMDbDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    n_toy = 500
    toy_dataset = IMDbDataset(train_encodings[:n_toy], y_binary_train[:n_toy])
    train_dataset = IMDbDataset(train_encodings, y_binary_train)
    val_dataset = IMDbDataset(val_encodings, y_binary_valid)
    test_dataset = IMDbDataset(test_encodings, y_binary_test)

    training_args = TrainingArguments(
        output_dir='./results',  # output directory
        num_train_epochs=1,  # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,  # batch size for evaluation
        warmup_steps=500,  # number of warmup steps for learning rate scheduler
        weight_decay=0.01,  # strength of weight decay
        logging_dir='./logs',  # directory for storing logs
        logging_steps=10,
    )

    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=toy_dataset,  # training dataset
        eval_dataset=val_dataset,  # evaluation dataset
        callbacks=[TensorBoardCallback]
    )

    start = time.time()
    trainer.train()
    stop = time.time()

    print(f"Time to train the model: {stop-start}")

    model_dir = os.path.join(os.getcwd(), "Saved_models")
    model.save_pretrained(model_dir)


if __name__ == "__main__":
    main()

And if I execute it, I receive the following error:

  File "/home/me/Documents/CS_Programming_Machine_Learning/Projects/IMDB_sentiment_analysis_2/Comparison_models/train_and_eval_DistilBERT.py", line 61, in __getitem__
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
AttributeError: 'list' object has no attribute 'items'

If I use debugging, I see that indeed, self.encodings is a Python list.

I guess that I can fix it on my own, but I was wondering if I did something wrong or if the docs are outdated (I use version 4.5.1 of HF Transformers).

sgugger · April 17, 2021, 1:28pm

The problem lies in your added line:

toy_dataset = IMDbDataset(train_encodings[:n_toy], y_binary_train[:n_toy])

The train_encodings is a dictionary (with some added properties that let you take a slice like this), so you should do something like

toy_encodings = {k: v[:n_toy] for k, v in train_encodings.items()}

to keep a dictionary.
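
A small self-contained version of the same fix, using a plain dict with toy values as a stand-in for the tokenizer output (the helper name `take_first` is an assumption for illustration):

```python
def take_first(encodings, n):
    """Keep the first n examples of every field while preserving the dict shape."""
    return {key: values[:n] for key, values in encodings.items()}


# Plain dict standing in for the tokenizer output (toy values).
train_encodings = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102], [101, 999, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_encodings = take_first(train_encodings, 2)
```

The sliced result still has `.items()`, so it can be passed straight to the dataset class, for example `IMDbDataset(toy_encodings, y_binary_train[:n_toy])`.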

abercher · April 17, 2021, 7:11pm

Thank you very much for your answer sgugger.
Your suggestion solved my problem!
I’m sorry I didn’t realize the source of this issue myself

Have a nice day!

dineshmane · July 7, 2022, 6:53am

@joeddav, could you please update the tutorial link? I’m getting a 404 error. Thank you.

akshay0710 · August 19, 2022, 12:03am

@joeddav Thank you for the tutorial. I was trying to replicate the fine-tuning code with a different dataset and it worked. But when I changed the pretrained model from DistilBERT to something else like RoBERTa or XLNet, I got an error in the encoding function.

This is the encoding function:

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels

It didn’t throw an error when I used BERT or DistilBERT as the pretrained model and tokenizer, but when I used some other model in its place, this was the error that I got:

Traceback (most recent call last):
  File "huggingFace_NER.py", line 187, in <module>
    train_labels = encode_tags(train_tags, train_encodings)
  File "huggingFace_NER.py", line 70, in encode_tags
    doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
ValueError: NumPy boolean array indexing assignment cannot assign 100 input values to the 80 output values where the mask is true
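
A note on this error, which got no reply in the thread: the message says the offset-based mask selected 80 token positions while the document has 100 word labels. This suggests that the "offset starts at 0" rule does not identify first sub-tokens reliably for that tokenizer, or that some words were truncated. A commonly used alternative, sketched here as an assumption rather than as the tutorial's method, is to align labels through the word index of each token. The helper name `align_labels` and the toy values are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Label the first token of each word; ignore special tokens and later pieces."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as <s> or </s>
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a new word
        else:
            aligned.append(IGNORE_INDEX)  # continuation piece of the same word
        previous = word_id
    return aligned


# Toy example: 3 words, the second split into two pieces (hypothetical values).
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 3, 0]
labels = align_labels(word_ids, word_labels)
```

With a fast tokenizer called on pre-split words (`is_split_into_words=True`), `encodings.word_ids(batch_index=i)` should give the `word_ids` list for sentence `i`. Words dropped by truncation simply never appear in that list, so the label count cannot mismatch.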

jaymojnidar · September 15, 2023, 5:34pm

The link “https://huggingface.co/transformers/master/custom_datasets.html” is giving me a 404. Has the tutorial been removed for some reason?

armagetiton · February 12, 2024, 5:12pm

I have the same question. Where did the tutorial go?
