Chapter 3 questions

@sgugger Thanks for the amazing course. I was trying to use DataCollatorWithPadding in the following code, but wanted to check whether I am on the right path.

!pip -q  install transformers datasets accelerate sentence-transformers  iterative-stratification umap-learn wandb hdbscan altair altair-data-server

import numpy as np 
import pandas as pd 
from tqdm import tqdm_notebook
import os, gc, shutil, re, warnings
import pickle
warnings.filterwarnings("ignore")
# set the max columns to none
pd.set_option('display.max_columns', None)

import random
SEED=75
random.seed(SEED)

import joblib
from sklearn.manifold import TSNE
from umap import UMAP 


from torch.nn.functional import normalize
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import torch
import torch.nn as nn
import transformers
from transformers import (
    AutoModel, AutoConfig, 
    AutoTokenizer, logging,
    AdamW, get_linear_schedule_with_warmup,
    DataCollatorWithPadding,
    Trainer, TrainingArguments
)
from transformers.modeling_outputs import SequenceClassifierOutput

logging.set_verbosity_error()
logging.set_verbosity_warning()


import wandb

#### plots
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors
from IPython.core.display import display, HTML
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
### Plotly settings
temp = dict(layout=go.Layout(font=dict(family="Ubuntu", size=14),
                             height=600,
                             plot_bgcolor="#ededed",
                             paper_bgcolor="#ededed"))

# Load data from huggingface

from datasets import load_dataset

dataset = load_dataset("cdsi-nlp-workshops/arxiv_classification")

dataset

raw_train_dataset = dataset["train"]

text_list = raw_train_dataset['text']

raw_train_dataset.features

import re
import string

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove unwanted characters
    text = re.sub(r"\n", " ", text)  # Replace newline characters with space
    text = re.sub(r"\s+", " ", text)  # Replace multiple spaces with a single space
    text = text.strip()  # Remove leading/trailing whitespaces

    # Remove "abstract" part
    text = re.sub(r"^abstract\s+", "", text)

    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))


    return text


# Clean each "text" string
sentences = list(map(lambda text: preprocess_text(text), text_list))

print("Cleaned text:")
print(sentences)


from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Reduce batch size and sequence length if necessary
max_batch_size = 8
max_sequence_length = 128
num_sentences = len(sentences)

if num_sentences > max_batch_size:
    encoded_input = {key: value[:max_batch_size] for key, value in encoded_input.items()}

if encoded_input['input_ids'].size(1) > max_sequence_length:
    encoded_input = {key: value[:, :max_sequence_length] for key, value in encoded_input.items()}

# Compute token embeddings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
for key, value in encoded_input.items():
    encoded_input[key] = value.to(device)

with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)

Please advise on how I can incorporate it within my code.
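Concretely, this is roughly what I was thinking of: tokenize without padding and let DataCollatorWithPadding pad each batch dynamically (a rough sketch, not tested end-to-end; sentences and mean_pooling refer to my code above):

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2').eval()

# Tokenize WITHOUT padding; the collator pads each batch to its own longest sequence
features = [tokenizer(s, truncation=True, max_length=128) for s in sentences]

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
loader = DataLoader(features, batch_size=8, collate_fn=data_collator)

all_embeddings = []
with torch.no_grad():
    for batch in loader:
        # (move model and batch to GPU here as in my code above if needed)
        model_output = model(**batch)
        emb = mean_pooling(model_output, batch['attention_mask'])
        all_embeddings.append(F.normalize(emb, p=2, dim=1))

sentence_embeddings = torch.cat(all_embeddings)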
Looking forward to hearing from you
Thanks,
Andy

Switching Colab to a GPU caused several problems. First off, you need to include pip install accelerate (and then you need to restart the runtime). Secondly, the GPU runtimes seemed much more flaky in reaching the datasets: I had constant timeouts trying to reach fbaidownloads (I'm guessing at this point, as I lost the error). These didn't happen on either normal instances or TPUs (though TPUs don't work anyway, due to the variable padding).

Hi, I tripped over this too as I tried to use the GPU-enabled "Fine-tuning a model with the Trainer API or Keras" notebook on Colab. I followed the error message suggestions and even looked on Stack Overflow; I was stuck for half an hour or more…

It seems so simple to just add accelerate to the opening pip command:
!pip install datasets evaluate accelerate transformers[sentencepiece]

@course_moderators,
Any chance this could be added to the notebook in the near future?

And I agree with emifjo, this is a great course!

Thanks so much for the informational guides!

I have been following the TensorFlow version of the code/guides, and was confused by the "A full training" chapter because it only has the PyTorch code.

Please excuse my ignorance, I’m not sure how to connect the “Full training” chapter to the previous chapter. Can I ask:

  1. What are the advantages of writing a full training loop in PyTorch compared to using Keras in the previous chapter ("Fine-tuning a model")?

  2. Is there a TensorFlow equivalent of something like PyTorch's Accelerate for distributed training on multiple GPUs? Or would you recommend that we just learn and use PyTorch instead?

Thank you

1 Like

Hi,

In the “Preprocessing the data” section, we are defining a tokenize function as below:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

And then applying it to our dataset using map:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Why does the tokenize function return attention_masks for the above code? Since we haven’t applied padding yet, there should not be any attention_masks, right?
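For reference, a quick way to see what the tokenizer actually returns here (using the same bert-base-uncased checkpoint as the chapter):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("This is the first sentence.", "This is the second one.", truncation=True)
print(encoded["attention_mask"])
# All 1s, one per real token: no 0s appear until padding is applied later (e.g. by DataCollatorWithPadding)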

Thanks.

1 Like

Why do I get a worse score than BERT (base) when I fine-tune a model on the GLUE SST-2 dataset?

:pencil2: Try it out! Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.

I also made a new topic for this; I will delete it if it should go under this topic instead.

Either some information is stated incorrectly, or I am missing something.
In section 3, "Fine-tuning a model with the Trainer API", the code example has us use "bert-base-uncased", both for PyTorch and TensorFlow.
But then later in the Evaluation subsection there is this passage:
“The table in the BERT paper reported an F1 score of 88.9 for the base model. That was the uncased model while we are currently using the cased model, which explains the better result.”
If we follow the example code, we too should be using the ‘uncased’ model, right?

Is it possible to fine-tune a model without a head and then use multiple heads on top of the same model? If so, how can I do it?
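Something like this is what I have in mind (a hypothetical sketch; bert-base-uncased stands in for whatever fine-tuned body you saved with save_pretrained):

from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForTokenClassification

checkpoint = "bert-base-uncased"  # placeholder for a fine-tuned body saved with save_pretrained

# The shared body without any task head...
body = AutoModel.from_pretrained(checkpoint)

# ...and two different task heads, each newly initialized on top of the same body weights
classification_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
token_classification_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)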

How do I write the code that goes in a training_function() to be used with Accelerate in a notebook?
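In case it helps frame the question, this is the kind of skeleton I mean, based on the "A full training" section (the body of training_function is just a placeholder):

from accelerate import Accelerator, notebook_launcher

def training_function():
    accelerator = Accelerator()
    # Build the dataloaders, model, optimizer and scheduler here, then:
    #   model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
    # and in the training loop replace loss.backward() with accelerator.backward(loss)
    ...

# Launches the function on the available device(s) from inside a notebook
notebook_launcher(training_function, num_processes=1)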

In the fine-tuning chapter, the "A full training" section isn't available for TensorFlow.

2 Likes

:pencil2: Try it out! Replicate the preprocessing on the GLUE SST-2 dataset. It’s a little bit different since it’s composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.
How many input formats do the GLUE tasks have: single sentences, sentence pairs, or even three sentences?
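As far as I can tell from the dataset cards, each GLUE task has either a single sentence or a pair (never three), so a generic preprocessing function could look something like this (an untested sketch using the usual column names):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Column names for each GLUE task (single sentence or a pair)
task_to_keys = {
    "cola": ("sentence", None),
    "sst2": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
    "qqp": ("question1", "question2"),
    "stsb": ("sentence1", "sentence2"),
    "mnli": ("premise", "hypothesis"),
    "qnli": ("question", "sentence"),
    "rte": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def tokenize_function(example, task):
    key1, key2 = task_to_keys[task]
    if key2 is None:
        return tokenizer(example[key1], truncation=True)
    return tokenizer(example[key1], example[key2], truncation=True)

# e.g. tokenized = raw_datasets.map(lambda ex: tokenize_function(ex, "sst2"), batched=True)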

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

ImportError                               Traceback (most recent call last)
in <cell line: 3>()
      1 from transformers import TrainingArguments
      2 
----> 3 training_args = TrainingArguments("test-trainer")

4 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
   1770         if not is_sagemaker_mp_enabled():
   1771             if not is_accelerate_available(min_version="0.20.1"):
-> 1772                 raise ImportError(
   1773                     "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
   1774                 )

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
“Open Examples” button below.

Hi All,

I want to fine tune a summarization model on a custom dataset. Are there any guidelines around how much data I would need, will data from a different domain help, etc.?

I am trying to summarize conversations. In most cases, these conversations will involve just two people. I finetuned google/flan-t5-base and facebook/bart-large-cnn on about 1000 examples, results are good but not as good as GPT-3.5.

Do I need to gather and train on more data? If I don’t have access to data for my use case, can I use data from any other domain as long as they are conversations? Say, from podcasts?

For how long do I train the model for? Are there any best practices around choosing number of epochs, etc.?

I am looking to improve the performance of my model and can really use some help! I have looked online but can’t find a clear answer. I understand that in a lot of cases, you need to experiment what works for you but there are so many possibilities and I am looking for a starting point, as a beginner in this field.

Thank you for your help!

When I run trainer.train(), it fails with the following error:
TypeError: 'NoneType' object is not callable.

The attention_mask returned by tokenize_function is the same length as the tokenized sequence, and at this point it is all 1s. DataCollatorWithPadding then adds 0s to the attention_mask (and padding tokens to input_ids) up to the length of the longest sequence in the batch.

(This is based on the example in the chapter.)
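To make that concrete, a small sketch (the sentences and names are just for illustration):

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Two sequences of different lengths, tokenized without padding
features = [tokenizer("Short sentence."),
            tokenizer("A noticeably longer sentence with quite a few more tokens in it.")]
print([len(f["attention_mask"]) for f in features])  # different lengths, all values are 1

batch = data_collator(features)
print(batch["attention_mask"])  # both rows padded to the longest length; padded positions are 0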
1 Like

I am working on the last "Try it out!" in the Chapter 3 section "Fine-tuning a model with the Trainer API". Everything goes fine until the line trainer.train(), as shown in the image.

Please help me to solve this error.

1 Like

In Chapter 3, section "Fine-tuning a model with the Trainer API", I run into the following error when I instantiate the TrainingArguments class.
The accelerate module is already installed in the requested version.

The issue happens in the linked Colab exercise notebook.

Any idea how to fix it?


---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-3-11170ce17e38> in <cell line: 3>()
      1 from transformers import TrainingArguments
      2 
----> 3 training_args = TrainingArguments("test-trainer")

4 frames

/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
   1799         if not is_sagemaker_mp_enabled():
   1800             if not is_accelerate_available(min_version="0.20.1"):
-> 1801                 raise ImportError(
   1802                     "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
   1803                 )

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`


---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
1 Like

Have you finished this? :mask:

Hi! I don't understand why we tokenize the dataset with map and then also pass the tokenizer as the tokenizer argument of the Trainer. What does the tokenizer argument of the Trainer actually do?
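My current understanding (happy to be corrected): the map call does the actual tokenization of the dataset columns, while the tokenizer passed to the Trainer is used to pad each batch (when no data_collator is given, the Trainer falls back to a DataCollatorWithPadding built from that tokenizer) and to save the tokenizer files alongside the model checkpoints. So the chapter's setup is roughly equivalent to this sketch:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

raw_datasets = load_dataset("glue", "mrpc")
tokenized_datasets = raw_datasets.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True), batched=True
)

trainer = Trainer(
    model,
    TrainingArguments("test-trainer"),
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    # tokenizer=tokenizer would give the same default collator and also save the tokenizer with checkpoints
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)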

I am trying to do the Transformers course and am running into trouble in the lesson "Fine-tuning a model with the Trainer API". I am running it on a free Google Colab instance with a T4 GPU. All of the provided code works (I had to add a !pip install transformers torch after the !pip install datasets evaluate transformers[sentencepiece] to make it work), but at the bottom we were told to "Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2." Here, I ran into a strange error.

Below is the code I was trying to use for this exercise. Each code cell is shown as its own block here.

single_sentence_dataset = load_dataset("glue", "sst2")
single_sentence_dataset
single_sentence_dataset['train'].features
single_sentence_dataset['train'][0]
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_single_sentence_function(example):
    return tokenizer(example["sentence"], truncation=True)


tokenized_single_sentence_datasets = single_sentence_dataset.map(tokenize_single_sentence_function, batched=True)
single_sentence_data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_single_sentence_datasets['train'][0]
single_sentence_training_args = TrainingArguments("sst2-trainer", evaluation_strategy="epoch")
from transformers import AutoModelForSequenceClassification

single_sentence_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
def compute_metrics_single_sentence(eval_preds):
    metric = evaluate.load("glue", "sst2")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
trainer = Trainer(
    single_sentence_model,
    single_sentence_training_args,
    train_dataset=tokenized_single_sentence_datasets["train"],
    eval_dataset=tokenized_single_sentence_datasets["validation"],
    data_collator=single_sentence_data_collator,
    tokenizer=tokenize_single_sentence_function,
    compute_metrics=compute_metrics_single_sentence
)
trainer.train()

Everything runs, except for trainer.train(). When I call that, it gets to [ 501/25257 00:46 < 38:17, 10.77 it/s, Epoch 0.06/3] (where does the 25257 come from, anyway? There are 67349 training examples here) before crashing with the following error:

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   2280 
   2281         if self.control.should_save:
-> 2282             self._save_checkpoint(model, trial, metrics=metrics)
   2283             self.control = self.callback_handler.on_save(self.args, self.state, self.control)
   2284 

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   2348         run_dir = self._get_output_dir(trial=trial)
   2349         output_dir = os.path.join(run_dir, checkpoint_folder)
-> 2350         self.save_model(output_dir, _internal_call=True)
   2351         if self.is_deepspeed_enabled:
   2352             # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in save_model(self, output_dir, _internal_call)
   2841 
   2842         elif self.args.should_save:
-> 2843             self._save(output_dir)
   2844 
   2845         # Push to the Hub when `save_model` is called by the user.

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _save(self, output_dir, state_dict)
   2904 
   2905         if self.tokenizer is not None:
-> 2906             self.tokenizer.save_pretrained(output_dir)
   2907 
   2908         # Good practice: save your training arguments together with the trained model

AttributeError: 'function' object has no attribute 'save_pretrained'

Removing the compute_metrics argument does not change anything.

Does anyone know what is going on? I am not explicitly telling it to save anything. Why is this failing when the provided code works?

Thank you!

Edit:

I was able to train a model for SST-2 successfully without using the Trainer API in the next lesson. Why does it not work here?
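If I am reading the traceback correctly, the problem is the tokenizer argument: the Trainer expects the tokenizer object itself (it calls save_pretrained on it when it saves a checkpoint), not the preprocessing function. So something along these lines should avoid the crash (same variables as in the post above):

trainer = Trainer(
    single_sentence_model,
    single_sentence_training_args,
    train_dataset=tokenized_single_sentence_datasets["train"],
    eval_dataset=tokenized_single_sentence_datasets["validation"],
    data_collator=single_sentence_data_collator,
    tokenizer=tokenizer,  # the tokenizer object, not tokenize_single_sentence_function
    compute_metrics=compute_metrics_single_sentence,
)
trainer.train()

As for the 25257: with the default per_device_train_batch_size of 8 and 3 epochs, that is presumably ceil(67349 / 8) = 8419 optimization steps per epoch times 3 epochs = 25257 steps in total.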