Chapter 3 questions

Should the evaluation dataset have exactly the same set of labels as the training dataset? That is, is it acceptable for all of the evaluation labels to be contained in the training labels, while not all of the training labels appear in the evaluation labels?

For example, training has A and B, whereas evaluation has only A.

Hi everyone, and thank you for all the great stuff in the Hugging Face Course. This is my first time posting, so please bear with me. I am working my way through fine-tuning a sentiment analysis model and am getting an error that I cannot figure out. Hopefully someone can see what I am missing. I have included the code I am trying to run below.

I am getting an error that reads 'TypeError: len() of a 0-d tensor' when trying to run a batch, following the Chapter 3 'A full training' section. Any suggestions?

# Download the model from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def tokenizer_function(example):
    return tokenizer(example['review'], truncation=True, padding=True, max_length=200,
                     return_tensors='pt')

tokenized_datasets = ds.map(tokenizer_function, batched=True)

from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)

import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

tokenized_datasets = tokenized_datasets.remove_columns(['review', 'index'])
tokenized_datasets = tokenized_datasets.rename_column('sentiment', 'labels')
tokenized_datasets.set_format('torch')

from torch.utils.data import DataLoader
train_dataloader = DataLoader(
    tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)

eval_dataloader = DataLoader(
    tokenized_datasets['validation'], batch_size=8, collate_fn=data_collator)

Running this code is what triggers the TypeError:

for batch in train_dataloader:
    break

{k: v.shape for k, v in batch.items()}

#TypeError: len() of a 0-d tensor
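
For reference, the setup in the course section I am following looks roughly like this (a sketch from memory, so details may differ): it uses DataCollatorWithPadding rather than DataCollatorForTokenClassification, and no padding or return_tensors inside the map function, so I am not sure which difference matters here.

from transformers import DataCollatorWithPadding

def tokenize_function(example):
    # no padding here; the collator pads each batch dynamically instead
    return tokenizer(example['review'], truncation=True, max_length=200)

tokenized_datasets = ds.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)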

Hey, thanks for creating this course!
In the ‘A full training’ section, you use Accelerate:
from accelerate import Accelerator
But as far as I can see, it was not installed anywhere (unlike datasets, evaluate, and transformers), so I think it would be a good idea to add that, or at least comment on it.
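
In the meantime, a workaround is to add it to the notebook's install cell (a small sketch; on Colab a runtime restart may be needed afterwards):

!pip install datasets evaluate transformers[sentencepiece] accelerate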

If I want to do a link prediction task, does that mean I need to pre-generate all of the candidate statements and load them into the dataloader?
(Generation format: {"text": "box set is a subfield of computer hardware", "label": 1}, where the source node "box set" is paired with each candidate node.) If there are many candidate nodes in the link prediction task (say 40,000), then a single source node such as "box set" has to be matched 40,000 times, and the batch of pre-generated data becomes very large. Can I generate the link prediction statements from a dynamic template instead? But that kind of data cannot be preprocessed in advance, so how can I solve this?
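
One pattern I am considering (a sketch, not from the course; it assumes the candidate pairs can be stored compactly and the statement text built at batch time) is to keep only the raw node pairs in the dataset and build the text on the fly with set_transform, so nothing is pre-generated:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical compact storage: one row per (head, tail) candidate pair
pairs = Dataset.from_dict({
    "head": ["box set"],
    "tail": ["computer hardware"],
    "label": [1],
})

def on_the_fly(batch):
    # Build the textual statements only when a batch is actually requested
    texts = [f"{h} is a subfield of {t}" for h, t in zip(batch["head"], batch["tail"])]
    encoded = tokenizer(texts, truncation=True)
    encoded["labels"] = batch["label"]
    return encoded

pairs.set_transform(on_the_fly)  # applied lazily per batch, instead of via map()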

Hi, I am trying to create a new dataset repository with the command "huggingface-cli repo create your_dataset_name --type dataset" and I get this error: {"error":"You don't have the rights to create a dataset under this namespace"}

Is there something I am doing wrong?

Hi,
I want to fine-tune a model for QA. I only have documents (PDF and txt files); is there any way to prepare a dataset from these files?
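
For context, extracting the raw text from the files seems doable, e.g. something like the sketch below (assuming the pypdf package is acceptable, and a placeholder "documents" folder); it is turning that text into question/answer pairs for a QA dataset that I am unsure about.

from pathlib import Path
from pypdf import PdfReader

def extract_text(path):
    # Plain-text files are read directly; PDFs go through pypdf
    if path.suffix.lower() == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    return path.read_text(encoding="utf-8")

contexts = [extract_text(p) for p in Path("documents").iterdir()
            if p.suffix.lower() in (".pdf", ".txt")]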

Thanks in advance
BR: Disho


I’m running the script, and there seems to be some problem with Accelerate. Even after re-installing it, it shows that PartialState is not defined. I’m not able to understand the error message.
Thanks!

I tried to execute the code in the training notebook, which I found to be similar to the fine-tuning code. This is the error I am getting when executing the code in Colab: name 'PartialState' is not defined. I am getting this error on multiple lines.

I am also getting the same message
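
In similar reports this error seems to come from a version mismatch between transformers and accelerate, so one thing to try (a sketch, not an official fix) is upgrading both and then restarting the runtime:

!pip install -U accelerate transformers
# then restart the Colab runtime so the upgraded versions are actually loaded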

I have a question regarding the Trainer. In the fine-tuning section of Chapter 3 of the NLP course, one of the inputs to the Trainer is the tokenizer.
However, in the tutorial, the tokenizer is not passed as an input.
What is the difference, then? What about using a DataCollator?
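
As far as I can tell, passing tokenizer=... makes the Trainer fall back to a DataCollatorWithPadding internally, so the two setups below should behave the same for dynamic padding, but I would like confirmation. (A sketch based on the course's MRPC example, so the checkpoint and dataset are just the ones from the chapter.)

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)

checkpoint = "bert-base-uncased"
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Option 1: pass the tokenizer; Trainer builds a DataCollatorWithPadding for you
trainer = Trainer(
    model,
    TrainingArguments("test-trainer"),
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

# Option 2: pass an explicit data collator; the dynamic padding is the same
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model,
    TrainingArguments("test-trainer"),
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)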

In Chapter 3, in the third section, “Fine-tuning a model with the Trainer API or Keras”, TensorFlow version:
when we want to predict the exact labels for our evaluation dataset, the guide says

preds = model.predict(tf_validation_dataset)["logits"]
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)

I think a softmax should be applied to the predictions before taking the argmax, rather than calling argmax on the logits directly. Am I right? The same question applies to the PyTorch code as well.
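
A quick sanity check (sketch below, plain NumPy) suggests the predicted classes come out the same either way, since softmax is monotonic per row; so maybe the softmax only matters if actual probabilities are needed?

import numpy as np

logits = np.array([[ 2.0, -1.0,  0.5],
                   [ 0.1,  3.0, -2.0]])

# softmax by hand, row-wise
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# softmax is monotonic, so the argmax over probabilities equals the argmax over logits
assert (probs.argmax(axis=1) == logits.argmax(axis=1)).all()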

@sgugger Thanks for the amazing course. I was trying to use DataCollatorWithPadding in the following code, but wanted to check if I am on the right path.

!pip -q  install transformers datasets accelerate sentence-transformers  iterative-stratification umap-learn wandb hdbscan altair altair-data-server

import numpy as np 
import pandas as pd 
from tqdm import tqdm_notebook
import os, gc, shutil, re, warnings
import pickle
warnings.filterwarnings("ignore")
# set the max columns to none
pd.set_option('display.max_columns', None)

import random
SEED=75
random.seed(SEED)

import joblib
from sklearn.manifold import TSNE
from umap import UMAP 


from torch.nn.functional import normalize
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import torch
import torch.nn as nn
import transformers
from transformers import (
    AutoModel, AutoConfig, 
    AutoTokenizer, logging,
    AdamW, get_linear_schedule_with_warmup,
    DataCollatorWithPadding,
    Trainer, TrainingArguments
)
from transformers.modeling_outputs import SequenceClassifierOutput

logging.set_verbosity_error()
logging.set_verbosity_warning()


import wandb

#### plots
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors
from IPython.core.display import display, HTML
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
### Plotly settings
temp = dict(layout=go.Layout(font=dict(family="Ubuntu", size=14),
                             height=600,
                             plot_bgcolor="#ededed",
                             paper_bgcolor="#ededed"))

# Load data from huggingface

from datasets import load_dataset

dataset = load_dataset("cdsi-nlp-workshops/arxiv_classification")

dataset

raw_train_dataset = dataset["train"]

text_list = raw_train_dataset['text']

raw_train_dataset.features

import re
import string

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove unwanted characters
    text = re.sub(r"\n", " ", text)  # Replace newline characters with space
    text = re.sub(r"\s+", " ", text)  # Replace multiple spaces with a single space
    text = text.strip()  # Remove leading/trailing whitespaces

    # Remove "abstract" part
    text = re.sub(r"^abstract\s+", "", text)

    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))


    return text


# Clean each "text" string
sentences = list(map(lambda text: preprocess_text(text), text_list))

print("Cleaned text:")
print(sentences)


from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Reduce batch size and sequence length if necessary
max_batch_size = 8
max_sequence_length = 128
num_sentences = len(sentences)

if num_sentences > max_batch_size:
    encoded_input = {key: value[:max_batch_size] for key, value in encoded_input.items()}

if encoded_input['input_ids'].size(1) > max_sequence_length:
    encoded_input = {key: value[:, :max_sequence_length] for key, value in encoded_input.items()}

# Compute token embeddings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
for key, value in encoded_input.items():
    encoded_input[key] = value.to(device)

with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)

Please advise on how I can incorporate it within my code.
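What I have in mind is roughly the sketch below (it reuses the tokenizer and raw_train_dataset from the code above, and assumes the only string column to drop is "text"): tokenize without padding inside map, then let DataCollatorWithPadding pad each batch dynamically. Does this look right?

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize_function(batch):
    # no padding here; the collator pads each batch dynamically instead
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized_train = raw_train_dataset.map(tokenize_function, batched=True)
tokenized_train = tokenized_train.remove_columns(["text"])  # drop the raw text before batching
tokenized_train.set_format("torch")

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(tokenized_train, batch_size=8, shuffle=True, collate_fn=data_collator)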
Looking forward to hearing from you
Thanks,
Andy

Switching Colab to a GPU caused several problems. First off, you need to include pip install accelerate (and then you need to restart the runtime). Secondly, the GPU runtimes seemed much more flaky in reaching the datasets: I had constant timeouts etc. trying to reach fbaidownloads (I’m guessing at this point, as I lost the error). These didn’t happen on either normal instances or TPU (though TPUs don’t work, due to the variable padding).

Hi, I tripped over this also when I tried to use the GPU-enabled “Fine-tuning a model with the Trainer API or Keras” notebook on Colab. I followed the error message suggestions and even looked on Stack Overflow, and was kind of stuck for half an hour or more…

It seems so simple to just add accelerate in the opening pip command:
!pip install datasets evaluate accelerate transformers[sentencepiece]

@course_moderators,
Any chance this could be added to the notebook in the near future?

And I agree with emifjo, this is a great course!

Thanks so much for the informational guides!

I have been following the TensorFlow version of the code/guides, and was confused at the “A full training” chapter because it is the PyTorch version of the code.

Please excuse my ignorance, I’m not sure how to connect the “Full training” chapter to the previous chapter. Can I ask:

  1. What are the advantages of writing a full training loop in PyTorch compared to using Keras in the previous chapter (“Fine-tuning a model”)?

  2. Is there a TensorFlow equivalent to Accelerate (which the course uses with PyTorch) for distributed training on multiple GPUs (see the sketch just below)? Or would you recommend that we just learn and use PyTorch instead?
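
The sketch I mean for question 2 (not from the course; build_model, tf_train_dataset and tf_validation_dataset are placeholders for the Keras setup from the previous chapter) is the tf.distribute approach:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    # build and compile the Keras model inside the scope so its variables are mirrored
    model = build_model()  # placeholder for the TFAutoModelForSequenceClassification setup
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)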

Thank you


Hi,

In the “Preprocessing the data” section, we are defining a tokenize function as below:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

And then applying it to our dataset using map:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Why does the tokenize function return an attention_mask for the above code? Since we haven’t applied padding yet, there should not be any attention_mask, right?
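
A quick check (sketch below) seems to show that the tokenizer always returns an attention_mask, just filled with 1s until padding introduces 0s; is that the intended behaviour?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("Hello world", "Nice to meet you", truncation=True)
print(encoded["attention_mask"])   # all 1s: nothing is masked because nothing is padded yet

padded = tokenizer(["Hello world", "A much longer second sentence here"], padding=True)
print(padded["attention_mask"])    # now the shorter sequence gets trailing 0s for the padding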

Thanks.


Why do I get a worse score than BERT (base) when I fine-tune a model on the GLUE SST-2 dataset?

:pencil2: Try it out! Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.

I also made a new topic here; I will delete it if it should be posted under this topic instead.

There either seems to be incorrectly stated information, or I am missing something.
In section 3, “Fine-tuning a model with the Trainer API”, the code example has us use “bert-base-uncased”, both for PyTorch and TensorFlow.
But then later, in the Evaluation subsection, there is this passage:
“The table in the BERT paper reported an F1 score of 88.9 for the base model. That was the uncased model while we are currently using the cased model, which explains the better result.”
If we follow the example code, we too should be using the ‘uncased’ model, right?

Is it possible to fine-tune a model without a head and then use multiple heads on the same model? If so, how can I do it?
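
What I have in mind is something like the sketch below: a shared, headless AutoModel body with separate task-specific heads on top (the class and head names here are just placeholders, not from the course). Is this the recommended way, or is there built-in support for it?

import torch.nn as nn
from transformers import AutoModel

class MultiHeadModel(nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels_a=2, num_labels_b=5):
        super().__init__()
        self.body = AutoModel.from_pretrained(checkpoint)     # shared encoder without a head
        hidden_size = self.body.config.hidden_size
        self.head_a = nn.Linear(hidden_size, num_labels_a)    # head for task A
        self.head_b = nn.Linear(hidden_size, num_labels_b)    # head for task B

    def forward(self, task, **inputs):
        # use the [CLS] token representation as the pooled sentence embedding
        hidden_state = self.body(**inputs).last_hidden_state[:, 0]
        return self.head_a(hidden_state) if task == "a" else self.head_b(hidden_state)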

How do I write the code that goes in a training_function() to be used with the Accelerator in a notebook?
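
From what I understand of the chapter, the whole training loop goes inside that function and you then run it with notebook_launcher; a rough sketch with the loop body elided (the prepare() call is the pattern from the course):

from accelerate import Accelerator, notebook_launcher

def training_function():
    accelerator = Accelerator()
    # build the model, optimizer and dataloaders here, then wrap them:
    # model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
    # ...training loop using accelerator.backward(loss) instead of loss.backward()...

# in a notebook, launch the function on the available devices
notebook_launcher(training_function, num_processes=1)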