Loading the WiC dataset for fine-tuning

Hi,

I see you are first working with a HuggingFace Dataset (as returned by the load_dataset function) and then converting it to a PyTorch Dataset.

Actually, the latter is not required. You can also tokenize your training and test splits in one go:

from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

# load local data as a HuggingFace DatasetDict with a train and a test split
dataset = load_dataset('json', data_files={'train': 'train.jsonl', 'test': 'test.jsonl'})

# instantiate the tokenizer and the model (2 labels: same sense or not)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def preprocess_data(examples):
    # encode a batch of sentence pairs (WiC asks whether the target word is
    # used in the same sense in both sentences, so the model needs both)
    encoding = tokenizer(examples["sentence1"], examples["sentence2"],
                         padding="max_length", truncation=True)
    # add labels as a list of ints
    encoding["labels"] = [int(label) for label in examples["label"]]

    return encoding

# tokenize both splits + add labels in one go
encoded_dataset = dataset.map(preprocess_data, batched=True)
# return PyTorch tensors when indexing the dataset
encoded_dataset.set_format("torch")

training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
)

trainer.train()
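
By default the Trainer only reports the evaluation loss, so if you also want accuracy on the test split you can pass a compute_metrics function. Here's a minimal sketch (this function is my own addition, not something from your code):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the gold labels
    logits, labels = eval_pred
    # take the highest-scoring class as the prediction
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

Pass compute_metrics=compute_metrics when constructing the Trainer above, and trainer.evaluate() will then report accuracy on the test split alongside the loss.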