Hi,

I see that you're first working with a HuggingFace Dataset (as returned by the load_dataset function), and that you're then converting it to a PyTorch Dataset. The latter isn't actually required, as the Trainer accepts HuggingFace Datasets directly. You can also tokenize your training and test splits in one go:
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
# load local data as a HuggingFace Dataset
dataset = load_dataset('json', data_files={'train': 'train.jsonl', 'test': 'test.jsonl'})
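# each line of train.jsonl / test.jsonl is assumed to contain a "sentence1"
# field with the text and a "label" field with the class, e.g.:
# {"sentence1": "A sentence to classify.", "label": 0}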
# load a pre-trained tokenizer and model (assuming bert-base-uncased here; use your own checkpoint)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)  # set num_labels to match your task

def preprocess_data(examples):
    # encode a batch of sentences
    encoding = tokenizer(examples["sentence1"], padding="max_length", truncation=True)
    # add labels as a list
    encoding["labels"] = examples["label"]
    return encoding
# tokenize sentences + add labels (batched=True so the function receives batches of examples)
encoded_dataset = dataset.map(preprocess_data, batched=True)
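# the mapped dataset now also contains input_ids, token_type_ids and attention_mask columns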
# make the dataset return PyTorch tensors
encoded_dataset.set_format("torch")
training_args = TrainingArguments("test_trainer")  # "test_trainer" is the output directory

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
)
trainer.train()
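After training, you can evaluate on the test split with the same Trainer; a minimal sketch (by default this reports only the loss, unless you pass a compute_metrics function when creating the Trainer):

metrics = trainer.evaluate()
print(metrics)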