While working on an ML task, I use Trainer and load_dataset to fine-tune a model on a custom corpus, but a strange error appears during training. It seems the forum doesn't allow uploading CSV files, so below are my code and some information about the CSV file I used:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from transformers import AutoTokenizer, AutoModel
import torch
from datasets import load_dataset
text = 'excerpt'
raw_datasets = load_dataset('csv', data_files='./corpus.csv')
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples[text], padding="max_length", truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
from sklearn.model_selection import ShuffleSplit
rs = ShuffleSplit(n_splits=1, test_size=.25, random_state=0)
for train_index, test_index in rs.split(range(0, 2834)):
    pass
small_train_dataset = tokenized_datasets["train"].select(train_index)
small_eval_dataset = tokenized_datasets["train"].select(test_index)
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
from transformers import Trainer
trainer = Trainer(
model=model, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()
# CSV FILE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2834 entries, 0 to 2833
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   excerpt  2834 non-null   object
dtypes: object(1)
memory usage: 22.3+ KB
Here the excerpt column contains text values. Training fails with the following error: KeyError: 'loss'. Please explain what is wrong.
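A possible explanation, offered as a sketch rather than a confirmed diagnosis: Trainer reads the loss from the model's output, and a masked-language model only returns a loss when each batch contains a labels field. The tokenized dataset above has only the tokenizer's outputs and no labels, which would be consistent with KeyError: 'loss'. In transformers, labels for masked-language modelling are normally produced by DataCollatorForLanguageModeling, passed to Trainer through its data_collator argument. The plain-Python snippet below only illustrates what such labels look like. The function name make_mlm_example and the token ids are hypothetical, and the masking is simplified (every selected token is replaced by [MASK]; the real collator also leaves some unchanged or swaps in random tokens).

```python
import random

IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss ignores by default
MASK_TOKEN_ID = 103   # assumed id of [MASK]; check tokenizer.mask_token_id for your model


def make_mlm_example(input_ids, special_ids=(), mlm_probability=0.15, seed=None):
    """Hypothetical helper: build (masked input_ids, labels) for one sequence.

    Positions chosen for masking keep their original token id in `labels`;
    every other position gets IGNORE_INDEX so it does not contribute to the loss.
    """
    rng = random.Random(seed)
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token_id in enumerate(input_ids):
        if token_id in special_ids:
            continue  # never mask special tokens such as [CLS] or [SEP]
        if rng.random() < mlm_probability:
            labels[pos] = token_id           # the model must predict the original token
            masked_ids[pos] = MASK_TOKEN_ID  # simplified: always replace with [MASK]
    return {"input_ids": masked_ids, "labels": labels}


# Illustrative token ids only; 101 and 102 stand in for [CLS] and [SEP].
example = make_mlm_example(
    [101, 1188, 1110, 170, 5650, 102],
    special_ids={101, 102},
    mlm_probability=0.5,
    seed=0,
)
print(example)
```

With a labels field like this in every batch, the model can compute a loss, which is what Trainer expects to find in the output.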