I am using the Fake News dataset that is used in this Google Colab notebook, with the goal of adapting this example. For full reproducibility, I uploaded the exact files I am using for training and testing to a GitHub repository here.
However, it appeared that some of the classes and methods were deprecated, so I tried to redo it using this notebook as a guide: IMDb Classification with Trainer.ipynb
I am getting an error after running train_dataset = ds_train.map(tokenize), where tokenize is defined below along with the rest of the code. I copied and pasted the error message after the code (full traceback in the comment).
In case anyone has further advice or comments, I also added the rest of the code I am planning to run; you will find it after the error message.
Thank you for viewing this post and I appreciate any help you can offer.
from nlp import Dataset
import pandas as pd
import numpy as np
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, EvalPrediction, glue_compute_metrics
# read csv in pandas
df_train = pd.read_csv("~/Downloads/fakenewstrain.csv")
df_test = pd.read_csv("~/Downloads/fakenewstest.csv")
# convert pandas df (only columns 'titletext' and 'label') to nlp Dataset
ds_train = Dataset.from_pandas(df_train[['titletext','label']])
ds_test = Dataset.from_pandas(df_test[['titletext','label']])
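# sanity check (should print ['titletext', 'label']): the Dataset keeps
# only the two columns selected from the pandas dataframe
print(ds_train.column_names)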
# set up configuration, tokenizer and model
config = AutoConfig.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_config(config)
# function to tokenize a line of text using the tokenizer
def tokenize(batch):
    return tokenizer(batch['titletext'],
                     max_length=64,
                     truncation=True,
                     padding=True,
                     return_tensors="pt")
# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)
Here is the error I am getting (the full traceback is in the comment):
ArrowInvalid: Could not convert tensor([[ 101, 8499, 4642, 1106, 5307, 1614, 1166, 1114, 27157, 2101, 1656, 1733, 119, 4613, 117, 2631, 113, 13597, 114, 5554, 8499, 112, 188, 1207, 13715, 176, 12328, 1500, 3215, 1786, 1656, 1733, 1120, 170, 185, 14695, 8037, 1303, 1113, 9170, 1115, 1103, 26961, 1524, 118, 6057, 1110, 1231, 7867, 27885, 1103, 1226, 107, 1115, 1119, 112, 188, 1151, 1773, 107, 1105, 1110, 2407, 102]]) with type Tensor: did not recognize Python value type when inferring an Arrow data type
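To narrow things down, I called tokenize directly on a single row outside of map; I am guessing the torch.Tensor values it returns (because of return_tensors="pt") are what Arrow cannot convert:

# diagnostic only, not part of the training script
sample = tokenize(ds_train[0])
print(type(sample['input_ids']))  # <class 'torch.Tensor'>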
Here is the rest of my code (feel free to ignore it, as it is not directly relevant to the question); I just figured I would add it in case anyone has helpful comments:
I have also been confused about how the labels are specified. From what I can tell, they are only referenced when calling set_format; however, columns appears to be just a list of column names, and I did not see anything in the documentation implying that Trainer looks for specific columns (see the quick check I added after the set_format calls below).
# Set the format of the Dataset and specify which columns to use
# (columns are "input_ids", "attention_mask", "token_type_ids" and "label").
# Do I need attention_mask since I'm not doing two sentences? Do I need token_type_ids?
train_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
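# Quick check I ran to understand set_format: column_names still lists every
# column, but indexing the dataset now only returns the columns given above,
# converted to torch tensors. The 'label' column just comes from the pandas df.
print(train_dataset.column_names)   # includes 'titletext', 'label', 'input_ids', ...
print(train_dataset[0].keys())      # only the columns passed to set_format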
def compute_metrics(p: EvalPrediction) -> dict:
    preds = np.argmax(p.predictions, axis=1)
    # note: data_args is copied from the GLUE example and is not defined anywhere in this script
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)
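# Alternative I considered, since data_args only exists in the GLUE example:
# hard-code a simple accuracy metric instead of glue_compute_metrics.
# def compute_metrics(p: EvalPrediction) -> dict:
#     preds = np.argmax(p.predictions, axis=1)
#     return {"accuracy": (preds == p.label_ids).mean()}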
training_args = transformers.TrainingArguments(
    output_dir="./Downloads/tmp/",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=16,
    per_gpu_eval_batch_size=64,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)
trainer = transformers.Trainer(model=model,
                               args=training_args,
                               train_dataset=train_dataset,
                               eval_dataset=test_dataset,
                               compute_metrics=compute_metrics)
trainer.train()
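One variant I was considering, based on the IMDb notebook, is to drop return_tensors="pt" from the tokenize function and let set_format handle the tensor conversion afterwards. I have not verified this, and tokenize_plain is just a hypothetical name for the variant:

# Untested variant: tokenize without return_tensors="pt" and let
# set_format('torch', ...) do the tensor conversion afterwards.
def tokenize_plain(batch):
    return tokenizer(batch['titletext'],
                     max_length=64,
                     truncation=True,
                     padding='max_length')

train_dataset = ds_train.map(tokenize_plain, batched=True)
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])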