I am using the fake news dataset that is used in this Google Colab notebook, with the goal of adapting this example. For full reproducibility, I uploaded the exact files I am using for training and testing to a GitHub repository here.
However, it appeared that some of the classes and methods were deprecated, so I was trying to redo it using this notebook as a guide: IMDb Classification with Trainer.ipynb
I am getting an error after running train_dataset = ds_train.map(tokenize); tokenize is defined below along with the rest of the code. I copied and pasted the error message after the code (the full traceback is in a comment).
In case anyone has further advice or comments, I also added the rest of the code I am planning to run, which you will find after the error message.
Thank you for viewing this post and I appreciate any help you can offer.
from nlp import Dataset
import pandas as pd
from torch import tensor
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, EvalPrediction
import torch
import transformers
import numpy as np
# read csv in pandas
df_train = pd.read_csv("~/Downloads/fakenewstrain.csv")
df_test = pd.read_csv("~/Downloads/fakenewstest.csv")
# convert pandas df (only columns 'titletext' and 'label') to nlp Dataset
ds_train = Dataset.from_pandas(df_train[['titletext','label']])
ds_test = Dataset.from_pandas(df_test[['titletext','label']])
# set up configuration, tokenizer and model
config = AutoConfig.from_pretrained('bert-base-uncased')    
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_config(config)
# function to tokenize a line of text using tokenizer
def tokenize(batch):
    return tokenizer(batch['titletext'], 
                     max_length = 64, 
                     truncation = True,
                     padding = True, 
                     return_tensors = "pt")
# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)
Here is the error I am getting (the full traceback is in a comment):
ArrowInvalid: Could not convert tensor([[  101,  8499,  4642,  1106,  5307,  1614,  1166,  1114, 27157,  2101,           1656,  1733,   119,  4613,   117,  2631,   113, 13597,   114,  5554,           8499,   112,   188,  1207, 13715,   176, 12328,  1500,  3215,  1786,           1656,  1733,  1120,   170,   185, 14695,  8037,  1303,  1113,  9170,           1115,  1103, 26961,  1524,   118,  6057,  1110,  1231,  7867, 27885,           1103,  1226,   107,  1115,  1119,   112,   188,  1151,  1773,   107,           1105,  1110,  2407,   102]]) with type Tensor: did not recognize Python value type when inferring an Arrow data type
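For what it is worth, my only guess so far (and I am not sure it is right) is that Dataset.map stores its results in Arrow, which may not know what to do with torch tensors, so I have been considering a variant of tokenize that returns plain lists and leaves the tensor conversion to set_format:

# Possible alternative tokenize (just a guess on my part): return plain lists
# so Dataset.map can store them in Arrow, and let set_format('torch', ...)
# handle the conversion to tensors afterwards.
def tokenize(batch):
    return tokenizer(batch['titletext'],
                     max_length=64,
                     truncation=True,
                     padding='max_length')  # pad every row to the same length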
Here is the rest of my code (feel free to ignore it, as it is not directly relevant to the question); I just figured I would add it in case anyone has any helpful comments:
I have actually been confused about how the labels are specified. From what I can see, they are only referenced when using set_format; however, columns looks like it is just a list of column names, and I did not see anything in the documentation implying that Trainer specifically looks for certain columns. (I added a small sanity-check snippet after the code below to illustrate what I mean.)
# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)
# Set format of Dataset, and specify columns to use 
# (columns are "input_ids", "attention_mask", "token_type_ids" and "label")
# Do I need attention_mask since I'm not doing two sentences? Do I need token_type_ids?
train_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
def compute_metrics(p: EvalPrediction) -> dict:
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)
training_args = transformers.TrainingArguments(
    output_dir="./Downloads/tmp/",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=16,
    per_gpu_eval_batch_size=64,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)
trainer = transformers.Trainer(model = model, 
                  args = training_args, 
                  train_dataset = train_dataset, 
                  eval_dataset = test_dataset,
                  compute_metrics = compute_metrics)
trainer.train()
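Related to my confusion about the labels, here is a small sanity check I was planning to run (assuming the map and set_format calls above go through) to see exactly which keys each example exposes to the Trainer:

# Sanity check (assumes train_dataset was built as above): after set_format,
# indexing the dataset should return a dict of tensors for the selected columns,
# which is what the Trainer's data collator receives per example.
sample = train_dataset[0]
print(sample.keys())    # expecting something like dict_keys(['input_ids', 'token_type_ids', 'label'])
print(sample['label'])  # the label for the first example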