I am using the Fake News dataset that is used in this Google Colab notebook, with the goal of adapting that example. For full reproducibility, I uploaded the exact files I am using for training and testing to a GitHub repository here.
However, some of the classes and methods appeared to be deprecated, so I was trying to redo it using this notebook as a guide: IMDb Classification with Trainer.ipynb
I am getting an error after running train_dataset = ds_train.map(tokenize), where tokenize is defined below along with the rest of the code. I copied and pasted the error message after the code (see the full error in the comment).
In case anyone has further advice or comments I also added the rest of the code I am planning to run, which you will find after the error message.
Thank you for viewing this post and I appreciate any help you can offer.
from nlp import Dataset
import pandas as pd
from torch import tensor
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, EvalPrediction
import torch

# read csv in pandas
df_train = pd.read_csv("~/Downloads/fakenewstrain.csv")
df_test = pd.read_csv("~/Downloads/fakenewstest.csv")

# convert pandas df (only columns 'titletext' and 'label') to nlp Dataset
ds_train = Dataset.from_pandas(df_train[['titletext','label']])
ds_test = Dataset.from_pandas(df_test[['titletext','label']])

# set up configuration, tokenizer and model
config = AutoConfig.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_config(config)

# function to tokenize a line of text using tokenizer
def tokenize(batch):
    return tokenizer(batch['titletext'], max_length = 64, truncation = True, padding = True, return_tensors = "pt")

# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)
Here is the error I am getting:
Full error is in the comment.
ArrowInvalid: Could not convert tensor([[ 101, 8499, 4642, 1106, 5307, 1614, 1166, 1114, 27157, 2101, 1656, 1733, 119, 4613, 117, 2631, 113, 13597, 114, 5554, 8499, 112, 188, 1207, 13715, 176, 12328, 1500, 3215, 1786, 1656, 1733, 1120, 170, 185, 14695, 8037, 1303, 1113, 9170, 1115, 1103, 26961, 1524, 118, 6057, 1110, 1231, 7867, 27885, 1103, 1226, 107, 1115, 1119, 112, 188, 1151, 1773, 107, 1105, 1110, 2407, 102]]) with type Tensor: did not recognize Python value type when inferring an Arrow data type
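Note: the message suggests that the values returned by tokenize are torch tensors, which the Arrow backend could not store. One possible workaround, sketched below under that assumption, is to return plain Python lists from tokenize (that is, leave out return_tensors = "pt") and let set_format do the conversion to tensors later. The toy_tokenizer and is_plain functions are hypothetical stand-ins, used only so the sketch runs without downloading a model; they are not part of transformers or nlp.

```python
# Hypothetical stand-in for tokenizer(...); it only mimics the output shape
# (a dict of plain integer lists) and is NOT the real BERT tokenizer.
def toy_tokenizer(text, max_length=64):
    """Give each whitespace-separated word a fake id, then truncate."""
    word_ids = [1000 + len(word) for word in text.split()]
    ids = [101] + word_ids[: max_length - 2] + [102]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def tokenize(example):
    # No return_tensors="pt" here: the values stay plain Python lists,
    # which is the kind of data an Arrow-backed table can store.
    return toy_tokenizer(example["titletext"], max_length=64)


def is_plain(value):
    """Return True if value is built only from ints and (nested) lists."""
    if isinstance(value, list):
        return all(is_plain(item) for item in value)
    return isinstance(value, int)


row = {"titletext": "example headline text", "label": 1}
encoded = tokenize(row)
print(encoded["input_ids"], is_plain(encoded["input_ids"]))
```

With the real tokenizer, the corresponding change would be to call it without return_tensors; since map is applied one example at a time here, padding = 'max_length' may also be needed so that every row has the same length. Both points are assumptions to verify against the installed library versions.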
Here is the rest of my code (feel free to ignore it, as it is not relevant to the exact question). I just figured I would add it in case anyone had any helpful comments:
I have actually been confused about how the labels are specified. From what I see, they are only referenced when using set_format; however, it looks like columns is actually just a list of column names, and I did not see anywhere in the documentation that implied that Trainer specifically looks for certain columns.
# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)

# Set format of Dataset, and specify columns to use
# (columns are "input_ids", "attention mask", "token_type_ids" and "label")
# Do I need attention mask since im not doing two sentences? Do I need token type ids?
train_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])

def compute_metrics(p: EvalPrediction) -> dict():
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

training_args = transformers.TrainingArguments(
    output_dir="./Downloads/tmp/",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=16,
    per_gpu_eval_batch_size=64,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)

trainer = transformers.Trainer(model = model,
                               args = training_args,
                               train_dataset = text_train,
                               eval_dataset = text_test,
                               compute_metrics = compute_metrics)
trainer.train()