Why do I get this error running tokenizer?

I am using the fake news dataset that is used in this Google Colab notebook, with the goal of adapting this example. For full reproducibility, I uploaded the exact files I am using for training and testing to a GitHub repository here.
However, it appeared that some of the classes and methods were deprecated, so I was trying to redo it using this notebook as a guide: IMDb Classification with Trainer.ipynb

I am getting an error after running train_dataset = ds_train.map(tokenize); you will find tokenize defined below along with the rest of the code. I copied and pasted the error message after the code (see the full error in the comment).

In case anyone has further advice or comments, I also added the rest of the code I am planning to run, which you will find after the error message.

Thank you for viewing this post and I appreciate any help you can offer.

from nlp import Dataset
import pandas as pd
from torch import tensor
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, EvalPrediction
import torch

# read csv in pandas
df_train = pd.read_csv("~/Downloads/fakenewstrain.csv")
df_test = pd.read_csv("~/Downloads/fakenewstest.csv")

# convert pandas df (only columns 'titletext' and 'label') to nlp Dataset
ds_train = Dataset.from_pandas(df_train[['titletext','label']])
ds_test = Dataset.from_pandas(df_test[['titletext','label']])

# set up configuration, tokenizer and model
config = AutoConfig.from_pretrained('bert-base-uncased')    
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_config(config)

# function to tokenize a line of text using tokenizer
def tokenize(batch):
    return tokenizer(batch['titletext'], 
                     max_length = 64, 
                     truncation = True,
                     padding = True, 
                     return_tensors = "pt")

# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)

Here is the error I am getting (the full traceback is in a comment below):

ArrowInvalid: Could not convert tensor([[ 101, 8499, 4642, 1106, 5307, 1614, 1166, 1114, 27157, 2101, 1656, 1733, 119, 4613, 117, 2631, 113, 13597, 114, 5554, 8499, 112, 188, 1207, 13715, 176, 12328, 1500, 3215, 1786, 1656, 1733, 1120, 170, 185, 14695, 8037, 1303, 1113, 9170, 1115, 1103, 26961, 1524, 118, 6057, 1110, 1231, 7867, 27885, 1103, 1226, 107, 1115, 1119, 112, 188, 1151, 1773, 107, 1105, 1110, 2407, 102]]) with type Tensor: did not recognize Python value type when inferring an Arrow data type

Here is the rest of my code (feel free to ignore it, as it is not relevant to the exact question); I just figured I would add it in case anyone had any helpful comments.
I have actually been confused about how the labels are specified. From what I see, they are only referenced when using set_format; however, it looks like columns is just a list of column names, and I did not see anything in the documentation implying that Trainer specifically looks for certain columns.

# loop through Dataset using Dataset map function for tokenization
train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)

# Set format of Dataset, and specify columns to use 
# (columns are "input_ids", "attention_mask", "token_type_ids" and "label")
# Do I need attention mask since im not doing two sentences? Do I need token type ids?
train_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'token_type_ids', 'label'])

# requires: import numpy as np, import transformers, and
# from transformers import glue_compute_metrics; data_args comes from the
# original notebook and is not defined in this snippet
def compute_metrics(p: EvalPrediction) -> dict:
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

training_args = transformers.TrainingArguments(
    output_dir="./Downloads/tmp/",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_gpu_train_batch_size=16,
    per_gpu_eval_batch_size=64,
    num_train_epochs=1,
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000,
    evaluate_during_training=True,
)

trainer = transformers.Trainer(model = model,
                  args = training_args,
                  train_dataset = train_dataset,
                  eval_dataset = test_dataset,
                  compute_metrics = compute_metrics)

trainer.train()

Here is the full error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-2-ee0b34ea20c7> in <module>
      1 # loop through Dataset using Dataset map function for tokenization
----> 2 train_dataset = ds_train.map(tokenize)

/opt/anaconda3/lib/python3.8/site-packages/nlp/arrow_dataset.py in map(self, function, with_indices, batched, batch_size, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, verbose)
    942                     example = apply_function_on_filtered_inputs(example, i)
    943                     if update_data:
--> 944                         writer.write(example)
    945             else:
    946                 for i in tqdm(range(0, len(self), batch_size), disable=not verbose):

/opt/anaconda3/lib/python3.8/site-packages/nlp/arrow_writer.py in write(self, example, writer_batch_size)
    175             writer_batch_size = self.writer_batch_size
    176         if writer_batch_size is not None and len(self.current_rows) >= writer_batch_size:
--> 177             self.write_on_file()
    178 
    179     def write_batch(

/opt/anaconda3/lib/python3.8/site-packages/nlp/arrow_writer.py in write_on_file(self)
    139         type = None if self.update_features and self.pa_writer is None else self._type
    140         if self.current_rows:
--> 141             pa_array = pa.array(self.current_rows, type=type)
    142             first_example = pa.array(self.current_rows[0:1], type=type)[0]
    143             # Sanity check

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

/opt/anaconda3/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert tensor([[  101,  8499,  4642,  1106,  5307,  1614,  1166,  1114, 27157,  2101,
          1656,  1733,   119,  4613,   117,  2631,   113, 13597,   114,  5554,
          8499,   112,   188,  1207, 13715,   176, 12328,  1500,  3215,  1786,
          1656,  1733,  1120,   170,   185, 14695,  8037,  1303,  1113,  9170,
          1115,  1103, 26961,  1524,   118,  6057,  1110,  1231,  7867, 27885,
          1103,  1226,   107,  1115,  1119,   112,   188,  1151,  1773,   107,
          1105,  1110,  2407,   102]]) with type Tensor: did not recognize Python value type when inferring an Arrow data type

For the labels, Trainer does nothing. It’s the model that expects an argument named labels to compute the loss. Your preprocessing should create that field if it does not exist. The only thing Trainer does is that the default data_collator will rename label and label_ids to labels.
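
To make that concrete, here is a sketch of creating that field yourself during preprocessing (using the dataset and column names from the code above; the attention_mask column is the one the tokenizer produces by default):

# Sketch: copy the existing 'label' column into a 'labels' field, the name
# the model expects; the default data collator would otherwise do this
# rename for you.
train_dataset = train_dataset.map(lambda example: {'labels': example['label']})
train_dataset.set_format('torch',
                         columns=['input_ids', 'attention_mask', 'labels'])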

(I have asked someone more familiar with nlp to help with your error.)


In nlp 0.4.0 you can't use a map function that returns PyTorch tensors. You can fix it by removing return_tensors="pt" in your tokenize function. This is because the dataset is saved in Arrow format.

It’s only recently that we’ve added support for torch/tf tensor inputs (see changes here), and it will be available in the next release.
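
For reference, the corrected tokenize function might look like this (a sketch; it also swaps padding=True for padding='max_length' so every example comes out the same fixed length, which is a separate tweak and not part of the Arrow fix itself):

# Returning plain Python lists (no return_tensors) lets nlp write the
# output to Arrow; padding='max_length' keeps every row at 64 tokens so
# the examples can later be stacked into batches.
def tokenize(batch):
    return tokenizer(batch['titletext'],
                     max_length=64,
                     truncation=True,
                     padding='max_length')

train_dataset = ds_train.map(tokenize)
test_dataset = ds_test.map(tokenize)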


Thank you very much for your response! That is helpful.

Your config refers to bert-base-uncased, and your model to bert-base-cased - shouldn’t they be the same?

@rgwatwormhill Oops, thanks! You are right. Not sure why I had it like that.
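
For completeness, a consistent setup would look something like this (a sketch; any checkpoint works as long as the config, tokenizer, and model all point to the same one):

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'bert-base-uncased'  # or 'bert-base-cased', as long as all three match

config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# from_config builds a model with freshly initialized weights;
# from_pretrained(checkpoint) would load the pretrained ones instead
model = AutoModelForSequenceClassification.from_config(config)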