Converting Input String to List (or Sequence) of Strings

Hello,

I am working on a Named Entity Recognition project. The data I am working with is the Named Entity Recognition (NER) Corpus | Kaggle.

When I try to map the tokenize_and_align_labels function, I get the following error: ArrowInvalid: Could not convert '[' with type str: tried to convert to int64. I am pretty sure it has to do with all of the columns having a dtype of string.

That is okay for the sentence column, but the two tag columns (POS and Tag) should each hold a list of strings (or maybe a sequence of strings).

How do I convert just those two columns to lists (or sequences) of strings?

Thanks,

Brian

P.S. If you need any additional code to answer, let me know. This is my first post here!

You can convert those two columns to lists in a pre-processing step after loading the CSV file. The function below works fine for me. Each cell holds the string form of a list, so you have to use ast.literal_eval to convert that string back into a list.

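    import ast

    def preprocess_data(df):
        # Each cell is a string such as "['NNS', 'IN']"; parse it back into a list of str.
        # Assigning whole columns avoids chained indexing (df['POS'][i] = ...),
        # which may not write back to the DataFrame.
        df['POS'] = df['POS'].apply(lambda s: [str(word) for word in ast.literal_eval(s)])
        df['Tag'] = df['Tag'].apply(lambda s: [str(word).upper() for word in ast.literal_eval(s)])
        return df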
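
To see what the conversion does to a single cell, here is a minimal sketch that only uses the standard library. The sample string is made up to resemble one cell of the tag column, and parse_tag_cell is a hypothetical helper name, not something from the dataset or from any library.

```python
import ast


def parse_tag_cell(cell, upper=False):
    """Parse a stringified list such as "['O', 'B-geo']" into a list of str."""
    values = ast.literal_eval(cell)
    if not isinstance(values, list):
        raise ValueError(f"expected a list literal, got {type(values).__name__}")
    return [str(v).upper() if upper else str(v) for v in values]


# Made-up sample: the cell is one string, so its first character is '['.
raw = "['O', 'B-geo', 'O']"
print(type(raw).__name__)               # str
print(parse_tag_cell(raw, upper=True))  # ['O', 'B-GEO', 'O']
```

Before parsing, the cell is one long string whose first character is '[', which matches the character quoted in the ArrowInvalid message. After parsing, each cell is a real list of strings.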