I think your problem is that the “label” field in the JSONL is a list of lists whose inner elements are (sometimes) a mix of strings and integers. The datasets library stores columns in Arrow, which needs all values in a nested list to share one type.
For example, the first line of the file has the following label:
[[0, 7, 0], [8, 11, 0], [12, 23, "B-LOC"],...]
This is problematic because it combines numbers with strings like “B-LOC”.
If you really need those numbers you can simply map them to strings, so that your labels are a list of lists of strings. For instance:
import pandas as pd
from datasets import Dataset


def int2str(labels):
    """Convert every element of each inner label list to a string."""
    fixed_labels = []
    for label_array in labels:
        curr_array = []
        for label in label_array:
            if not isinstance(label, str):
                label = str(label)
            curr_array.append(label)
        fixed_labels.append(curr_array)
    return fixed_labels


# Read the JSONL file, one JSON object per line
df = pd.read_json("test_dataset.jsonl", lines=True)
# Normalize the "label" column so that it only contains strings
df["label"] = df["label"].apply(int2str)
dataset = Dataset.from_pandas(df)
print(dataset)
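
If you want to confirm that mixed types are really the cause, or check which rows are affected, something like the following rough sketch may help. It only uses the standard library; the helper name `mixed_type_rows` and the sample records are made up for illustration, and it assumes every record has a “label” field that is a list of lists.

```python
import json


def mixed_type_rows(jsonl_lines, field="label"):
    """Return (line_number, type_names) for rows whose label values mix types.

    Hypothetical helper: `jsonl_lines` is any iterable of JSONL strings,
    for example an open file handle.
    """
    flagged = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Collect the type name of every value in every inner list
        type_names = {type(v).__name__ for inner in record[field] for v in inner}
        if len(type_names) > 1:
            flagged.append((line_number, sorted(type_names)))
    return flagged


# Made-up sample records in the same shape as the label shown above
sample = [
    '{"text": "a", "label": [[0, 7, 0], [12, 23, "B-LOC"]]}',
    '{"text": "b", "label": [["0", "7", "0"]]}',
]
print(mixed_type_rows(sample))  # [(1, ['int', 'str'])]
```

To run it on the real file, pass an open file handle instead of the sample list, e.g. `with open("test_dataset.jsonl") as f: print(mixed_type_rows(f))`. An empty result would suggest the labels are already consistent.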