Problem with loading custom dataset from jsonl file

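I think your problem is that the "label" field in your JSONL is a list of lists whose inner lists (sometimes) mix strings and integers. The datasets library stores data in Apache Arrow, which needs a single type for every value in a column, so it cannot infer a schema for mixed lists like these.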

For example, the first line of the file has the following label: [[0, 7, 0], [8, 11, 0], [12, 23, "B-LOC"],...], which is problematic because it combines numbers with strings like “B-LOC”.
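
To see which lines of the file are affected, you could run a quick check with the standard json module before loading anything. This is only a minimal sketch: the helper name find_mixed_label_rows, the sample records and the "text" field are made up for illustration, and it assumes every record has a "label" key holding a list of lists.

```python
import json

def find_mixed_label_rows(lines):
    """Return the indices of records whose "label" lists mix str with other types."""
    mixed = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Collect the type names of every element across all inner lists
        kinds = {type(item).__name__ for span in record["label"] for item in span}
        if "str" in kinds and len(kinds) > 1:
            mixed.append(i)
    return mixed

# Hypothetical records shaped like the ones in the question
sample = [
    '{"text": "a", "label": [[0, 7, 0], [12, 23, "B-LOC"]]}',
    '{"text": "b", "label": [[0, 3, 0]]}',
]
print(find_mixed_label_rows(sample))
```

On a real file you would pass the open file object instead of the sample list, for example find_mixed_label_rows(open("test_dataset.jsonl")).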

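If you really need those numbers, you can simply map them to strings so that your labels are a list of lists of strings. Note that this turns the first two numbers in each inner list into strings as well, so convert them back with int() wherever you need them as numbers. For instance: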

import pandas as pd
from datasets import Dataset

def int2str(labels):
    """Convert every element of each inner list to a string."""
    # str() leaves existing strings unchanged and converts the integers
    return [[str(label) for label in label_array] for label_array in labels]

# Read the JSONL file, one JSON object per line
df = pd.read_json("test_dataset.jsonl", lines=True)
# Fix the "label" column so every inner list holds only strings
df["label"] = df["label"].apply(int2str)
dataset = Dataset.from_pandas(df)
print(dataset)