Sequence features - Class Label Cast

Hello,

I am having trouble with ClassLabel features for token classification. My data lives in a pandas DataFrame, which I load into a Dataset. After loading, the labels column is a plain string Value, so I cannot see my 9 custom IOB labels inside a ClassLabel.

import pandas as pd
from datasets import Dataset

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1)

Output:
DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 1000
    })
})

Output:
{'tokens': Value(dtype='string', id=None),
 **'labels': Value(dtype='string', id=None),**
 'id': Value(dtype='int64', id=None)}

I already tried the “cast” method → dataset.cast_column(“labels”
from: ClassLabel Error · Issue #5737 · huggingface/datasets · GitHub

And the “new_features” approach in the package reference (Main classes).

@mariosasko ? :innocent:

Thank you guys!

Try this:

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.class_encode_column("labels")
dataset = dataset.train_test_split(test_size=0.1)

Thank you for your help @mariosasko

It correctly creates a ClassLabel(names=
Unfortunately, it treats all the labels of a row as one single class…

The column df["labels"] has many rows like this: ['O,O,B-DRUG,O,B-HOSPITAL,O']

With class_encode_column, my result is a ClassLabel with very long names: ClassLabel(names=['O,O,B-DRUG,O,B-HOSPITAL,O,B-HOSPITAL,O,B-DATE,O,O,O']

Maybe the labels shouldn’t be a Pandas Series?
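
A quick plain-Python sketch of what seems to be going on, using made-up toy rows rather than the real data: when each row is one comma-joined string, every distinct string counts as its own class, whereas splitting on "," first gives one class per IOB tag. This only illustrates the idea and does not call the datasets library.

```python
# Toy rows shaped like the "labels" column described above (values are made up).
rows = ["O,O,B-DRUG,O", "O,B-HOSPITAL,O", "B-DATE,O,O"]

# Encoding the column as-is: every distinct full string becomes one class.
classes_unsplit = sorted(set(rows))

# Splitting each row first: one class per individual IOB tag.
classes_split = sorted({tag for row in rows for tag in row.split(",")})

print(classes_unsplit)
print(classes_split)
```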

Oh, I missed the “Token Classification” part.

Then this should work:

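import datasets
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
# turn each comma-separated label string into a list of labels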
dataset = dataset.map(lambda ex: {"labels": ex["labels"].split(",")})

def get_label_list(labels):
    # copied from https://github.com/huggingface/transformers/blob/66fd3a8d626a32989f4569260db32785c6cbf42a/examples/pytorch/token-classification/run_ner.py#L320
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

all_labels = get_label_list(dataset["labels"])

dataset = dataset.cast_column("labels", datasets.Sequence(datasets.ClassLabel(names=all_labels)))
dataset = dataset.train_test_split(test_size=0.1)
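
To make the effect of the cast easier to picture, here is a minimal plain-Python sketch with hypothetical label rows. A ClassLabel maps each name to its position in the names list, so after the cast every token label is stored as an integer id. The variable names label_rows, label2id and encoded are only for illustration.

```python
# Hypothetical, already-split label rows; the tag names are illustrative only.
label_rows = [["O", "B-DRUG", "O"], ["B-HOSPITAL", "O"]]

# Same idea as get_label_list: the sorted union of all tags.
all_labels = sorted(set().union(*label_rows))

# The position in the sorted list becomes the integer id of the tag.
label2id = {name: i for i, name in enumerate(all_labels)}

# Each row of strings turns into a row of integer ids.
encoded = [[label2id[tag] for tag in row] for row in label_rows]

print(all_labels)
print(encoded)
```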

Perfect! Thank you @mariosasko

Have a good evening!

Sorry to bother you again @mariosasko

I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.

I tried that…

tags = dataset.features[f"labels"].feature.names
print(tags)

def create_tag_names(batch):
    return {"ner_tags_str": [tags.str2int(idx) for idx in batch["labels"]]}
dataset = dataset.map(create_tag_names) 

Any ideas? Thx!

I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.

Can you provide a link to the tutorial? We already store class labels as integers, so I’m not sure I understand what you want to do.

Oops, sorry!

The tutorial is working with the column ner_tags made of numbers mapping to the corresponding labels…
Thx!

You can rename the labels column to ner_tags to have the same structure:

dataset = dataset.rename_column("labels", "ner_tags")

And apply the rest of the processing.
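
If a human-readable copy of the tags is also needed, the integers can be mapped back to their names. Below is a plain-Python sketch with a hypothetical names list and a hypothetical encoded row. In the library itself, ClassLabel offers int2str and str2int for this, and they are methods of the ClassLabel feature, not of its .names list.

```python
# Hypothetical ClassLabel names and one encoded row (values are made up).
names = ["B-DRUG", "B-HOSPITAL", "O"]
ner_tags = [2, 0, 2, 1]

# An integer id is simply an index into the names list.
ner_tags_str = [names[i] for i in ner_tags]

print(ner_tags_str)
```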

@mariosasko Thank you for your help!

I found the solution to my silly bug with “tokens”. Each sentence in the “tokens” column was a single string, so I turned it into a list of tokens by adding the following line:

dataset = dataset.map(lambda ex: {"tokens": ex["tokens"].split(",")})

Thx again! : )