Sequence features - Class Label Cast_

R0m4ntic · June 26, 2023, 6:48pm

Hello,

I am having trouble with the ClassLabel feature for token classification. My data lives in a pandas DataFrame, which I load into a Dataset, but my 9 custom IOB labels do not show up as a ClassLabel.

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1)

Output:
DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'labels', 'id'],
        num_rows: 1000
    })
})

Output:
{'tokens': Value(dtype='string', id=None),
 **'labels': Value(dtype='string', id=None),**
 'id': Value(dtype='int64', id=None)}

I already tried the “cast” method → dataset.cast_column(“labels”
here … ClassLabel Error · Issue #5737 · huggingface/datasets · GitHub

And the “new_features” in the package reference.
here… Main classes

@mariosasko ?

Thank you guys!

mariosasko · June 27, 2023, 2:09pm

Try this:

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.class_encode_column("labels")
dataset = dataset.train_test_split(test_size=0.1)

R0m4ntic · June 28, 2023, 4:42pm

Thank you for your help @mariosasko

It’s correctly creating a ClassLabel(names=
Unfortunately, it’s appending all the labels on each row…

The column df["labels"] has many rows like the following: ['O,O,B-DRUG,O,B-HOSPITAL,O']

With class_encode_column, my result is a very long ClassLabel(names= ['O,O,B-DRUG,O,B-HOSPITAL,O,B-HOSPITAL,O,B-DATE,O,O,O']

Maybe the labels shouldn’t be a Pandas Series?
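
A possible reading of the symptom above, sketched in plain Python: if each cell is one comma-joined string, collecting the distinct cell values gives one "class" per whole row rather than one per IOB tag. This is only an illustration of that idea, not the library's implementation, and the two sample rows are made up.

```python
# Plain-Python illustration only; this is not how the datasets library is implemented.
rows = ["O,O,B-DRUG,O", "O,B-HOSPITAL,O,B-DATE"]  # hypothetical sample cells

# Treating each cell as a single value: every distinct row string becomes a class name.
names_unsplit = sorted(set(rows))

# Splitting each cell first: the class names are the individual IOB tags.
names_split = sorted({tag for row in rows for tag in row.split(",")})

print(names_unsplit)
print(names_split)
```

Splitting each cell into a list of tags first is what makes per-token class names possible.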

mariosasko · June 28, 2023, 6:09pm

Oh, I missed the “Token Classification” part.

Then this should work:

import datasets
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(df)
dataset = Dataset.from_pandas(df)
dataset = dataset.map(lambda ex: {"labels": ex["labels"].split(",")})

def get_label_list(labels):
    # copied from https://github.com/huggingface/transformers/blob/66fd3a8d626a32989f4569260db32785c6cbf42a/examples/pytorch/token-classification/run_ner.py#L320
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

all_labels = get_label_list(dataset["labels"])

dataset = dataset.cast_column("labels", datasets.Sequence(datasets.ClassLabel(names=all_labels)))
dataset = dataset.train_test_split(test_size=0.1)
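
To make the effect of the steps above easier to picture, here is a minimal plain-Python sketch of the same idea: split each label string, collect the sorted unique tags, and store each tag as its index in that list. The helper name `encode_rows` and the sample row are hypothetical, and the code only mimics the result; it does not call the datasets library.

```python
def encode_rows(rows):
    """Split comma-joined label strings and encode each tag as an integer id."""
    split_rows = [row.split(",") for row in rows]
    # Sorted unique tags, analogous to the label list built in the post above.
    names = sorted({tag for tags in split_rows for tag in tags})
    name_to_id = {name: i for i, name in enumerate(names)}
    encoded = [[name_to_id[tag] for tag in tags] for tags in split_rows]
    return names, encoded


# Hypothetical sample row in the same shape as the one described in the thread.
names, encoded = encode_rows(["O,O,B-DRUG,O,B-HOSPITAL,O"])
print(names)
print(encoded)
```

Because the names are sorted, the integer ids are stable as long as the set of tags does not change.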

R0m4ntic · June 28, 2023, 6:58pm

Perfect! Thank you @mariosasko

Have a good evening!

R0m4ntic · July 4, 2023, 12:16pm

Sorry to bother you again @mariosasko

I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.

I tried that…

tags = dataset.features[f"labels"].feature.names
print(tags)

def create_tag_names(batch):
    return {"ner_tags_str": [tags.str2int(idx) for idx in batch["labels"]]}
dataset = dataset.map(create_tag_names)

Any ideas? Thx!

mariosasko · July 4, 2023, 1:04pm

I am trying to create a “ner_tags” column with integers corresponding to “labels” to follow the HF tutorial on Token Classification.

Can you provide a link to the tutorial? We already store class labels as integers, so I’m not sure I understand what you want to do.

R0m4ntic · July 4, 2023, 1:18pm

Oops, sorry!

The tutorial is working with the column ner_tags made of numbers mapping to the corresponding labels…
Thx!

mariosasko · July 4, 2023, 2:12pm

You can rename the labels column to ner_tags to have the same structure:

dataset = dataset.rename_column("labels", "ner_tags")

And apply the rest of the processing.
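
A side note on the earlier attempt, as a hedged sketch: `tags` there is a plain Python list of names, and a list has no `str2int` method (in datasets, `str2int` and `int2str` belong to the ClassLabel feature itself). Since the column already holds integer ids, going between ids and names is just a lookup in the names list. The helper names and sample values below are hypothetical.

```python
names = ["B-DRUG", "B-HOSPITAL", "O"]  # hypothetical stand-in for the `tags` list
ner_tags = [2, 2, 0, 2, 1, 2]  # hypothetical row, already stored as integer ids


def ids_to_names(ids, names):
    """Decode integer ids back to label strings."""
    return [names[i] for i in ids]


def names_to_ids(labels, names):
    """Encode label strings as integer ids using their position in `names`."""
    lookup = {name: i for i, name in enumerate(names)}
    return [lookup[label] for label in labels]


decoded = ids_to_names(ner_tags, names)
print(decoded)
print(names_to_ids(decoded, names) == ner_tags)
```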

R0m4ntic · July 4, 2023, 3:23pm

@mariosasko Thank you for your help!

I found the fix for my silly bug in the "tokens" column: I turned each sentence string into a list of tokens by adding the following line:

dataset = dataset.map(lambda ex: {"tokens": ex["tokens"].split(",")})

Thx again! : )
