Does a tokenizer keep the mapping between my labels and their encodings?

And if not, where can I keep it? Do I have to maintain the mapping myself, or is it handled automatically somewhere?
I have a multi-class use case, and so far I’ve used scikit-learn’s LabelEncoder/LabelBinarizer classes. Is there a better way?

Thanks

If you’re asking this for token classification, you can pass already-tokenized text to the tokenizer and set the parameter is_split_into_words=True. The resulting encoding exposes word_ids(), which maps each token produced by the tokenizer back to the word it came from in your original tokenization.
In this script for token classification there is a tokenize_and_align_labels function that I think does what you need; a sketch of the idea follows.
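Here is a minimal sketch of the alignment idea, assuming a fast tokenizer (word_ids() requires one) and hypothetical per-word label ids:

```python
from transformers import AutoTokenizer

# Model name is only an example; any fast tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Text that is already split into words, with one label id per word.
words = ["Hugging", "Face", "is", "in", "New", "York"]
word_labels = [3, 3, 0, 0, 5, 6]  # hypothetical label ids

encoding = tokenizer(words, is_split_into_words=True)

# word_ids() maps each subword token back to the index of the word
# it came from (None for special tokens like [CLS]/[SEP]).
aligned_labels = []
for word_id in encoding.word_ids():
    if word_id is None:
        aligned_labels.append(-100)  # -100 is ignored by the loss
    else:
        aligned_labels.append(word_labels[word_id])

print(encoding.tokens())
print(aligned_labels)
```

This is the same pattern tokenize_and_align_labels uses: labels are repeated for every subword of a word (or masked with -100, depending on the strategy you choose).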

I see, thank you, but actually I was referring more to the prediction labels. Maybe it’s unrelated to tokenizers?
Say I have 3 labels, with the simple mapping from [0, 1, 2] to [“label1”, “label2”, “label3”]. I assume there is a place where this mapping is saved automatically, but maybe I’m wrong?

I’m sorry, I didn’t get it! In the same script you can find “id2label” and “label2id”, which are set manually and passed to the model’s config. These are dictionaries mapping label ids to labels and vice versa. Check whether your model’s config already has these attributes set; if not, build the mapping yourself and add it, for example as sketched below.
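A minimal sketch of setting the mapping, assuming three classes (the label names and model name are just examples):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Hypothetical label mapping for a 3-class problem.
id2label = {0: "label1", 1: "label2", 2: "label3"}
label2id = {label: idx for idx, label in id2label.items()}

config = AutoConfig.from_pretrained(
    "bert-base-cased",
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", config=config
)

# The mapping lives in the config, which is saved alongside the model,
# so save_pretrained()/from_pretrained() round-trips it automatically.
print(model.config.id2label[1])  # -> "label2"
```

Because the mapping is part of the config, anything that loads the model later (including pipelines) can translate predicted ids back to your label names without a separate LabelEncoder.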