Token alignment for word-level tasks

Hey guys,

I want to build a POS tagger on top of BERT. The dataset is in CoNLL-U format [1]: the input sentences are already tokenized, and each input word is mapped to labels (POS tags etc.). I therefore have to take special care of input/output alignment, because BERT adds extra subword tokens during tokenization, similar to what is described in the original BERT repo [2].
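To make the misalignment concrete: a single pre-tokenized word can come back as several WordPiece subwords, so the label sequence no longer lines up one-to-one with the model input. A quick illustration (the exact split depends on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Tokenization"))  # something like ['Token', '##ization']
```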

So I took the alignment algorithm outlined in [2]: I tokenized each input word and expanded its label across the resulting BERT subwords. I then needed to pad each sentence, attach an attention_mask, convert everything to tensors, etc. Looking at the Tokenizer API, I didn't see an easy way to do this once the sentences and labels had already been converted to token ids in the first step.
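For concreteness, here is a minimal sketch of that manual pipeline. It assumes a plain transformers BertTokenizer; `encode_tagged_sentence`, `label_map`, `max_len`, and `pad_label_id` are my own illustrative names, not anything from the library:

```python
from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def encode_tagged_sentence(words, labels, label_map, max_len=128, pad_label_id=-100):
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenizer.tokenize(word)
        if not word_tokens:  # rare characters can tokenize to nothing
            continue
        tokens.extend(word_tokens)
        # expand the word's label to every subword it produced
        label_ids.extend([label_map[label]] * len(word_tokens))

    # truncate, then add BERT's special tokens
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    label_ids = [pad_label_id] + label_ids[: max_len - 2] + [pad_label_id]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    attention_mask = [1] * len(input_ids)

    # pad everything out to max_len; -100 is ignored by PyTorch's CrossEntropyLoss
    pad_len = max_len - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad_len
    attention_mask += [0] * pad_len
    label_ids += [pad_label_id] * pad_len

    return torch.tensor(input_ids), torch.tensor(attention_mask), torch.tensor(label_ids)
```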

This whole exercise left me wondering: is there a simpler, less verbose approach to these word-level tasks using the Tokenizer API?

Cheers,
Vladimir

[1] https://universaldependencies.org/format#conll-u-format
[2] https://github.com/google-research/bert#tokenization

I found an answer in utils_ner.py. See the function convert_examples_to_features; it does what I was doing, but in a more general, model-agnostic way.
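For anyone else landing here, the heart of that function is a per-word alignment loop roughly like the one below (simplified from the example script; the real code also handles special tokens, truncation, and model-specific padding):

```python
def align_labels(words, labels, tokenizer, label_map, pad_token_label_id=-100):
    """Per-word loop simplified from convert_examples_to_features in utils_ner.py."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenizer.tokenize(word)
        if word_tokens:  # guard against words that tokenize to nothing
            tokens.extend(word_tokens)
            # label only the first subword; -100 masks the rest out of the loss
            label_ids.extend(
                [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
            )
    return tokens, label_ids
```

One nice detail: instead of copying the label to every subword, it labels only the first subword and uses the loss's ignore index for the rest, so the model is never trained to predict tags for continuation word pieces.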

Cheers,
Vladimir