Token alignment for word-level tasks

Hey guys,

I want to build a POS tagger on top of BERT. The dataset is in CoNLL-U format [1]: the input sentences are already tokenized, and each input word is mapped to labels (POS tags etc.). I therefore have to take special care of input/output alignment, because BERT adds extra subword tokens during tokenization, similar to what is described in the original BERT repo [2].
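To make the misalignment concrete: a single pre-tokenized word can come back as several WordPiece subwords, so the label sequence no longer lines up one-to-one with the model input. A quick illustration (the exact split depends on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Tokenization"))  # something like ['Token', '##ization']
```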

So I took the alignment algorithm outlined in [2]: I tokenized each input word and expanded its label across the resulting BERT subwords. I then needed to pad each sentence, attach an attention_mask, convert everything to tensors, etc. Looking at the Tokenizer API, I didn't see an easy way to do this once the sentences and labels had already been converted to token ids in the first step.
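For concreteness, here is a minimal sketch of that manual pipeline. It assumes a plain transformers BertTokenizer; `encode_tagged_sentence`, `label_map`, `max_len`, and `pad_label_id` are my own illustrative names, not anything from the library:

```python
from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def encode_tagged_sentence(words, labels, label_map, max_len=128, pad_label_id=-100):
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenizer.tokenize(word)
        if not word_tokens:  # rare characters can tokenize to nothing
            continue
        tokens.extend(word_tokens)
        # expand the word's label to every subword it produced
        label_ids.extend([label_map[label]] * len(word_tokens))

    # truncate, then add BERT's special tokens
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    label_ids = [pad_label_id] + label_ids[: max_len - 2] + [pad_label_id]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    attention_mask = [1] * len(input_ids)

    # pad everything out to max_len; -100 is ignored by PyTorch's CrossEntropyLoss
    pad_len = max_len - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad_len
    attention_mask += [0] * pad_len
    label_ids += [pad_label_id] * pad_len

    return torch.tensor(input_ids), torch.tensor(attention_mask), torch.tensor(label_ids)
```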

This whole exercise left me wondering: is there a simpler, less verbose approach to these word-level tasks using the Tokenizer API?

Cheers,
Vladimir

[1] https://universaldependencies.org/format#conll-u-format
[2] https://github.com/google-research/bert#tokenization

I found an answer in utils_ner.py. See the function convert_examples_to_features; it does what I was doing, but in a more general, model-agnostic way.
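For anyone else landing here, the heart of that function is a per-word alignment loop roughly like the one below (simplified from the example script; the real code also handles special tokens, truncation, and model-specific padding):

```python
def align_labels(words, labels, tokenizer, label_map, pad_token_label_id=-100):
    """Per-word loop simplified from convert_examples_to_features in utils_ner.py."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenizer.tokenize(word)
        if word_tokens:  # guard against words that tokenize to nothing
            tokens.extend(word_tokens)
            # label only the first subword; -100 masks the rest out of the loss
            label_ids.extend(
                [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
            )
    return tokens, label_ids
```

One nice detail: instead of copying the label to every subword, it labels only the first subword and uses the loss's ignore index for the rest, so the model is never trained to predict tags for continuation word pieces.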

Cheers,
Vladimir