I want to build a POS tagger on top of BERT. The dataset is in CoNLL-U format; input sentences are already tokenized, and each input word token is mapped to a label (POS tag etc.). Therefore, I have to take special care of input/output alignment, since BERT adds additional subword tokens during tokenization, similar to what is described in the original BERT repo.
So I took the alignment algorithm outlined there: I tokenized each input word and expanded the labels so that every resulting subword piece carries the label of its source word. I then needed to pad each sentence, attach an attention_mask, convert to tensors, etc. Looking at the Tokenizer API, I didn't see how to do this easily when the sentences and labels have already been converted to token ids in the first step.
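For reference, the label-expansion step I mean can be sketched as below. This is just a toy illustration: `toy_subwords` is a hypothetical stand-in that imitates WordPiece splitting, not the real BERT tokenizer, and `align_labels` is my own helper name, not a library function.

```python
def toy_subwords(word):
    """Fake WordPiece stand-in: split words longer than 4 chars into two
    pieces, prefixing the continuation with '##' like BERT does."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]

def align_labels(words, labels, tokenize=toy_subwords):
    """Tokenize each word and repeat its label once per subword piece,
    so tokens and labels stay aligned one-to-one."""
    all_tokens, all_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        all_tokens.extend(pieces)
        all_labels.extend([label] * len(pieces))
    return all_tokens, all_labels

tokens, labels = align_labels(["The", "jumping", "fox"],
                              ["DET", "VERB", "NOUN"])
print(tokens)  # ['The', 'jump', '##ing', 'fox']
print(labels)  # ['DET', 'VERB', 'VERB', 'NOUN']
```

After this step I still had to handle padding, attention masks, and tensor conversion myself, which is the verbose part I'd like to avoid.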
This whole exercise left me wondering: is there a simpler, less verbose way to handle these word-level tasks with the Tokenizer API?