For example, using BERT in a token classification task, I get something like this …
[('Darüber', 17), ('hinaus', 17), ('fanden', 17), ('die', 17), ('Er', 17), ('##mitt', -100), ('##ler', -100), ('eine', 17), ('Ver', 17), ('##legung', -100), ('##sli', -100), ('##ste', -100), (',', 17), ('die', 17), ('bestätigt', 17), (',', 17), ('dass', 17), ('Dem', 8), ('##jan', -100), ('##juk', -100), ('am', 17), ('27', 17), ('.', -100), ('März', 17), ('1943', 17), ('an', 17), ('die', 17), ('Dienst', 17), ('##stelle', -100), ('So', 0), ('##bi', -100), ('##bor', -100), ('ab', 17), ('##kom', -100), ('##mand', -100), ('##iert', -100), ('wurde', 17), ('.', -100)]
⦠in the format of (sub-token, label id).
Is there a way I can automatically know that '##mitt' and '##ler' belong to the same word as 'Er' (together forming the word 'Ermittler') that would work across all tokenizers (not just BERT)?
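
In other words, I'd like to recover something like the grouping below regardless of the underlying tokenizer (and regardless of its continuation marker, since '##' is BERT-specific). `group_subtokens` is a made-up name, only there to show the shape of the result I'm after:

```python
# hypothetical helper, illustrative only
words = group_subtokens(tokens)
# [['Darüber'], ['hinaus'], ['fanden'], ['die'],
#  ['Er', '##mitt', '##ler'],              # -> "Ermittler"
#  ['eine'],
#  ['Ver', '##legung', '##sli', '##ste'],  # -> "Verlegungsliste"
#  ...]
```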