The last layer of a BertForTokenClassification model is a linear classifier that assigns a tag to each input token. The tags are predicted independently: the tag for the current token does not depend on the tag predicted for the previous token, and contextual information only propagates through the encoder's self-attention. As a result, IOB tags can come out of order, which is what I observed.
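To make that independence concrete, here is a minimal sketch using the Hugging Face transformers API (the checkpoint name and the 3-label head are placeholders for illustration; substitute your fine-tuned model). Decoding is a per-token argmax over the logits, with no transition constraints between adjacent positions:

```python
import torch
from transformers import AutoTokenizer, BertForTokenClassification

# Placeholder checkpoint; a freshly initialized 3-label head is used just to
# show the shapes involved, not to produce meaningful predictions.
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name, num_labels=3)

inputs = tokenizer("I went to the beautiful country", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, num_labels)

# Each position is argmaxed on its own; nothing here forbids an I- tag
# from appearing without a preceding B- tag.
pred_ids = logits.argmax(dim=-1)
```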
For example,
["I", "went", "to", "the", "beautiful", "country"] can potentially receive the tags ["O", "O", "O", "O", "I-LOCATION", "I-LOCATION"] instead of ["O", "O", "O", "O", "B-LOCATION", "I-LOCATION"].
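For what it's worth, invalid sequences like the one above can be patched after the fact. This is only an illustrative sketch (the helper `repair_iob` is mine, not from any library) that rewrites a stray `I-X` with no matching predecessor into `B-X`:

```python
def repair_iob(tags):
    """Rewrite any I-X not preceded by B-X or I-X of the same type to B-X."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                tag = f"B-{entity}"
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_iob(["O", "O", "O", "O", "I-LOCATION", "I-LOCATION"]))
# -> ["O", "O", "O", "O", "B-LOCATION", "I-LOCATION"]
```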
Is it an unwritten standard not to use IOB tags for token classification? Or is this not expected to produce out-of-order tags, meaning I may have misconfigured a component?
I am starting to think that I should move away from IOB tags.