(feat Tokenizers): How to make models aware of structuring linebreaks?

Hi everyone!
I’ll try to briefly explain the task I am trying to solve.

In short: I would like my model to take newline markers in my text samples into account, because I believe they are highly informative in my case. Using the method Transformers provides for inserting new custom special tokens decreased performance.

Long story:

I have a bunch of official documents (IDs, driving licenses, etc.) from which I extract the textual content with OCR models. I want to see whether I can train a unified NER/TokenClassification pipeline to parse different entities (lastname, firstname, birthdate, birthplace, etc.) that would generalize well to a wide variety of document kinds and countries.
Methods based on regex patterns and/or computer vision work, but they can hardly be agnostic to the kind or language of the document, and they also suffer from OCR errors and artefacts.

For clarity’s sake, here’s what a basic sample could look like (fictitious data):

* * * PERMIS DE CONDUIRE RÉPUBLIQUE FRANÇAISE F *
1. DOE
2. JOHN
URQA 3.
01.01.1901 (UTOPIC CITY)
CONO
4a 01.02.2003
4b.01.02.2018
FRANCAIS
**

I annotated my samples with the BIO scheme and trained a TokenClassification model based on google/electra-large-discriminator. It worked like a charm, reaching high validation metrics! :blush:

However, I noticed the model would sometimes fail because \n newline markers are dropped during tokenization and only used as a splitting criterion.
For instance, in the example above, JOHN\nURQA 3. would be split into something like ['JOHN', 'URQA', '3', '.'], making it ‘invisible’ to the model that there was actually a newline separator between the tokens.
But it is precisely this newline indicator that makes it easy to see that the actual entity is JOHN and that the OCR artefact URQA does not belong to the name.
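
To make the problem concrete, here is a minimal sketch in plain Python. It is only a rough stand-in for the whitespace-and-punctuation splitting step of a BERT-style tokenizer, not the actual ELECTRA tokenizer, and the helper name basic_pretokenize is made up for this illustration.

```python
import re
from typing import List


def basic_pretokenize(text: str) -> List[str]:
    """Split on any whitespace and isolate punctuation marks.

    Hypothetical helper: a simplified approximation of BERT-style
    pre-tokenization, used here only to show what gets lost.
    """
    # \w+ grabs runs of letters/digits; [^\w\s] grabs each punctuation mark alone.
    return re.findall(r"\w+|[^\w\s]", text)


with_newline = "JOHN\nURQA 3."
with_space = "JOHN URQA 3."

print(basic_pretokenize(with_newline))  # ['JOHN', 'URQA', '3', '.']
print(basic_pretokenize(with_space))    # ['JOHN', 'URQA', '3', '.']
```

Both inputs give the same token list, so whatever the newline was signalling is gone before the model sees anything.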

So I decided to preprocess my samples, replacing \n markers with [NL] and adjusting my BIO annotations accordingly (I don’t suspect a bug in this part).
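
For reference, the preprocessing looks roughly like the sketch below. The data layout (one list of (word, tag) pairs per OCR line), the tag names and the function name are simplified assumptions for this post, not my exact code. Each inserted [NL] gets the label O.

```python
from typing import List, Tuple

NL_TOKEN = "[NL]"


def insert_newline_markers(
    lines: List[List[Tuple[str, str]]],
) -> Tuple[List[str], List[str]]:
    """Flatten per-line (word, tag) pairs, putting [NL] between lines.

    Hypothetical helper: the input layout is an assumption made
    for illustration.
    """
    words: List[str] = []
    tags: List[str] = []
    for i, line in enumerate(lines):
        if i > 0:
            # One marker per original line break, labelled as "no entity".
            words.append(NL_TOKEN)
            tags.append("O")
        for word, tag in line:
            words.append(word)
            tags.append(tag)
    return words, tags


# Fictitious sample with made-up tag names.
sample = [
    [("1.", "O"), ("DOE", "B-LASTNAME")],
    [("2.", "O"), ("JOHN", "B-FIRSTNAME")],
    [("URQA", "O"), ("3.", "O")],
]
words, tags = insert_newline_markers(sample)
print(words)  # ['1.', 'DOE', '[NL]', '2.', 'JOHN', '[NL]', 'URQA', '3.']
print(tags)   # ['O', 'B-LASTNAME', 'O', 'O', 'B-FIRSTNAME', 'O', 'O', 'O']
```

Words and tags stay aligned one-to-one, so the existing BIO annotations only shift position and are never relabelled.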

Then I loaded the pretrained model for fine-tuning like this:

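from transformers import AutoModelForTokenClassification, AutoTokenizer

CHECKPOINT = 'google/electra-large-discriminator'

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# tag2id (defined elsewhere) maps each BIO tag to an integer id
model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT, num_labels=len(tag2id))

# Register [NL] as a special token so it is never split, then add an embedding row for it
tokenizer.add_special_tokens({'additional_special_tokens': ['[NL]']})
model.resize_token_embeddings(len(tokenizer))
# Reset the padding index on the resized embedding layer (0 is the [PAD] id here)
model.get_input_embeddings().padding_idx = 0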

After this, tokenization went as expected: the [NL] tokens are not split and they are assigned a new token id.
Also, the embedding matrix weights are unchanged after resize_token_embeddings, except of course for one new row that has been appended and initialized (for the special [NL] token).
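
The check on the embedding matrix can be reproduced along these lines. To avoid downloading the large checkpoint, this sketch uses a tiny, randomly initialised ELECTRA model whose sizes are arbitrary assumptions, so it only illustrates the behaviour of resize_token_embeddings and not my actual setup.

```python
import torch
from transformers import ElectraConfig, ElectraForTokenClassification

# Tiny random model (hypothetical sizes) so that nothing is downloaded.
config = ElectraConfig(
    vocab_size=100,
    embedding_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=5,
)
model = ElectraForTokenClassification(config)

# Snapshot the input embeddings before resizing.
old_weights = model.get_input_embeddings().weight.detach().clone()
old_vocab_size = old_weights.shape[0]

# Add room for one extra token, as for [NL].
model.resize_token_embeddings(old_vocab_size + 1)
new_weights = model.get_input_embeddings().weight.detach()

# The first old_vocab_size rows should be copied over untouched.
rows_preserved = torch.equal(new_weights[:old_vocab_size], old_weights)
print(new_weights.shape[0], rows_preserved)
```

Only the appended row is new, so the pretrained embeddings themselves are carried over by the resize.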

So I thought my model would now be aware of “newline splits” and behave even better. But performance actually degraded clearly, which surprised me, as newline breaks seem to be highly informative in this case.

Here are my questions :smile:

  • Is this the correct way to handle custom token insertions?
  • By doing this, do I lose the pretrained weight/bias information, which might explain why performance went down?
  • What would be a good way to make the model aware of newline markers?

Thank you for reading!

(EDIT:

  • The new [NL] special tokens are assigned the label O, “no entity”.
  • When I said performance decreased, I meant more precisely that the model could no longer predict anything other than the O label, i.e. it did not detect any entity. I will check whether it is able to overfit a small set of samples; I haven’t tested this yet.)