I’m currently working on a model that should perform extractive summarization, flatten the output, and make it conform to a style guide. BART for conditional generation has delivered very good results so far. The main remaining issue seems to be the tokenization of entities with unusual names, mainly companies. For instance, `Boeing` confuses the model because the tokenizer splits it into `Boe` + `ing`.
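For illustration, the split can be inspected directly with the tokenizer (the checkpoint name below is a placeholder; the exact pieces depend on that checkpoint's BPE vocabulary):

```python
from transformers import BartTokenizer

# Placeholder checkpoint; use whichever one the model is based on.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

# Shows how an entity name is broken into subword pieces. Note that
# BART's BPE is whitespace-sensitive, so a word tokenizes differently
# at the start of a string than after a space.
print(tokenizer.tokenize("The Boeing plane landed."))
```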
I see two main options:
1. Unsupervised fine-tuning of the BERT-style encoder on the kind of text that needs to be summarized. See also this post.
2. A NER preprocessing step that places markers around the entities (e.g. `Boeing` to `<ent>Boeing</ent>`) so that the model can see that these entities never change.
I’m not sure how option 1 would be done, technically speaking.
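My rough understanding is that it could look something like continued denoising training of the full BART model on in-domain text, along these lines (a minimal sketch; the checkpoint name, masking rate, and example sentence are placeholders, and the masking function is only a crude stand-in for the span-infilling noise used in pretraining):

```python
import random
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")  # placeholder
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def corrupt(text, mask_prob=0.15):
    # Crude stand-in for BART's span-infilling noise: replace a random
    # subset of whitespace-separated words with the <mask> token.
    words = text.split()
    noisy = [tokenizer.mask_token if random.random() < mask_prob else w
             for w in words]
    return " ".join(noisy)

text = "Boeing announced a new partnership today."  # placeholder domain sentence
inputs = tokenizer(corrupt(text), return_tensors="pt")
labels = tokenizer(text, return_tensors="pt").input_ids

# One denoising step: reconstruct the clean sentence from the corrupted one.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```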
On the other hand, option 2 seems viable. Has anyone done something similar in the past? What would be a smart way to wrap the entities?
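To make the question concrete, here is the kind of wrapping I have in mind, assuming spaCy for NER and hypothetical `<ent>`/`</ent>` markers registered as special tokens so the tokenizer never splits them (requires `en_core_web_sm` to be downloaded):

```python
import spacy
from transformers import BartTokenizer, BartForConditionalGeneration

nlp = spacy.load("en_core_web_sm")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")  # placeholder
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Register the marker tokens so BPE treats them as atomic units,
# then grow the embedding matrix to cover the new vocabulary entries.
tokenizer.add_tokens(["<ent>", "</ent>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def wrap_entities(text):
    """Wrap ORG entities found by spaCy in <ent> ... </ent> markers."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ == "ORG":
            out.append(text[last:ent.start_char])
            out.append(f"<ent>{ent.text}</ent>")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(wrap_entities("Boeing reported record deliveries this quarter."))
```

Presumably the same markers would have to be applied to the training targets as well, and stripped from the generated summaries afterwards.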
If someone has a better idea, I’d be very grateful to hear it, too!