I’m trying to classify Yelp-like reviews against a personalized set of topics that can change over time. Zero-shot classification seems like a good fit for this task, and I’ve started working with bart-large-mnli, the model commonly recommended for this use case.
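For context, this is roughly my setup (the labels here are illustrative, not my real topic set):

```python
from transformers import pipeline

# Zero-shot pipeline backed by facebook/bart-large-mnli.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

review = "Great collaboration between the kitchen and front-of-house staff."
# multi_label=True scores each topic independently, since a review
# can touch on several topics at once.
result = classifier(review,
                    candidate_labels=["covid", "food", "service"],
                    multi_label=True)
print(result["labels"], result["scores"])
```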
However, as I’ve gotten a feel for bart-large-mnli’s performance, I’ve realized that it struggles with topic labels that the tokenizer splits into several generic subwords. For example, when trying to classify reviews as related to “covid”, which bart-large’s tokenizer splits into [‘cov’, ‘id’], I get many false positives, such as reviews that focus on “collaboration,” “cooperation”, “competition”, or other words that begin with similar subwords.
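You can see the split directly from the tokenizer (bart-large-mnli uses bart-large’s vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

# "covid" postdates the tokenizer's training data, so it is not a
# single vocabulary entry and gets broken into generic BPE pieces.
pieces = tok.tokenize("covid")
print(pieces)
```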
I understand that subword tokenization is a general weakness in NLP; however, it is especially pronounced in this form of zero-shot classification, because the candidate label itself is inserted into the NLI hypothesis, and thus it can’t be ignored.
I wanted to get the community’s opinion on how to address this issue. Which of the following high-level options would you choose?
- Extend the vocabulary of bart-large for very popular topics (like the aforementioned ‘covid’) and conduct additional pretraining for the new embeddings
- Choose another model with a larger vocabulary of tokens
- Some other approach
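To make the first option concrete, this is roughly the mechanical part I have in mind; the hard part would be the additional pretraining, since the new embedding row starts out randomly initialized:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli")

# Register "covid" as a single token so it no longer shares subword
# pieces with unrelated words like "collaboration".
num_added = tok.add_tokens(["covid"])

# Grow the embedding matrix to cover the enlarged vocabulary.
# The row added for "covid" is randomly initialized, which is why
# further pretraining would be needed before it is useful.
model.resize_token_embeddings(len(tok))
```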
I greatly appreciate any kind of feedback you can provide.