I’m trying to classify Yelp-like reviews against a personalized set of topics that can change over time. Zero-shot classification seems like a good fit for this task, and I’ve started working with bart-large-mnli, the model commonly recommended for this use case.
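For context, this is roughly my setup (the labels here are illustrative, not my real topic set):

```python
from transformers import pipeline

# Zero-shot pipeline backed by facebook/bart-large-mnli.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

review = "Great collaboration between the kitchen and front-of-house staff."
# multi_label=True scores each topic independently, since a review
# can touch on several topics at once.
result = classifier(review,
                    candidate_labels=["covid", "food", "service"],
                    multi_label=True)
print(result["labels"], result["scores"])
```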
However, as I’ve gotten a feel for bart-large-mnli’s performance, I’ve realized that it struggles with topic labels that the tokenizer splits into several generic subwords. For example, when trying to classify reviews as related to “covid”, which bart-large’s tokenizer splits into [‘cov’, ‘id’], I get many false positives, such as reviews that focus on “collaboration,” “cooperation”, “competition”, or other words that begin with similar subwords.
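You can see the split directly from the tokenizer (bart-large-mnli uses bart-large’s vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

# "covid" postdates the tokenizer's training data, so it is not a
# single vocabulary entry and gets broken into generic BPE pieces.
pieces = tok.tokenize("covid")
print(pieces)
```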
I understand that subword tokenization is a general weakness in NLP; however, it is especially pronounced in this form of zero-shot classification, because the candidate label itself is inserted into the NLI hypothesis, and thus it can’t be ignored.
I wanted to get the community’s opinion on how to address this issue. Which of the following high-level options would you choose?
- Extend the vocabulary of bart-large for very popular topics (like the aforementioned ‘covid’) and conduct additional pretraining for the new embeddings
- Choose another model with a larger vocabulary of tokens
- Some other approach
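To make the first option concrete, this is roughly the mechanical part I have in mind; the hard part would be the additional pretraining, since the new embedding row starts out randomly initialized:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli")

# Register "covid" as a single token so it no longer shares subword
# pieces with unrelated words like "collaboration".
num_added = tok.add_tokens(["covid"])

# Grow the embedding matrix to cover the enlarged vocabulary.
# The row added for "covid" is randomly initialized, which is why
# further pretraining would be needed before it is useful.
model.resize_token_embeddings(len(tok))
```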
I greatly appreciate any kind of feedback you can provide.