From what I can tell, you might have wandered into some kind of strange maze:
For your case, the main problem is not that embeddings are weak. It is that you are asking one similarity score to do three different jobs at once:
- find roughly related intents,
- separate very close intents, and
- reject intents that are semantically nearby but still wrong.
Recent work on few-shot text classification and intent classification keeps finding the same pattern: label semantics matter, richer class descriptions help, hard-negative boundaries matter, and performance drops when overlapping intents are added without explicitly arbitrating the collision. (ACL Anthology)
Why your current setup is failing
Dense sentence embeddings are good at broad semantic grouping. They are much worse when the true distinction is carried by a small number of functional cues such as:
- the main action,
- the target object,
- the required state,
- the relation between entities,
- or an exclusion condition.
That is exactly why papers on label semantics and complex class descriptions report gains from using richer class representations rather than short labels or plain sentence similarity. The “complex class descriptions” paper is especially relevant here because it reframes classification as matching examples to class descriptions, and reports strong few-shot gains over baselines when classes are too complex to be represented by short names alone. (ACL Anthology)
There is also a second failure mode: your model probably has weak negative boundaries. The hard-negative OOS paper shows that intent classifiers struggle much more on out-of-scope inputs that are semantically close to in-scope intents than on generic OOS inputs, and that adding such hard negatives improves robustness. In other words, the model often does not really know what a class excludes. (ACL Anthology)
A third failure mode is structural. Some confusions are not really model mistakes. They are taxonomy collisions. Redwood was built around this exact idea and shows that performance suffers when colliding intents are added without arbitration. That matters for your problem because some of your “overlap” may come from the ontology itself rather than the encoder. (ACL Anthology)
The most useful mental model
Do not think of this as “intent = one embedding.”
Think of it as:
- candidate generation,
- pairwise disambiguation,
- accept / reject decision.
Sentence Transformers’ retrieve-and-rerank guidance matches this well: use a fast first stage to retrieve plausible candidates, then a stronger reranker for precision. That architecture is a much better fit for your problem than a single cosine score. (Sbert)
What I would build instead
1. Replace label strings with intent cards
Each intent should be represented as a structured object, not just a title and a few paraphrases.
A good intent card should contain:
- a positive definition,
- the core action,
- the target object,
- required conditions,
- exclusions,
- nearest confusable intents,
- a handful of positive examples,
- and a handful of negative examples.
Example shape:
Intent: cancel_subscription
Positive definition:
The user wants to end an existing subscription or stop future renewals.
Core action:
cancel / stop / terminate
Target object:
subscription / membership / recurring plan
Required conditions:
An existing active or recurring service is implied.
Exclusions:
- canceling a one-time order
- pausing temporarily
- changing the plan tier
- asking about pricing only
Confusable intents:
pause_subscription
change_plan
cancel_order
Positive examples:
"Stop my membership."
"I want to cancel the premium plan."
Negative examples:
"I only want to pause it."
"How much does the premium plan cost?"
"I need to cancel yesterday's order."
This is not just good documentation. It is a better machine-readable class representation, and it is supported by the literature on label semantics and complex class descriptions. (ACL Anthology)
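As a rough sketch of what "machine-readable" could mean here, the card above can be held in a small dataclass and flattened to text for a later matching stage. The class name `IntentCard`, its field names, and the `to_text` helper are my own illustrative choices, not something taken from the cited papers.

```python
from dataclasses import dataclass, field


@dataclass
class IntentCard:
    """One intent as a structured object. Field names are illustrative."""
    name: str
    definition: str
    core_actions: list[str]
    target_objects: list[str]
    required_conditions: str
    exclusions: list[str] = field(default_factory=list)
    confusable_intents: list[str] = field(default_factory=list)
    positive_examples: list[str] = field(default_factory=list)
    negative_examples: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Flatten the descriptive fields into plain text a reranker could read."""
        return "\n".join([
            f"Intent: {self.name}",
            f"Positive definition: {self.definition}",
            "Core action: " + " / ".join(self.core_actions),
            "Target object: " + " / ".join(self.target_objects),
            f"Required conditions: {self.required_conditions}",
            "Exclusions: " + "; ".join(self.exclusions),
        ])


card = IntentCard(
    name="cancel_subscription",
    definition="The user wants to end an existing subscription or stop future renewals.",
    core_actions=["cancel", "stop", "terminate"],
    target_objects=["subscription", "membership", "recurring plan"],
    required_conditions="An existing active or recurring service is implied.",
    exclusions=["canceling a one-time order", "pausing temporarily",
                "changing the plan tier", "asking about pricing only"],
    confusable_intents=["pause_subscription", "change_plan", "cancel_order"],
    positive_examples=["Stop my membership.", "I want to cancel the premium plan."],
    negative_examples=["I only want to pause it.", "I need to cancel yesterday's order."],
)
print(card.to_text())
```

Keeping exclusions and confusable intents as separate fields means later stages can use them independently, for example to pick sibling negatives or to build the text a reranker sees.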
2. Make the first stage sparse or hybrid, not dense-only
For your setting, I would give lexical evidence more authority than dense similarity.
Why:
- sparse methods keep token-level evidence visible,
- common terms are naturally down-weighted,
- and hybrid retrieval is explicitly recommended in current Sentence Transformers material for combining recall and precision. (Sbert)
That means the first stage should be one of these:
- BM25 or TF-IDF,
- a neural sparse encoder,
- or sparse + dense hybrid retrieval.
This stage should only produce top-k candidate intents. It should not make the final decision.
That is the first place where you solve “high-recall generic terms dominate scoring.” Sparse retrieval is much better than dense similarity at exposing which terms are carrying the match. The Sentence Transformers sparse docs and Hugging Face’s sparse-encoder article both position sparse methods as a useful middle ground between classic lexical retrieval and dense embeddings. (Sbert)
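To make the candidate-generation idea concrete, here is a minimal pure-Python BM25 sketch over one short keyword document per intent. The `BM25Index` class, the toy intent documents, and the `k1` / `b` values are assumptions for illustration; in practice you would likely use an existing BM25 or sparse-encoder implementation.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very small tokenizer: lowercase, drop basic punctuation, split on spaces."""
    return text.lower().replace(".", " ").replace(",", " ").split()


class BM25Index:
    """Okapi BM25 over one document per intent (toy implementation)."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.names = list(docs)
        self.tokens = [tokenize(docs[name]) for name in self.names]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        n = len(self.tokens)
        df = Counter(term for toks in self.tokens for term in set(toks))
        # Rare terms get a high idf, terms shared by many intents a low one.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query: str, i: int) -> float:
        tf = Counter(self.tokens[i])
        norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query) if t in tf
        )

    def top_k(self, query: str, k: int = 5) -> list[str]:
        scored = [(self.score(query, i), name) for i, name in enumerate(self.names)]
        scored = [(s, name) for s, name in scored if s > 0]
        return [name for _, name in sorted(scored, reverse=True)[:k]]


intent_docs = {
    "cancel_subscription": "cancel stop terminate subscription membership recurring plan",
    "pause_subscription": "pause hold suspend subscription membership temporarily",
    "cancel_order": "cancel order purchase delivery one-time",
}
index = BM25Index(intent_docs)
candidates = index.top_k("Stop my membership.", k=5)
print(candidates)
```

The point of the sketch is only that the shared term "membership" contributes less than the rarer term "stop", and that the output is a candidate list, not a final decision.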
3. Put a supervised sparse classifier at the center
For your actual decision layer, I would use a sparse linear classifier with explicit features.
This is the stack:
- word n-grams,
- character n-grams,
- action features,
- object features,
- slot/entity flags,
- negation and modality features,
- state or status features.
In scikit-learn terms, that means:
- TfidfVectorizer for word n-grams,
- TfidfVectorizer(analyzer="char_wb") for robust character features,
- DictVectorizer or similar for hand-built symbolic features,
- FeatureUnion to merge them,
- OneVsRestClassifier with LogisticRegression,
- then calibration and threshold tuning. (scikit-learn)
Why this is a strong fit:
- char_wb preserves useful subword and wording cues without exploding noise. (scikit-learn)
- FeatureUnion lets you combine lexical and symbolic signals in one model. (scikit-learn)
- OneVsRestClassifier gives you per-class decision functions, which is better for class-specific boundaries. (Hugging Face)
- LogisticRegression handles sparse inputs directly. (scikit-learn)
This gives you something dense-only approaches do not: signed evidence. A feature can actively support one class and actively hurt another.
That is how you start modeling “this signal should exclude the class.”
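A minimal sketch of that stack in scikit-learn could look like the following. The `ACTIONS` and `NEGATIONS` lexicons, the `symbolic_features` helper, the tiny training set, and the hyperparameters are all placeholders I made up; calibration and threshold tuning are left out to keep it short.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer

ACTIONS = {"cancel", "stop", "pause", "change", "switch", "upgrade"}  # hypothetical lexicon
NEGATIONS = {"not", "don't", "never", "only"}                         # hypothetical lexicon


def symbolic_features(texts):
    """Turn each utterance into a dict of hand-built action / negation flags."""
    rows = []
    for text in texts:
        tokens = text.lower().split()
        row = {f"action={t}": 1.0 for t in tokens if t in ACTIONS}
        row["has_negation"] = float(any(t in NEGATIONS for t in tokens))
        rows.append(row)
    return rows


features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("symbolic", make_pipeline(FunctionTransformer(symbolic_features), DictVectorizer())),
])
model = Pipeline([
    ("features", features),
    ("clf", OneVsRestClassifier(LogisticRegression(C=10.0, max_iter=1000))),
])

train_texts = [
    "cancel my subscription", "stop my membership", "i want to cancel the premium plan",
    "pause my subscription", "put my membership on hold", "i only want to pause it",
    "change my plan tier", "switch me to the basic plan", "upgrade my plan",
]
train_labels = ["cancel_subscription"] * 3 + ["pause_subscription"] * 3 + ["change_plan"] * 3
model.fit(train_texts, train_labels)

clf = model.named_steps["clf"]
probs = model.predict_proba(["please pause my plan"])
print(dict(zip(clf.classes_, probs[0].round(3))))

# Each one-vs-rest estimator has its own signed weights: positive values
# support the class, negative values count against it.
coef = clf.estimators_[0].coef_
print(clf.classes_[0], "weight range:", coef.min(), coef.max())
```

Because every intent gets its own binary logistic regression, the same feature (say `action=pause`) can carry a positive weight for one intent and a negative weight for its neighbor.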
4. Use a reranker only for close calls
Your NLI-style second pass is a good instinct. The problem is using it too broadly.
A better setup is:
- first stage retrieves top 3–5 intents,
- second stage reranks only those candidates against the full intent card.
That is exactly the retrieve-and-rerank pattern recommended in the Sentence Transformers docs. (Sbert)
This second stage can be:
- a cross-encoder,
- an NLI model,
- or a label-description matching model.
The “complex class descriptions” paper and the intent-aware encoder paper both support this direction: intent classification improves when the model is allowed to align utterances with richer intent semantics, not just raw surface similarity. (ACL Anthology)
So the reranker’s question is not:
“Which class is nearest in embedding space?”
It is:
“Given these few candidate intents, which full intent description best matches this utterance, and which exclusions are violated?”
That is a much better question.
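The plumbing for this second stage can be sketched independently of the model that does the scoring. Below, `rerank` takes only the shortlisted intents and a pluggable `score_fn`; the `toy_overlap_score` function is a deliberately crude stand-in that I use so the example runs offline. In a real system that slot would be filled by a cross-encoder or NLI model.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def toy_overlap_score(utterance: str, card_text: str) -> float:
    """Placeholder scorer: share of utterance tokens that also appear in the card."""
    utt = tokens(utterance)
    return len(utt & tokens(card_text)) / len(utt) if utt else 0.0


def rerank(
    utterance: str,
    candidate_cards: dict[str, str],
    score_fn: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Score the utterance against each shortlisted intent card, best first."""
    scored = [(name, score_fn(utterance, text)) for name, text in candidate_cards.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Only the first-stage candidates are passed in, each with its full description.
shortlist = {
    "cancel_subscription": "The user wants to end an existing subscription or stop future renewals.",
    "pause_subscription": "The user wants to pause a subscription temporarily and resume it later.",
}
ranked = rerank("I only want to pause it.", shortlist, toy_overlap_score)
print(ranked)
```

Keeping `score_fn` as a parameter makes it easy to swap the placeholder for a stronger matcher without touching the retrieval or thresholding code.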
5. Model negative boundaries with data, not rules
Your biggest gap is probably here.
The clean way to encode negative boundaries is to build three kinds of negatives for each intent:
Sibling negatives
Examples from the most confusable neighboring intents.
Minimal-pair negatives
Small edits that flip the class:
- change the action,
- change the object,
- add or remove negation,
- change the status,
- swap the relation.
Hard-negative OOS
Examples that look domain-relevant and share vocabulary, but belong to no supported intent.
This is directly supported by the hard-negative OOS paper. Generic OOS is too easy. You need close, misleading negatives. (ACL Anthology)
For example, if an intent is change_billing_date, useful hard negatives are not random unrelated queries. They are things like:
- “Why was I billed on this date?”
- “Can I change my payment method?”
- “Pause my subscription until next month.”
- “Move my renewal to next week.”
These sit near the decision boundary. That is exactly where your system is weak.
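One cheap way to seed minimal-pair negatives is to swap a single action or object word and record which intent the edited utterance should now belong to. The `swaps` table and the `minimal_pairs` helper below are hypothetical, and the generated pairs would still need a human check before going into training data.

```python
def minimal_pairs(
    utterance: str,
    swaps: dict[str, list[tuple[str, str]]],
) -> list[tuple[str, str]]:
    """Return (edited utterance, new intent) pairs that differ by one token."""
    words = utterance.split()
    pairs = []
    for i, word in enumerate(words):
        for replacement, new_intent in swaps.get(word.lower(), []):
            edited = words[:i] + [replacement] + words[i + 1:]
            pairs.append((" ".join(edited), new_intent))
    return pairs


# Hypothetical swap table: token -> [(replacement token, intent after the swap)]
swaps = {
    "cancel": [("pause", "pause_subscription")],
    "subscription": [("order", "cancel_order")],
}
pairs = minimal_pairs("cancel my subscription", swaps)
for text, intent in pairs:
    print(f"{text!r} -> {intent}")
```

Each generated pair is a positive for the neighboring intent and, at the same time, a near-boundary negative for the original one.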
6. Treat confusion pairs as first-class objects
Do not only inspect global accuracy, macro-F1, or top-1 intent accuracy.
Build a confusion graph:
- nodes = intents,
- edge weight = how often two intents appear as top-2 candidates or get confused.
Then take the worst edges and treat each as its own subproblem.
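A small sketch of that graph, assuming you log the gold intent plus the top-1 and top-2 predictions for each evaluation utterance (the record format and the `confusion_edges` name are my own):

```python
from collections import Counter


def confusion_edges(records):
    """Count, per unordered intent pair, how often the pair co-occurs as
    top-1/top-2 or as gold/wrong prediction. records: (gold, top1, top2)."""
    edges = Counter()
    for gold, top1, top2 in records:
        pairs = {tuple(sorted((top1, top2)))}
        if top1 != gold:
            pairs.add(tuple(sorted((gold, top1))))
        for pair in pairs:  # a set, so one record never counts a pair twice
            edges[pair] += 1
    return edges


records = [
    ("cancel_subscription", "cancel_subscription", "pause_subscription"),
    ("pause_subscription", "cancel_subscription", "pause_subscription"),
    ("cancel_order", "cancel_subscription", "change_plan"),
]
edges = confusion_edges(records)
worst = edges.most_common(1)
print(worst)
```

Sorting the edges by weight gives you a worklist of intent pairs to review, instead of a single aggregate score.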
For each bad pair, ask:
- are the positive definitions actually distinct,
- are the exclusions explicit,
- are the examples balanced,
- should this distinction be an entity or slot instead of an intent split,
- is this actually a multi-intent case,
- or should the pair be merged?
Redwood is the main source behind this advice. It shows that collision handling is not optional when intent sets grow or overlap. (ACL Anthology)
There is also a very practical community lesson here. Threads from Rasa and Stack Overflow often end up concluding that some “different intents” are actually the same intent plus a different entity or slot. That is not a deep theorem, but it is a useful diagnostic pattern. (Rasa Community Forum)
7. Add a real reject path
You should not force a label for every input.
Use:
- a per-intent acceptance threshold,
- a top-1 minus top-2 margin threshold,
- and optionally an OOS detector.
scikit-learn now provides both CalibratedClassifierCV and TunedThresholdClassifierCV. The first calibrates scores. The second explicitly tunes the cut-off used to turn scores into labels; it targets binary classifiers, so in a multi-intent setup it applies to one one-vs-rest decision at a time. That is the tooling you want for “accept, reject, or escalate.” (scikit-learn)
This is much cleaner than hand-written rules like “if score < 0.62, return unknown.”
It also aligns with a real production pain point: confidence scores are often unreliable across intents, especially when some classes are much tighter than others. (Rasa Community Forum)
A concrete lightweight pipeline
This is the pipeline I would actually recommend.
Stage A. Candidate generation
Use:
- TF-IDF or BM25,
- or a sparse / hybrid retriever.
Goal:
- high recall,
- top 5 intent candidates,
- transparent lexical evidence.
Supported by current sparse and retrieve-rerank guidance. (Sbert)
Stage B. Feature-based supervised classifier
Train an OvR classifier on:
- word TF-IDF,
- char / char_wb TF-IDF,
- action/object/slot/state/negation features,
- optional metadata.
Goal:
- class-specific signed evidence,
- interpretable coefficients,
- better precision on overlapping classes. (scikit-learn)
Stage C. Pairwise reranking
For only the top few candidates:
- compare utterance vs full intent card,
- score entailment / compatibility / exclusion violations.
Goal:
- resolve the close cases where the sparse backbone still hesitates. (ACL Anthology)
Stage D. Thresholded decision
Use:
- calibrated probability or decision score,
- per-class threshold,
- top-1 minus top-2 margin,
- OOS fallback.
Goal:
- accept, reject, or escalate instead of forcing a label on every input.
How to prioritize discriminative signals
You asked specifically about weighting discriminative signals over generic ones.
I would do it in four ways.
A. Use sparse lexical features
That naturally suppresses common terms better than dense similarity. (Sbert)
B. Add feature selection
For non-negative sparse features, scikit-learn’s chi2 can rank features by class association. That gives you a simple way to identify which terms are actually discriminative and which are just frequent. (scikit-learn)
C. Use regularized linear models
Regularized logistic regression on sparse features is a strong baseline for text classification and handles sparse inputs directly. (scikit-learn)
D. Split generic terms from functional features
Do not let “subscription,” “account,” “billing,” or “transfer” be the only strong signals. Put action and constraint features in their own channel so the model can learn that:
- cancel + subscription matters,
- pause + subscription is different,
- why + subscription billed is different again.
That separation is the practical version of the “functional signals” idea you raised.
Better intent representations than plain embeddings
The strongest alternatives are:
1. Intent cards
Best overall option for your use case. Supported by label-semantics and complex-description work. (ACL Anthology)
2. Intent name + keyphrase set
Supported by the intent-aware encoder work, which tries to align utterances with intent names and key phrases rather than only whole utterance similarity. (ACL Anthology)
3. Positive and negative prototypes
Instead of one prototype per class, keep:
- one positive prototype bank,
- one exclusion prototype bank.
Then score both:
- “How much does this look like class A?”
- “How much does this violate class A?”
That is not directly from one single paper, but it is consistent with the hard-negative and complex-description literature. (ACL Anthology)
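Here is one way the two-bank scoring could be sketched. Token-level Jaccard similarity stands in for whatever similarity you actually use (sparse or dense), and the `class_score` function, the `penalty` weight, and the example banks are all assumptions on my part.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here only as a stand-in similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def bank_score(utterance: str, bank: list[str]) -> float:
    """Similarity to the closest prototype in a bank."""
    return max((jaccard(utterance, proto) for proto in bank), default=0.0)


def class_score(utterance: str, positives: list[str], exclusions: list[str],
                penalty: float = 1.0) -> float:
    """How much the utterance looks like the class, minus how much it violates it."""
    return bank_score(utterance, positives) - penalty * bank_score(utterance, exclusions)


# Two banks for the cancel_subscription intent (illustrative prototypes).
positives = ["cancel my subscription", "stop my membership"]
exclusions = ["pause my subscription", "cancel my order"]

in_class = class_score("please cancel my subscription", positives, exclusions)
near_miss = class_score("please pause my subscription", positives, exclusions)
print(round(in_class, 2), round(near_miss, 2))
```

The near-miss still resembles the positive bank, but it resembles the exclusion bank more, so its net score drops below zero. That is the behavior a single positive prototype cannot give you.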
What data work matters most
With only a few examples per intent, the biggest gains usually come from better examples, not a fancier model.
The most valuable new examples are:
- confusion-pair minimal pairs,
- hard-negative near-misses,
- OOS queries that share vocabulary with in-scope intents,
- and examples covering status and constraint changes.
The few-shot retrieval paper is useful here because it treats adaptation as changing the retrieval index and the local evidence, not retraining a giant model every time. That is very aligned with your low-data constraint. (ACL Anthology)
Common pitfalls for your exact setup
1. Using short intent labels as if they were enough
They usually are not. (ACL Anthology)
2. Trusting dense similarity too early
Even sentence-transformer users have reported cases where BM25 outperformed embedding search on their task. That is anecdotal, not universal, but it is a good warning against assuming semantic search always wins. (GitHub)
3. Using generic OOS only
Too easy. Hard-negative OOS is the real test. (ACL Anthology)
4. Treating every bad confusion as a model issue
Some are taxonomy issues or mislabeled data issues. BANKING77 even has a paper reporting potential label errors affecting intent-classification experiments. (ACL Anthology)
5. One global threshold for all intents
Usually too crude. Use per-intent thresholding and calibration instead. (scikit-learn)
The best practical starting point
For your exact constraints, the highest-leverage first build is:
- sparse lexical first stage,
- intent cards,
- OvR sparse logistic regression with hand-built functional features,
- rerank top 3 candidates with an NLI or cross-encoder style matcher,
- calibrated thresholds plus OOS fallback,
- and a confusion-driven hard-negative collection loop.
That stack is lightweight, interpretable, fast enough for production, and much better aligned with overlapping semantic boundaries than plain embedding routing. It is also consistent with current few-shot tooling like SetFit, which is explicitly designed for little labeled data and supports a logistic-regression classification head on top of a SentenceTransformer body. (GitHub)
The single most useful next move
Build a baseline that deliberately does not depend on dense similarity:
- word TF-IDF,
- char / char_wb,
- slot/entity/action/object/negation features,
- OvR logistic regression,
- calibration,
- pairwise hard-negative evaluation.
That baseline will tell you very quickly whether your real problem is:
- representation,
- boundary quality,
- or ontology design.
Once you know which of those three is dominant, the rest of the roadmap becomes much clearer.