Hi all, I am using OpenNLP to implement NLP search and using Document Categorization for mapping text search with the expected response. I needed guidance on what should be the size of training data, how much per category rows are required, currently my training data has 5000 rows.
"How much data per category?" is indeed the key question, so let's work through it.
What OpenNLP "Document Categorization" is (and why "how much data" has no single answer)
OpenNLP Doccat is a supervised text classifier: you provide labeled examples (<category> <text>) and it learns to map new text to one of your categories. OpenNLP's manual describes the required training format as one document per line with category + whitespace + text, and notes you can tune iterations and cutoff during training. (opennlp.apache.org)
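As a concrete example, a Doccat training file in that format (hypothetical categories and phrasings) looks like this, one labeled document per line:

```
Billing how do I update my credit card
Billing why was I charged twice this month
Weather will it rain in Pune tomorrow
Refund I want my money back for the cancelled order
```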
Because it's learning from the distribution of words / phrases that appear in each category, the amount of training data you need depends mainly on:
- Number of categories
- How similar categories are (easy: "weather" vs "billing"; hard: "refund" vs "charge dispute")
- How varied your user text is (short queries, typos, synonyms, domain jargon)
- Label quality / consistency
- Whether you handle "none-of-the-above" (out-of-scope)
Practical "how many rows per category" guidance (usable ranges)
Think in terms of examples per category. The ranges below are practical planning numbers, grounded by common intent/NLU guidance and benchmark datasets:
1) Bare minimum (prototype)
- 10–20 examples per category can be enough to get a prototype working for simple, distinct categories. This is a widely used starting recommendation in intent systems. (Google Cloud Documentation)
Reality check: with bag-of-words style models, this often misses phrasing variety and fails on real traffic.
2) Typical "usable baseline"
- 50–100 per category is a common zone where models become noticeably more stable.
- Example: LivePerson recommends 60–100 training phrases per intent for good performance. (developers.liveperson.com)
- Example: CLINC150 (an intent benchmark) uses 100 training samples per intent across 150 intents. (GitHub)
3) More reliable production behavior
- 200–500 per category is often needed when:
- categories overlap semantically,
- users are noisy/short/ambiguous,
- you need higher recall across all categories (including smaller ones).
4) Hard mode (very overlapping categories, high accuracy requirements)
- 1000+ per category is not unusual in benchmarks with strong performance expectations.
- The SNIPS intent benchmark uses roughly 2000 training sentences per intent. (Intel Labs)
Key point: it's not just "more rows"; it's coverage of phrasing variety inside each category.
Interpreting your "5000 rows" (the quick math)
Let C = number of categories.
Average per category = 5000 / C
| Categories (C) | Avg / category | What that usually means |
|---|---|---|
| 10 | 500 | Often strong baseline if labels are clean |
| 25 | 200 | Often workable; focus on overlap + edge cases |
| 50 | 100 | Baseline territory (similar to CLINC150 per-intent scale) (GitHub) |
| 100 | 50 | Fragile unless categories are very distinct |
| 150 | 33 | Usually too thin; many categories will be weak |
What matters even more than the average:
- Smallest category size (long tail)
- Confusable category pairs (taxonomy overlap)
If you have categories with <20–30 examples, expect those to have poor recall unless they're extremely distinct.
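To get these numbers for your own dataset, a small helper (plain Java, no OpenNLP dependency; `rows` is assumed to hold your category/text pairs) can compute the per-category distribution:

```java
import java.util.*;
import java.util.stream.*;

public class CategoryStats {
    // Count examples per category from (category, text) rows.
    public static Map<String, Long> counts(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[0], Collectors.counting()));
    }

    // Min / median / max of the per-category counts
    // (for an even number of categories, the upper-middle value is used).
    public static long[] minMedianMax(Map<String, Long> counts) {
        long[] sorted = counts.values().stream()
                .mapToLong(Long::longValue).sorted().toArray();
        return new long[] { sorted[0], sorted[sorted.length / 2], sorted[sorted.length - 1] };
    }
}
```

If the min is far below the median, you have a long tail that the average (5000 / C) hides.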
OpenNLP-specific pitfalls that affect "how much data you need"
1) Cutoff (feature frequency pruning) can make small datasets look "broken"
OpenNLP's training parameters have defaults: cutoff = 5, iterations = 100. (opennlp.apache.org)
- Cutoff = 5 means features (words/ngrams) seen fewer than 5 times are dropped.
- If your dataset is small or highly split across many categories, many informative features can be pruned away → weak models, "dropped events", or "not enough training data"-style failures seen in practice. (Stack Overflow)
Practical guidance
- If many categories have <50 examples, try a cutoff of 1–2 (or even 0 or 1 in very small regimes) and validate carefully.
- If you have lots of data and noise, cutoff 5+ can be fine.
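As a configuration sketch (assumes opennlp-tools on the classpath and an already-built `ObjectStream<DocumentSample>` named `sampleStream`; class and parameter names are from the OpenNLP API), lowering the cutoff looks like:

```java
// Sketch: assumes opennlp-tools on the classpath and a prepared
// ObjectStream<DocumentSample> named sampleStream.
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // default
params.put(TrainingParameters.CUTOFF_PARAM, "1");       // keep features seen at least once
DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
```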
2) Tokenization mismatch (training vs inference) can collapse predictions
There's a known documentation issue where an example fed a raw string into categorize() even though the method expects token arrays; this is tracked as OPENNLP-1307. (issues.apache.org)
Practical guidance
- Ensure inference uses the same tokenization/normalization as training (lowercasing, punctuation handling, etc.).
- If you change tokenization between train and predict, you can get "everything predicts the same category" behavior. (Stack Overflow)
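One way to guarantee consistency is to route both training-data preparation and inference through a single hypothetical helper like this (plain Java sketch; the exact normalization rules are your choice, the point is that there is only one code path):

```java
import java.util.Locale;

public class SharedTokenizer {
    // One normalization path used for BOTH training-data prep and inference,
    // so the model sees identical features in each phase.
    public static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ") // strip punctuation/symbols
                   .trim()
                   .split("\\s+");
    }
}
```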
3) Bag-of-words alone may not separate similar intents well: use n-grams
OpenNLP supports feature generators such as:
- BagOfWordsFeatureGenerator (opennlp.apache.org)
- NGramFeatureGenerator (e.g., bigrams/trigrams) (opennlp.apache.org)
If your categories depend on short phrases ("reset password" vs "reset account"), n-grams often reduce confusion.
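A minimal illustration of the idea behind `NGramFeatureGenerator` (not OpenNLP's internal code): with unigrams alone, "reset password" and "reset account" share the feature `reset`, while their bigram features differ completely:

```java
import java.util.*;

public class Bigrams {
    // Build token bigrams, the kind of feature an n-gram generator adds
    // on top of single-word features.
    public static List<String> bigrams(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + " " + tokens[i + 1]);
        }
        return out;
    }
}
```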
"How do I decide if I need more data?" (the reliable method)
Build a learning curve from your 5000 rows
Train/evaluate with progressively more data: 10%, 20%, 40%, 60%, 80%, 100%.
- If performance is still improving at 5000 → more labeled data will likely help.
- If it plateaus early → adding more of the same won't help much; prioritize taxonomy cleanup, label quality, feature tuning (n-grams), and out-of-scope handling.
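The fraction splits can be sketched like this (plain Java; assumes you shuffle the data once up front so each subset is representative):

```java
import java.util.List;

public class LearningCurve {
    // First fraction of a pre-shuffled training set, for training runs
    // at 10%, 20%, ... of the data.
    public static <T> List<T> subset(List<T> shuffled, double fraction) {
        int n = (int) Math.round(shuffled.size() * fraction);
        return shuffled.subList(0, n);
    }

    // Training-set sizes for each point on the curve.
    public static int[] curveSizes(int total, double[] fractions) {
        int[] sizes = new int[fractions.length];
        for (int i = 0; i < fractions.length; i++) {
            sizes[i] = (int) Math.round(total * fractions[i]);
        }
        return sizes;
    }
}
```

Train and evaluate a model at each size, then plot accuracy (or F1) against size and look for the plateau.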
Use cross-validation if you donât have a big test set
OpenNLP provides a DoccatCrossValidator to do k-fold evaluation. (opennlp.apache.org)
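For intuition, this is the fold arithmetic that `DoccatCrossValidator` automates: fold `i` is held out for evaluation and the remainder is used for training (simplified plain-Java sketch, no OpenNLP dependency):

```java
import java.util.List;

public class KFold {
    // The held-out slice for fold `fold` of `k` (the last fold absorbs
    // any remainder); everything outside this slice is the training set.
    public static <T> List<T> testFold(List<T> data, int k, int fold) {
        int size = data.size() / k;
        int start = fold * size;
        int end = (fold == k - 1) ? data.size() : start + size;
        return data.subList(start, end);
    }
}
```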
The "NLP search → expected response" pattern: what usually works best
1) Keep categories "intent-like", not "answer-like"
If your categories are essentially full responses (very granular), you often end up with:
- too many categories,
- too few examples per category,
- lots of overlap.
A common approach is:
- classify into a smaller intent set (what user wants),
- then retrieve/assemble the best answer from a knowledge base.
2) Add an out-of-scope / fallback strategy
Benchmarks like CLINC show that models can classify in-scope intents well yet struggle to detect out-of-scope queries. (ACL Anthology)
In practice:
- Use the top-score probability and apply a threshold ("if confidence < T → fall back / ask a clarifying question")
- Optionally train an OTHER/OOS category with representative "none-of-the-above" queries.
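A hypothetical routing helper for the threshold approach, assuming `scores` comes from something like `DocumentCategorizerME.scoreMap(tokens)`:

```java
import java.util.Map;

public class FallbackRouter {
    public static final String FALLBACK = "FALLBACK";

    // Pick the top-scoring category, but route to a fallback
    // (clarifying question, human handoff) when the top score
    // is below the confidence threshold.
    public static String route(Map<String, Double> scores, double threshold) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .orElse(FALLBACK);
    }
}
```

Tune the threshold on held-out data: too low and out-of-scope queries get confidently wrong answers, too high and in-scope queries fall back unnecessarily.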
What I would do next with your dataset (action plan)
- Compute the distribution
  - examples per category (min / median / max)
  - number of categories under 20 / 50 / 100
- Tune OpenNLP training parameters to your regime
  - If the long tail is severe: lower the cutoff from its default of 5 (opennlp.apache.org)
  - Keep iterations reasonable; don't assume "more iterations fixes data"
- Add n-gram features if categories are confusable
  - NGramFeatureGenerator with minGram=2, maxGram=3 is a common starting point (opennlp.apache.org)
- Do targeted data collection
  - Add examples specifically for the top confusion pairs
  - Don't add rows uniformly; add where the model fails
- Implement fallback / OOS handling
  - Thresholding + an "OTHER" category if needed (ACL Anthology)
Bottom line for your "5000 rows"
- If you have ~50 categories (≈100/category average): often a workable baseline, comparable to common intent benchmark scales. (GitHub)
- If you have 100+ categories (≤50/category average): expect fragile behavior unless categories are very distinct; you'll likely need either more data, fewer categories (merge/hierarchy), or stronger features/approach.
- Regardless of count: ensure tokenization consistency (the OPENNLP-1307 class of issues) and tune the cutoff so you aren't discarding most of your signal. (issues.apache.org)
@John6666 Thanks for your response!