Maybe it’s because the model’s weights were trained on news articles? (Detailed version)
Why DistilBERT NER struggles on Android strings.xml-style text
Models like DistilBERT fine-tuned on CoNLL-2003 learn “news text” patterns. Android resource strings add systematic noise that CoNLL rarely contains:
- Wrapper + key noise: `<string name="...">...</string>` wrappers and CamelCase keys produce false positives if you run NER on the raw line.
- UI templates: “X would like to …” and “Turned off for ‘X’” contain Title Case phrases that look entity-like.
- ID / token-y text: `AADHAAR-OTP`, uppercase codes, and punctuation-heavy tokens.
- Subword boundary quirks: pipeline aggregation can hide tokenization artifacts; offset-based reconstruction is often more stable for char spans. (GitHub)
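The offset-based reconstruction mentioned above can be sketched without a model in the loop. The `fake_entities` list below just mimics the shape of Hugging Face pipeline output (`entity_group`, `start`, `end`); it is mocked data, not a real prediction:

```python
def spans_from_offsets(text, entities):
    """Slice the original string by (start, end) offsets so subword artifacts
    (e.g. '##' pieces, dropped spaces) never leak into the surface form."""
    out = []
    for ent in entities:
        out.append({
            "label": ent["entity_group"],
            "text": text[ent["start"]:ent["end"]],  # always the exact source chars
            "start": ent["start"],
            "end": ent["end"],
        })
    return out

text = 'Allow "Google Pay" to access your location?'
fake_entities = [{"entity_group": "ORG", "start": 7, "end": 17}]  # mocked pipeline output
print(spans_from_offsets(text, fake_entities))
```

Because the surface form is re-sliced from the original string, whatever the tokenizer did internally cannot corrupt the reported span text.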
Recommended approach (practical path that usually works)
1) Normalize the input (biggest immediate precision gain)
Run the model on the extracted value only, not the raw XML-ish line. Keep name= as metadata if you want, but don’t feed it into the NER model by default.
This is also where you:
- unescape `\"` and XML entities,
- strip outer quotes,
- optionally drop obvious wrappers.
(If you later need name signals, treat it as a separate feature or a second-stage reranker.)
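A minimal normalization sketch of this step, assuming the simple single-line `<string>` layout shown earlier; the regex and escape handling are illustrative, not a full strings.xml parser:

```python
import html
import re

# Illustrative pattern for one-line <string name="...">value</string> entries.
STRING_RE = re.compile(r'<string\s+name="(?P<key>[^"]+)"\s*>(?P<value>.*?)</string>', re.S)

def extract_value(line: str):
    """Return (key, normalized value); the key is metadata only, never NER input."""
    m = STRING_RE.search(line)
    if not m:
        return None, None
    value = html.unescape(m.group("value"))                # XML entities (&amp; etc.)
    value = value.replace('\\"', '"').replace("\\'", "'")  # Android escape sequences
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] == '"':   # strip outer quotes only
        value = value[1:-1]
    return m.group("key"), value

key, value = extract_value('<string name="cam_perm">\\"Amazon\\" would like to use Camera</string>')
```

Only `value` is fed to the NER model; `key` can later feed a second-stage reranker as a separate feature.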
2) Create a small in-domain labeled dataset (yes, this is the core)
For your domain, there isn’t a public dataset that matches the noise/UI-template distribution, so a custom dataset is the correct direction.
Key points:
- Define a labeling policy for ambiguous tokens (e.g., is `AADHAAR-OTP` an entity? Usually it’s an ID/product token → often “not an entity”).
- Include many hard negatives: UI nouns (“Camera”, “Location Services”, “Office”) that should not be ORG/LOC/PER.
- Split train/valid/test by app/package (prevents leakage of app-specific brand strings).
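One deterministic way to get the by-package split: hash each app/package id into a bucket so every string from one app lands in the same split. The field names (`package`, `text`) are illustrative assumptions about your data layout:

```python
import hashlib

def split_for(package_id: str, valid_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically assign a package to train/valid/test by hashing its id.
    All strings from the same app share a split, so app-specific brand
    strings cannot leak from train into valid/test."""
    bucket = int(hashlib.md5(package_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"

rows = [
    {"package": "com.amazon.shop", "text": "Amazon would like to use Camera"},
    {"package": "com.whatsapp", "text": 'Turned off for "Camera"'},
]
splits = {r["package"]: split_for(r["package"]) for r in rows}
```

Hashing (rather than random assignment) keeps the split stable when you re-extract the corpus later.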
Annotation tooling options:
- Argilla supports span/token labeling workflows and has token-classification tutorials; Hugging Face also has practical posts on fine-tuning with annotated domain data. (Argilla Docs)
3) Bootstrap labels with weak supervision (to scale faster)
Before labeling thousands of strings manually, use high-precision heuristics to generate “weak labels”, then review/correct:
Examples of high-precision rules in your domain:
- quoted app names: `for "Amazon"` → ORG
- template heads: `Amazon would like to ...` → ORG
- known brand list/gazetteer matches (Amazon, Google Pay, Microsoft, …) → ORG
Snorkel-style weak supervision is a common way to combine multiple heuristic labeling functions into training labels. (Snorkel AI)
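Snorkel’s label model is the principled way to combine labeling functions; as a dependency-free sketch of the same idea, several high-precision rules can vote and you keep only spans where enough of them agree. The brand list and regexes below are illustrative:

```python
import re

BRANDS = {"Amazon", "Google Pay", "Microsoft"}  # illustrative gazetteer

def lf_quoted(text):
    # Quoted substrings in templates like: Allow "Amazon" to ...
    return [(m.start(1), m.end(1), "ORG") for m in re.finditer(r'"([^"]+)"', text)]

def lf_template_head(text):
    # Template heads like: Amazon would like to ...
    m = re.match(r"([A-Z][\w ]*?) would like to", text)
    return [(m.start(1), m.end(1), "ORG")] if m else []

def lf_gazetteer(text):
    # Exact gazetteer matches.
    out = []
    for brand in BRANDS:
        i = text.find(brand)
        if i != -1:
            out.append((i, i + len(brand), "ORG"))
    return out

def weak_labels(text, lfs=(lf_quoted, lf_template_head, lf_gazetteer), min_votes=2):
    """Keep (start, end, label) spans proposed by at least `min_votes` functions."""
    votes = {}
    for lf in lfs:
        for span in lf(text):
            votes[span] = votes.get(span, 0) + 1
    return sorted(s for s, v in votes.items() if v >= min_votes)

print(weak_labels("Amazon would like to use your camera"))
```

These weak labels still go through a human review pass; the voting threshold just keeps the queue small and high-precision.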
4) Do continued pretraining (domain-adaptive / task-adaptive) before NER fine-tuning
This step helps the model “get used to” your string distribution even without labels:
- DAPT/TAPT: continue masked-language-model pretraining on a large corpus of your extracted value strings (hundreds of thousands to millions if possible), then fine-tune on your labeled NER set.
- This approach is supported by results showing in-domain continued pretraining improves downstream tasks, especially in low-resource settings. (arXiv)
5) Fine-tune for token classification correctly (label alignment matters)
Use standard Hugging Face token-classification training, but pay attention to:
- fast tokenizer alignment (`word_ids()` / offsets),
- `-100` labels for special tokens and non-first subtokens,
- entity-level evaluation.
Hugging Face provides step-by-step guidance for NER fine-tuning and label alignment. (Hugging Face)
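The alignment step can be sketched as a small pure function, following the pattern in the HF token-classification guide: the first subtoken of each word keeps the word’s label, while later subtokens and special tokens get `-100` so the loss ignores them:

```python
def align_labels(word_labels, word_ids):
    """Map word-level labels onto subtoken positions from a fast tokenizer's
    word_ids(): specials and continuation subtokens are masked with -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # [CLS], [SEP], padding
            aligned.append(-100)
        elif wid != prev:          # first subtoken of a word
            aligned.append(word_labels[wid])
        else:                      # continuation subtoken
            aligned.append(-100)
        prev = wid
    return aligned

# e.g. "AADHAAR-OTP sent": one word split into 3 subtokens, one into 1,
# with specials on both ends -> word_ids like [None, 0, 0, 0, 1, None]
print(align_labels([0, 0], [None, 0, 0, 0, 1, None]))
# -> [-100, 0, -100, -100, 0, -100]
```

In practice `word_ids` comes from `tok(..., is_split_into_words=True).word_ids(batch_index)`, and the result goes into the `labels` column of the dataset.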
6) Add a small postprocessing layer (domain templates + ID suppression)
Even after training, UI strings benefit from guardrails:
- suppress ID-like tokens (ALLCAPS, `AAA-BBB` patterns, short codes),
- prioritize quoted substrings as ORG in certain templates,
- deduplicate overlapping spans.
This typically boosts precision a lot with minimal complexity.
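A sketch of those guardrails in one pass; the ID pattern is an assumption tuned to this domain (hyphenated uppercase codes, ALLCAPS, very short tokens), and the span dicts mimic pipeline-style output:

```python
import re

# Illustrative "looks like an ID/code, not an entity" pattern.
ID_LIKE = re.compile(r"^(?:[A-Z]{2,}(?:-[A-Z0-9]{2,})+|[A-Z]{4,}|\w{1,3})$")

def postprocess(spans):
    """spans: list of {'text','label','start','end','score'} dicts.
    Suppress ID-like tokens, then resolve overlaps by keeping the
    longest (then highest-scoring) span."""
    kept = []
    for s in sorted(spans, key=lambda s: (-(s["end"] - s["start"]), -s["score"])):
        if ID_LIKE.match(s["text"]):   # AADHAAR-OTP, ALLCAPS, short codes
            continue
        overlaps = any(not (s["end"] <= k["start"] or s["start"] >= k["end"])
                       for k in kept)
        if overlaps:                   # a longer/stronger span already won
            continue
        kept.append(s)
    return sorted(kept, key=lambda s: s["start"])

spans = [
    {"text": "AADHAAR-OTP", "label": "ORG", "start": 0, "end": 11, "score": 0.70},
    {"text": "Amazon", "label": "ORG", "start": 20, "end": 26, "score": 0.95},
]
print(postprocess(spans))
```

The quoted-substring/template prioritization from step 3’s heuristics can be layered on top of this as an additional reranking rule.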
If you want models that are more robust to weird tokens
If subword tokenization keeps hurting (lots of hyphens, codes, obfuscation artifacts), consider a tokenization-free / byte/character encoder for comparison:
- ByT5 (byte-level) and CANINE (character-level) are explicitly designed to be more robust to noise/tokenization issues. (arXiv)
These can be slower and require longer sequences, but they’re worth a small benchmark on your labeled subset.
Starter hyperparameters (sane defaults)
These are good initial values for DistilBERT-style token classification; tune based on validation F1.
| Setting | Starting point |
| --- | --- |
| `max_length` | 128 (or 256 if values are long) |
| batch size | 16 (GPU) / 4–8 (CPU) |
| `lr` | 2e-5 to 5e-5 |
| epochs | 3–10 (small data → more epochs, but watch for overfitting) |
| `warmup_ratio` | 0.06 |
| `weight_decay` | 0.01 |
| `grad_clip` | 1.0 |
Rule of thumb: if you have <5k labeled strings, continued pretraining + strong postprocessing often matters more than tiny LR tweaks.
Minimal code skeletons (continued pretraining → NER fine-tune)
A) Continued pretraining (MLM) on unlabeled value strings
```python
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "distilbert-base-cased"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
mlm = AutoModelForMaskedLM.from_pretrained(model_id)

# texts = [...]  # your extracted value strings (already parsed/unescaped)
ds = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=128)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt_ckpt",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    max_steps=2000,
    weight_decay=0.01,
    warmup_ratio=0.06,
    logging_steps=50,
    save_steps=500,
    report_to=[],
)
trainer = Trainer(model=mlm, args=args, train_dataset=ds_tok, data_collator=collator)
trainer.train()
trainer.save_model("dapt_ckpt")
```
B) Fine-tune token classification (PER/ORG/LOC) on your labeled set
Follow the Hugging Face token classification recipe (label alignment is the key part). (Hugging Face)
```python
from transformers import (
    AutoModelForTokenClassification, DataCollatorForTokenClassification,
    Trainer, TrainingArguments,
)

# tok comes from part A; train_ds / valid_ds are your labeled splits.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "dapt_ckpt", num_labels=len(labels), id2label=id2label, label2id=label2id
)
collator = DataCollatorForTokenClassification(tok)

# train_ds/valid_ds must contain tokens/words + ner_tags, or span annotations
# converted to BIO tags. Use tok(..., is_split_into_words=True) + word_ids()
# alignment per the HF docs (see citations).
args = TrainingArguments(
    output_dir="ner_ckpt",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    logging_steps=50,
    evaluation_strategy="steps",  # renamed to `eval_strategy` in recent transformers
    eval_steps=200,
    save_steps=200,
    report_to=[],
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=valid_ds, data_collator=collator)
trainer.train()
```
High-quality references (training + annotation + weak supervision)
- Hugging Face Transformers: token classification / NER fine-tuning guide (label alignment, `-100`, subtokens). (Hugging Face)
- Hugging Face LLM course: token classification + alignment explanation. (Hugging Face)
- “Don’t Stop Pretraining” (DAPT/TAPT motivation and empirical gains). (arXiv)
- Argilla token classification tutorial + example fine-tuning workflow in a domain (legal). (Argilla Docs)
- Snorkel weak supervision overview + discussion around NER data creation. (Snorkel AI)
- Tokenization-free robustness options (ByT5 / CANINE). (arXiv)