Fine-tuning a DistilBERT-cased model for NER on app binaries' string resources (ARSC data)

I have extracted app strings (ARSC resource data) from applications using reverse-engineering tools like Androguard, apktool, etc.
Here are some sample texts on which the model will do NER:

"<string name=\"ARKitCameraPermissionTitle\">Amazon Would Like to Access the Camera</string>"
"<string name=\"ARKitCellularDataTitle\">Cellular Data is Turned Off for \"Amazon\"</string>
<string name=\"aadhaar_otp_info_2\">AADHAAR-OTP will be auto read.</string>
<string name=\"amazon_pay_upi_txt\">Amazon Pay UPI</string>

I need to do NER on this kind of highly noisy, code-like data. Previously I was using a DistilBERT model for this, but it performed very poorly, maybe because the model was trained on free text and has not seen much code-like data before.

What approach should I take to make sure my model has context for this kind of text and does better NER, especially in the PERSON, ORGANIZATION, and LOCATION categories?

I saw that fine-tuning on a domain dataset helps in these scenarios, so is it correct for me to go ahead with a custom domain-based dataset? Or should I look at something else to improve the model's output on my example texts?
Also, any sample code scripts, blogs, docs, and training tips related to fine-tuning, or other approaches I should use, would be really appreciated. Anything related to good hyperparameters for this approach would also be helpful.

Thanks in advance


Maybe it’s because the model’s weights were trained on news articles? Here’s the detailed version.


Why DistilBERT NER struggles on Android strings.xml-style text

Models like DistilBERT fine-tuned on CoNLL-2003 learn “news text” patterns. Android resource strings add systematic noise that CoNLL rarely contains:

  • Wrapper + key noise: <string name="..."> ... </string> and CamelCase keys produce false positives if you run NER on the raw line.
  • UI templates: “X would like to …”, “Turned off for “X”” contain Title Case phrases that look entity-like.
  • ID / token-y text: AADHAAR-OTP, uppercase codes, and punctuation-heavy tokens.
  • Subword boundary quirks: pipeline aggregation can hide tokenization artifacts; offset-based reconstruction is often more stable for char spans. (GitHub)

Recommended approach (practical path that usually works)

1) Normalize the input (biggest immediate precision gain)

Run the model on the extracted value only, not the raw XML-ish line. Keep name= as metadata if you want, but don’t feed it into the NER model by default.

This is also where you:

  • unescape \", XML entities,
  • strip outer quotes,
  • optionally drop obvious wrappers.

(If you later need name signals, treat it as a separate feature or a second-stage reranker.)
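To make step 1 concrete, here is a minimal normalization sketch. The regex, the helper name, and the escape handling are illustrative, not a complete ARSC parser; a real dump should go through a proper XML parser where possible.

```python
import html
import re

# Illustrative pattern for single-line <string> entries.
STRING_RE = re.compile(r'<string name="(?P<key>[^"]+)">(?P<value>.*?)</string>', re.S)

def extract_value(line):
    """Return {'key', 'value'} for a strings.xml-style line, or None."""
    line = line.strip().strip('"')                       # drop stray outer quotes from dumps
    line = line.replace('\\"', '"').replace("\\'", "'")  # undo Android-style escaping
    m = STRING_RE.search(line)
    if not m:
        return None
    return {"key": m.group("key"),
            "value": html.unescape(m.group("value"))}    # XML entities like &amp;

print(extract_value('<string name="amazon_pay_upi_txt">Amazon Pay UPI</string>'))
# {'key': 'amazon_pay_upi_txt', 'value': 'Amazon Pay UPI'}
```

Feed only the "value" field to the NER model; carry "key" along as metadata.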


2) Create a small in-domain labeled dataset (yes, this is the core)

For your domain, there isn’t a public dataset that matches the noise/UI-template distribution, so a custom dataset is the correct direction.

Key points:

  • Define a labeling policy for ambiguous tokens (e.g., is AADHAAR-OTP an entity? usually it’s an ID/product token → often “not an entity”).
  • Include many hard negatives: UI nouns (“Camera”, “Location Services”, “Office”) that should not be ORG/LOC/PER.
  • Split train/valid/test by app/package (prevents leakage of app-specific brand strings).
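As a sketch of the leakage-free split, shuffle package names first and then assign whole packages to splits (this assumes each record carries a "package" field; the field names are illustrative):

```python
import random

def split_by_package(records, valid_frac=0.1, test_frac=0.1, seed=13):
    """Split records so all strings from one app package land in the same split."""
    packages = sorted({r["package"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(packages)
    n = len(packages)
    n_test = max(1, int(n * test_frac))
    n_valid = max(1, int(n * valid_frac))
    test_pkgs = set(packages[:n_test])
    valid_pkgs = set(packages[n_test:n_test + n_valid])
    buckets = {"train": [], "valid": [], "test": []}
    for r in records:
        if r["package"] in test_pkgs:
            buckets["test"].append(r)
        elif r["package"] in valid_pkgs:
            buckets["valid"].append(r)
        else:
            buckets["train"].append(r)
    return buckets
```

A random split over individual strings would let the same brand-heavy app strings appear in both train and test, inflating your F1.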

Annotation tooling options:

  • Argilla supports span/token labeling workflows and has token-classification tutorials; Hugging Face also has practical posts on fine-tuning with annotated domain data. (Argilla Docs)

3) Bootstrap labels with weak supervision (to scale faster)

Before labeling thousands of strings manually, use high-precision heuristics to generate “weak labels”, then review/correct:

Examples of high-precision rules in your domain:

  • quoted app names: for "Amazon" → ORG
  • template heads: Amazon would like to ... → ORG
  • known brand list/gazetteer matches (Amazon, Google Pay, Microsoft, …) → ORG

Snorkel-style weak supervision is a common way to combine multiple heuristic labeling functions into training labels. (Snorkel AI)
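A minimal sketch of such labeling functions, without the Snorkel dependency (the gazetteer, the regexes, and the first-LF-wins precedence rule are all illustrative):

```python
import re

BRANDS = {"Amazon", "Google Pay", "Microsoft"}   # illustrative gazetteer

def lf_quoted_name(text):
    """Quoted app names in UI templates are usually the app's ORG."""
    return [(m.start(1), m.end(1), "ORG") for m in re.finditer(r'"([^"]+)"', text)]

def lf_template_head(text):
    """'<X> would like to ...' templates: X is typically the app/brand."""
    m = re.match(r"([A-Z][\w ]*?) [Ww]ould [Ll]ike to", text)
    return [(m.start(1), m.end(1), "ORG")] if m else []

def lf_gazetteer(text):
    spans = []
    for brand in BRANDS:
        for m in re.finditer(re.escape(brand), text):
            spans.append((m.start(), m.end(), "ORG"))
    return spans

def weak_label(text, lfs=(lf_quoted_name, lf_template_head, lf_gazetteer)):
    """Union the LF outputs; earlier LFs win on overlap (crude precedence vote)."""
    chosen = []
    for lf in lfs:
        for s, e, tag in lf(text):
            if not any(s < ce and cs < e for cs, ce, _ in chosen):
                chosen.append((s, e, tag))
    return sorted(chosen)
```

The weak spans then get a human review pass before becoming training labels; Snorkel replaces the crude precedence vote with a learned label model.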


4) Do continued pretraining (domain-adaptive / task-adaptive) before NER fine-tuning

This step helps the model “get used to” your string distribution even without labels:

  • DAPT/TAPT: continue masked-language-model pretraining on a large corpus of your extracted value strings (hundreds of thousands to millions if possible), then fine-tune on your labeled NER set.
  • This approach is supported by results showing in-domain continued pretraining improves downstream tasks, especially in low-resource settings. (arXiv)

5) Fine-tune for token classification correctly (label alignment matters)

Use standard Hugging Face token-classification training, but pay attention to:

  • fast tokenizer alignment (word_ids() / offsets),
  • -100 for special tokens and non-first subtokens,
  • entity-level evaluation.

Hugging Face provides step-by-step guidance for NER fine-tuning and label alignment. (Hugging Face)
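The alignment step can be isolated as a small function over the tokenizer's word_ids() output; this is a sketch of the logic from the HF guide, using the -100 convention for special tokens and continuation subtokens:

```python
def align_labels_with_tokens(word_ids, word_labels, label2id):
    """Map word-level BIO labels onto subword tokens.

    word_ids: output of tokenizer(..., is_split_into_words=True).word_ids()
    Special tokens (None) and non-first subtokens get -100 so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                 # [CLS] / [SEP] / padding
            aligned.append(-100)
        elif wid != prev:               # first subtoken of a word
            aligned.append(label2id[word_labels[wid]])
        else:                           # continuation subtoken
            aligned.append(-100)
        prev = wid
    return aligned

# Synthetic example: word 0 split into two subtokens, surrounded by special tokens.
label2id = {"O": 0, "B-ORG": 1, "I-ORG": 2}
print(align_labels_with_tokens([None, 0, 0, 1, 2, None], ["B-ORG", "O", "O"], label2id))
# [-100, 1, -100, 0, 0, -100]
```

For entity-level evaluation, decode predictions back to BIO tags (skipping -100 positions) and score with seqeval rather than per-token accuracy.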


6) Add a small postprocessing layer (domain templates + ID suppression)

Even after training, UI strings benefit from guardrails:

  • suppress ID-like tokens (ALLCAPS, AAA-BBB, short codes),
  • prioritize quoted substrings as ORG in certain templates,
  • deduplicate overlapping spans.

This typically boosts precision a lot with minimal complexity.
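A sketch of these guardrails (the ID regex and the span dict shape are assumptions; adapt them to whatever your model's prediction format is):

```python
import re

# ID-like surfaces: AAA-BBB hyphenated codes and short ALLCAPS tokens.
ID_RE = re.compile(r"^(?:[A-Z0-9]{2,}(?:-[A-Z0-9]+)+|[A-Z]{2,6}\d*)$")

def postprocess(spans, text):
    """spans: dicts with 'start', 'end', 'label', 'score'. Returns filtered spans."""
    kept = []
    for sp in sorted(spans, key=lambda s: -s.get("score", 0.0)):
        surface = text[sp["start"]:sp["end"]]
        if ID_RE.match(surface):        # suppress ID-like tokens (AADHAAR-OTP, ...)
            continue
        if any(sp["start"] < k["end"] and k["start"] < sp["end"] for k in kept):
            continue                    # drop lower-scoring overlapping spans
        kept.append(sp)
    return sorted(kept, key=lambda s: s["start"])
```

The same loop is a natural place to add the template rule (promote quoted substrings to ORG) as another pass before the overlap check.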


If you want “more robust to weird tokens” models

If subword tokenization keeps hurting (lots of hyphens, codes, obfuscation artifacts), consider a tokenization-free / byte/character encoder for comparison:

  • ByT5 (byte-level) and CANINE (character-level) are explicitly designed to be more robust to noise/tokenization issues. (arXiv)
    These can be slower/longer-sequence, but they’re worth a small benchmark on your labeled subset.

Starter hyperparameters (sane defaults)

These are good initial values for DistilBERT-style token classification; tune based on validation F1.

Setting         Starting point
max_length      128 (or 256 if values are long)
batch size      16 (GPU) / 4–8 (CPU)
lr              2e-5 to 5e-5
epochs          3–10 (small data → more epochs, but watch for overfitting)
warmup_ratio    0.06
weight_decay    0.01
grad_clip       1.0

Rule of thumb: if you have <5k labeled strings, continued pretraining + strong postprocessing often matters more than tiny LR tweaks.


Minimal code skeletons (continued pretraining → NER fine-tune)

A) Continued pretraining (MLM) on unlabeled value strings

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model_id = "distilbert-base-cased"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
mlm = AutoModelForMaskedLM.from_pretrained(model_id)

# Your extracted value strings (already parsed/unescaped); sample values shown:
texts = [
    "Amazon Would Like to Access the Camera",
    "AADHAAR-OTP will be auto read.",
    "Amazon Pay UPI",
]
ds = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=128)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt_ckpt",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    max_steps=2000,
    weight_decay=0.01,
    warmup_ratio=0.06,
    logging_steps=50,
    save_steps=500,
    report_to=[],
)

trainer = Trainer(model=mlm, args=args, train_dataset=ds_tok, data_collator=collator)
trainer.train()
trainer.save_model("dapt_ckpt")

B) Fine-tune token classification (PER/ORG/LOC) on your labeled set

Follow the Hugging Face token classification recipe (label alignment is the key part). (Hugging Face)

from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PER","I-PER","B-ORG","I-ORG","B-LOC","I-LOC"]
label2id = {l:i for i,l in enumerate(labels)}
id2label = {i:l for l,i in label2id.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "dapt_ckpt", num_labels=len(labels), id2label=id2label, label2id=label2id
)

collator = DataCollatorForTokenClassification(tok)

# ds must contain tokens/words + ner_tags, or span annotations converted to BIO tags.
# Use tok(..., is_split_into_words=True) + word_ids() alignment per HF docs.  (see citations)

args = TrainingArguments(
    output_dir="ner_ckpt",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.06,
    logging_steps=50,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=200,
    save_steps=200,
    report_to=[],
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=valid_ds, data_collator=collator)
trainer.train()

High-quality references (training + annotation + weak supervision)

  • Hugging Face Transformers: token classification / NER fine-tuning guide (label alignment, -100, subtokens). (Hugging Face)
  • Hugging Face LLM course: token classification + alignment explanation. (Hugging Face)
  • “Don’t Stop Pretraining” (DAPT/TAPT motivation and empirical gains). (arXiv)
  • Argilla token classification tutorial + example fine-tuning workflow in a domain (legal). (Argilla Docs)
  • Snorkel weak supervision overview + discussion around NER data creation. (Snorkel AI)
  • Tokenization-free robustness options (ByT5 / CANINE). (arXiv)