Innerkore Technologies | Technology Consulting & Web Development

This is Part 2 of a series on building an open-source Indian address parser. Part 1 covered fine-tuning Qwen3-0.6B with LoRA and our first benchmark against Shiprocket's open-tinybert-indian-address-ner. This post covers the third and final model in the series, and what happened when we pointed it at the exact model that inspired it.

The itch Shiprocket's benchmark left behind

When we first benchmarked our Qwen3-0.6B model against Shiprocket's open-tinybert-indian-address-ner, the headline was good news wrapped in a caveat: we won on every one of the nine conceptually-shared fields, sometimes by a wide margin — but Shiprocket's model was doing it with a 6-layer, 768-hidden BERT variant that ran in 19 milliseconds per address on CPU. Ours took over four seconds. That's not a rounding error; that's a 240x gap.

It's easy to wave that away — "different tradeoff, different use case" — and mostly that's true. But it kept nagging at us. Shiprocket had clearly made a deliberate choice: trade some accuracy for a model small enough to run in a hot path, cheaply, at scale. We'd built the opposite thing. Neither choice is wrong, but we only had one point on that curve.

So the natural next step wasn't "make Qwen faster." It was: what if we made the same architectural bet Shiprocket did, but trained it on our own gold-labeled data? Same idea — small BERT encoder, BIO tagging instead of JSON generation — applied to our 13-field schema instead of theirs.

The model we picked is huawei-noah/TinyBERT_General_4L_312D: 4 layers, 312 hidden dimensions, about 14 million parameters. For comparison, that's roughly 5x smaller than our flan-t5-small model and 40x smaller than the Qwen3-0.6B LoRA setup. It fine-tunes in about two minutes on a laptop.

The catch: our data wasn't built for this

Here's the thing nobody tells you when you decide "let's just add a BERT-style token classifier": your training data has to actually support it, and if you've been building a generative-model pipeline for months, it probably doesn't — not in the shape you need.

Every model in this project so far had been trained the same way: given a raw address string, generate a JSON object mapping 13 field names to substrings of that address. The gold labels were always verbatim extractions — never paraphrased, never normalized. If the source text said "Kamrup Unclassified AS 781029", the district field was exactly "Kamrup", copied character-for-character, not corrected to "Kamrup Metropolitan" or expanded to "Assam".

Token classification wants something different: a BIO tag (B-district, I-district, O, ...) on every single token. We didn't have that. What we had was JSON.

The good news is that "verbatim extraction" and "convertible to BIO tags" are almost the same property. If a gold value is a real substring of the raw address, you can find its character span, then map that span onto whatever tokens your tokenizer produces. We measured how often that actually holds:

exact substring found: 25,910 / 25,915 gold field values (99.98%)

Nearly perfect. The handful of misses were genuine data artifacts — cases where two fields got glued together during earlier normalization ("PALASHBARI Kamrup" spanning a comma that got collapsed somewhere upstream). Not something a training script should paper over; we just skip labeling those and move on.
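
To make the span-to-tag step concrete, here is a minimal sketch of the conversion, not the project's actual script. It assumes a stand-in regex tokenizer (words and punctuation with character offsets) instead of the real subword tokenizer, and the names tokenize_with_offsets and to_bio are hypothetical. The pincode label in the example is illustrative. Note that it uses only the first occurrence of each value, which is the naive behavior the next section is about.

```python
import re

def tokenize_with_offsets(text):
    # Stand-in tokenizer (assumption): words and punctuation marks, each with
    # its (start, end) character offsets in the raw address.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def to_bio(text, gold):
    """Turn {field: verbatim substring} labels into one BIO tag per token."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for field, value in gold.items():
        start = text.find(value) if value else -1
        if start < 0:
            continue  # not a verbatim substring: skip labeling this field
        end = start + len(value)
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token is labeled only if it lies entirely inside the gold span.
            if tok_start >= start and tok_end <= end:
                tags[i] = ("B-" if first else "I-") + field
                first = False
    return [tok for tok, _, _ in tokens], tags

tokens, tags = to_bio(
    "Kamrup Unclassified AS 781029",
    {"district": "Kamrup", "pincode": "781029"},
)
print(list(zip(tokens, tags)))
```

With a real subword tokenizer the same containment check would run over the tokenizer's own offsets rather than these regex matches.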

The bug that almost shipped: duplicate values

The harder problem showed up once we started converting real examples. Consider an address where both city and district are gold-labeled as "Chandigarh" — genuinely happens, since Chandigarh is its own city and district. A naive raw_text.find("Chandigarh") always returns the first occurrence. Both fields collapse onto the same span. One of them silently loses its label.

We caught this by measuring, not by inspection — we ran a full round-trip test (gold → character spans → BIO tags → reconstructed fields) across the training set and it came back at 93.75% instead of the >99% we expected. Digging into the mismatches surfaced the collision pattern immediately: any two fields sharing a value, wherever the text happened to repeat that value, were fighting over the same characters.
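
A round-trip check of this kind can be sketched in a few lines. This is an illustration under assumptions rather than the test we actually ran: decode_bio and round_trip_accuracy are hypothetical names, tokens are rejoined with single spaces, and the example address is made up to show the collision.

```python
def decode_bio(tokens, tags):
    """Rebuild {field: text} from BIO tags, joining tokens with single spaces."""
    fields, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            fields[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            fields[current] += " " + token
        else:
            current = None  # "O" or a stray I- tag ends the current entity
    return fields

def round_trip_accuracy(examples):
    """Share of gold field values recovered unchanged from the tags."""
    total = matched = 0
    for tokens, tags, gold in examples:
        rebuilt = decode_bio(tokens, tags)
        for field, value in gold.items():
            total += 1
            matched += rebuilt.get(field) == value
    return matched / total if total else 0.0

# Both fields were mapped to the first "Chandigarh", so one label was lost.
examples = [(
    ["Sector", "17", "Chandigarh", "Chandigarh"],
    ["O", "O", "B-district", "O"],
    {"city": "Chandigarh", "district": "Chandigarh"},
)]
score = round_trip_accuracy(examples)
print(score)  # 0.5
```

Anything below 1.0 on data believed to be verbatim points at a conversion problem rather than a modeling one, which is what made the collision visible.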

The fix: track every occurrence of a value in the text, not just the first, and let the overlap-resolution logic (which already claims longer spans before shorter ones, so "village" doesn't get clobbered by a "locality" substring it happens to contain) pick a distinct occurrence for each field. That brought the round-trip ceiling up to 96.68% — which we now treat as this data's honest upper bound, not a bug to keep chasing. The residual gap is two well-understood cases: fields sharing a value with too few occurrences to give each one its own span, and tokens that straddle a span boundary on already-documented data artifacts (the same glued-substring issue that shows up elsewhere in this project's known limitations).
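
The occurrence-tracking idea can be sketched roughly like this. It is a simplified illustration, not the project's overlap-resolution code: all_occurrences and assign_spans are hypothetical names, the addresses are invented, and which of two equal-valued fields gets which occurrence is an arbitrary choice here.

```python
def all_occurrences(text, value):
    """Every (start, end) character span where value occurs verbatim in text."""
    spans, start = [], text.find(value)
    while start != -1:
        spans.append((start, start + len(value)))
        start = text.find(value, start + 1)
    return spans

def assign_spans(text, gold):
    """Give each field its own non-overlapping occurrence, longest values first."""
    claimed, assigned = [], {}
    items = [(field, value) for field, value in gold.items() if value]
    for field, value in sorted(items, key=lambda kv: -len(kv[1])):
        for start, end in all_occurrences(text, value):
            # Take the first occurrence that no other field has claimed yet.
            if all(end <= s or start >= e for s, e in claimed):
                claimed.append((start, end))
                assigned[field] = (start, end)
                break
    return assigned

text = "SCO 12, Sector 17, Chandigarh, Chandigarh 160017"
spans = assign_spans(text, {"city": "Chandigarh", "district": "Chandigarh"})
print(spans)
```

If a value appears fewer times than the number of fields sharing it, the leftover field simply gets no span in this sketch, which mirrors the first of the two residual cases described above.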

Training and the surprising result

With the BIO conversion pipeline verified, training itself was almost anticlimactic. Ten epochs, batch size 32, cosine learning rate schedule — done in 133 seconds on Apple Silicon. Eval loss dropped from 1.91 to 0.77 and plateaued cleanly around epoch 8.

Then evaluation came back and it was, frankly, better than expected:

Model	Params	Mean field accuracy
Qwen3-0.6B + LoRA	~596M	82.4%
flan-t5-small	~77M	80.6%
TinyBERT 4L/312D	~14M	78.8%

A model 40x smaller than our best one landed within four points of it. It's not free — subLocality and village recall are both effectively 0%, meaning the model just defaults to null on those far more than gold does, a real weakness worth being upfront about. But on the fields that matter most for downstream use (district, state, city, pincode), it's solidly in the same range as its much bigger siblings.

The comparison that actually mattered

All of this was interesting on its own, but it wasn't the real test. The real test was going back to the model that started this whole detour: Shiprocket's open-tinybert-indian-address-ner.

Same name. Same task family — BIO tagging on Indian addresses. This should be the closest thing to an apples-to-apples comparison in the whole project.

Except it wasn't, and we found that out the moment we checked the config instead of assuming:

shiprocket-ai/open-tinybert-indian-address-ner
  hidden_size: 768, num_hidden_layers: 6
  params: 66,382,103

Despite the "tinybert" name, Shiprocket's model is a 6-layer, 768-hidden BERT — closer in scale to BERT-base than to the original TinyBERT paper's smallest configuration. It has 4.7x more parameters than ours. We reported that up front rather than letting a same-name comparison imply a same-size one.

With that caveat stated plainly, we ran both models on the same 237-example held-out gold test set:

Field	Ours (4L/312D, ~14M)	Shiprocket (6L/768D, ~66.4M)
houseNumber	79.8%	27.1%
houseName	81.7%	72.1%
street	50.0%	27.0%
locality	36.5%	6.7%
city	82.6%	17.4%
state	84.2%	41.5%
pincode	99.2%	69.2%
poi	20.5%	10.3%
subLocality	0.0%	0.0%

We came out ahead on eight of the nine shared fields and tied on the ninth (subLocality, where both models score 0.0%). Not narrowly: on city it's 82.6% vs 17.4%; on houseNumber it's 79.8% vs 27.1%. And the accuracy didn't cost speed: ours ran at 11 ms/address vs 16 ms/address for Shiprocket's, in line with it being the smaller model.

We didn't take the win at face value either. When we inspected Shiprocket's raw, unaggregated per-token predictions, the pattern was clear: on longer administrative-suffix text, the model's tag predictions genuinely flip-flop mid-word, with confidence scores dropping to 0.3–0.5 exactly where that happens. For "Kamrup Unclassified", the token "Kam" gets tagged B-sub_locality at 0.45 confidence, and the very next token, "rup", gets tagged I-locality at 0.42 — genuinely uncertain, internally inconsistent output, not an artifact of how we ran the comparison.
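
One way to spot this kind of flip-flop mechanically is to scan the raw tag sequence for I- tags that do not continue the entity before them. The sketch below is an illustration with a hypothetical helper name, not the inspection script we used; the tokens and tags are the "Kamrup" example from above.

```python
def inconsistent_transitions(tokens, tags):
    """Return (index, token, tag) for each I- tag that does not continue the previous entity."""
    problems = []
    previous_field = None
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag == "O":
            previous_field = None
            continue
        prefix, field = tag.split("-", 1)
        if prefix == "I" and field != previous_field:
            problems.append((i, token, tag))
        previous_field = field
    return problems

flagged = inconsistent_transitions(["Kam", "rup"], ["B-sub_locality", "I-locality"])
print(flagged)  # [(1, 'rup', 'I-locality')]
```

A well-formed sequence such as B-locality followed by I-locality returns an empty list, so any hit is a place where the per-token predictions disagree with each other.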

Our read on why the gap is this large: fine-tuning on task-specific gold data seems to matter more here than raw parameter count. Shiprocket's model is bigger, but it wasn't fine-tuned on this exact 13-field taxonomy and this exact address distribution. Ours was — on the same 4,110 verbatim-extraction examples that trained the Qwen3 and flan-t5-small models before it.

Where this leaves the project

Three models now sit behind one interface:

from indian_address_parser import AddressParser

parser = AddressParser()                  # tinybert — the default now
parser = AddressParser(backend="t5")       # a couple points more accurate, slower
parser = AddressParser(backend="qwen")     # the most accurate, and the heaviest

TinyBERT became the default in v0.3.0 as a deliberate choice, not by omission. It's the cheapest model in the series to download and run (a single forward pass instead of autoregressive generation, no adapter/base-model split to manage), and it gives up only a few points of accuracy to do it. For most people integrating this into a pipeline, that's the right trade to start from; the other two backends are one keyword argument away when the accuracy matters more than the footprint.

All three models, the benchmark scripts, and the full per-field breakdowns are public:

Code & benchmarks: github.com/innerkorehq/indian-address-parser
TinyBERT model: huggingface.co/gagan1985/tinybert-4l-312d-indian-address-parser
PyPI: pip install indian-address-parser

If you want to reproduce the Shiprocket comparison yourself, it's a five-minute run:

pip install indian-address-parser transformers torch
git clone https://github.com/innerkorehq/indian-address-parser
cd indian-address-parser/benchmarks
python compare_tinybert.py --out results.json