Unstructured textual postal address parsing to JSON

Hi all!

First post here. Not sure about anything so I’m just gonna ask giving my assumptions, since I’m quite new to this stuff (don’t even know how to phrase it, haha).

Anyways, I’ve been a software engineer for over 10 years, but this is the first time I think I might need an NLP system.

The problem:
We get unstructured, free-form addresses, which we try to parse with a variety of rules, but it’s just not accurate enough.

So the flow is this: the user inputs an address line, and in the backend we parse it into a standardised JSON structure.

I’ve seen that some people have used NLP for this, but we’d like to build it on our own.

Any pointers on how to approach this would be appreciated, thank you.

About the only option is Libpostal, which uses a CRF model trained on tens of millions of addresses. Nothing else compares. Senzing just released an improved model for it. There is a Python client library, but you have to build the underlying C library yourself before you can install it. Everyone and their uncle working with addresses uses it.

GitHub: openvenues/libpostal, a C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
New model announcement: Libpostal Data Model From Senzing on GitHub
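
To give a rough idea of how this maps onto the "address line in, JSON out" flow from the question, here is a minimal sketch. It assumes the Python bindings are installed (the `postal` package, which needs the libpostal C library built first) and uses `parse_address` from `postal.parser`, which returns a list of (value, label) pairs. The helper names `components_to_json` and `parse_to_json` are made up for illustration, and the sample pairs at the bottom are hand-written to show the shape, not real parser output.

```python
import json
from collections import defaultdict


def components_to_json(components):
    """Turn (value, label) pairs into a JSON object string.

    `components` is expected to look like what postal.parser.parse_address
    returns: a list of (value, label) tuples. If a label occurs more than
    once, its values are joined with a space (an assumption of this sketch;
    you may prefer to keep a list instead).
    """
    grouped = defaultdict(list)
    for value, label in components:
        grouped[label].append(value)
    record = {label: " ".join(values) for label, values in grouped.items()}
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def parse_to_json(address):
    """Parse a free-form address with libpostal and return JSON.

    Imported lazily so the rest of the module still loads on machines
    where libpostal has not been built yet.
    """
    from postal.parser import parse_address
    return components_to_json(parse_address(address))


if __name__ == "__main__":
    # Hand-written sample in the same shape as the parser output.
    sample = [
        ("781", "house_number"),
        ("franklin ave", "road"),
        ("brooklyn", "city"),
        ("ny", "state"),
        ("11216", "postcode"),
    ]
    print(components_to_json(sample))
```

Once libpostal is built, calling `parse_to_json("781 Franklin Ave Brooklyn NY 11216")` should give you a similar object. The label names come from the model, so you will probably want a small mapping layer from those labels to whatever field names your own JSON schema uses.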