A short note on how we might go about implementing interpolation in Pelias

To know everything about everywhere is a pretty substantial goal. Finding every address in the United States would require surveying a point for every address, then re-surveying every time a new address appears.

When someone searches for a place we don't have a specific point for, the best we can do is return another address on the same street (if we have a match there); there's not much else we can or should do.

Rooftop points are ideal for data quality, but far from ideal from a data management standpoint.

Interpolation is designed to fill this gap. By building on top of the road networks, which are far more available, interpolation allows forward and reverse geocoding without the need to have rooftop points for all possible places. It also ensures that new construction within the existing road networks can be found without needing to add additional points.

Importantly, interpolation can also play very well with point data, taking point data everywhere it can get it, and falling back to interpolation where it has to.

Here's a roadmap to road interpolation that fits into the Pelias architecture we have today while tying in nicely with the vision we have for its future.

Goals

  • Create an architecture for interpolation that MUST play well with authoritative point data from OpenAddresses and OpenStreetMap
  • Create an architecture for interpolation that SHOULD be able to work with authoritative interpolation data sets from governments but could otherwise interpolate from OpenStreetMap
  • MUST be capable of supporting forward and reverse geocoding in our existing architecture as well as autocomplete.
  • MUST work within the constraints of our existing architecture. No new databases. Use similar architecture to our existing importers.

Building It

Step 0: Research

There's not much we can do to implement interpolation without understanding the data that makes it possible.

There are several datasets that could work well for national interpolation.

  • Canada Road Network - available annually from Statistics Canada / Elections Canada. Major updates every 5 years (there should be one for 2016). Note: needs a high-precision transformation from NAD83 to WGS84. There's a version in English and a separate version in French. Rights restrictions still need to be researched.
  • US Road Network - TIGER/Line US road network from the US Census Bureau - updated annually. No copyright restrictions. Used to add missing streets to OSM on an annual basis. The Census Bureau has even provided a handy guide to geocoding with TIGER.

There are probably others.

General Approach

Interpolation works from address ranges. The above data files chop each street up into several segments, each of which has two address ranges: one for the left side of the street and one for the right. These tend to be odd/even divisions (though that's not always the case).

We'll adjust our Elasticsearch schema slightly to accommodate these address ranges and the even/odd flag.

We'll also store the multiple hierarchies for every place the street segment falls within, along with an unindexed polyline of its geometry.

```
              *
           BAR AVE
STREET:BAR AV |   STREET: BAR AV
  FROMHN:121  |   FROMHN:120
    TOHN:139  |   TOHN: 138
  ZIP5:11111  |   ZIP5: 11112
              |
STREET:FOO RD |   STREET:FOO RD
  FROMHN:2    |   FROMHN: 10
   TOHN: 8    |   TOHN: 18
  ZIP5: 11111 |   ZIP5: 11112
*_____________*________________FOO ROAD____*
STREET:FOO RD |   STREET: FOO RD
  FROMHN:1    |   FROMHN:11
    TOHN:9    |   TOHN: 21
  ZIP5:11111  |   ZIP5: 11112
              |
STREET:BAR AV |   STREET: BAR AV
  FROMHN:121  |   FROMHN:120
    TOHN:139  |   TOHN: 138
  ZIP5:11111  |   ZIP5: 11112
              |
           BAR AVE
              *
```
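
A minimal sketch of what that schema addition might look like. Field names like `range_left` and `polyline` are assumptions for illustration, not the actual Pelias schema:

```ts
// Illustrative Elasticsearch (1.x/2.x-era) mapping fragment for one street
// segment document. Field names here are assumptions, not the final schema.
const interpolationSegmentMapping = {
  properties: {
    street: { type: 'string' },
    // separate ranges for each side of the street
    range_left: {
      properties: {
        from:   { type: 'integer' },
        to:     { type: 'integer' },
        parity: { type: 'string', index: 'not_analyzed' }, // 'odd' | 'even'
        zip5:   { type: 'string', index: 'not_analyzed' }
      }
    },
    range_right: {
      properties: {
        from:   { type: 'integer' },
        to:     { type: 'integer' },
        parity: { type: 'string', index: 'not_analyzed' },
        zip5:   { type: 'string', index: 'not_analyzed' }
      }
    },
    // admin hierarchy fields would mirror the existing schema; the segment
    // geometry is stored as an encoded polyline but not searched against
    polyline: { type: 'string', index: 'no' }
  }
};
```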

Then, when someone searches for an address, they'll likely get back a record for the range block that contains it. If there's also an exact match for the address, great! We give that to the user. Otherwise we take the ranged record and, in the API, slice the street up by the number of addresses on the matching side (left or right) and return the approximate position. We can also be kind and put the point on the side of the street where the matching house number falls, and use the house number from the query to generate the house number in the label.
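
A rough sketch of that slicing step, using the numbers from the diagram above (the function and its inputs are purely illustrative):

```ts
// Given a parsed house number and one side's address range, estimate how far
// along the street segment the address sits. Illustrative only.
function interpolateFraction(houseNumber: number, from: number, to: number): number {
  if (to === from) { return 0.5; }
  const t = (houseNumber - from) / (to - from);
  // clamp so slightly out-of-range numbers still land on the segment
  return Math.min(Math.max(t, 0), 1);
}

// e.g. searching for "15 Foo Rd" against the range 11-21 from the diagram:
// (15 - 11) / (21 - 11) = 0.4, so the point goes ~40% of the way along that
// segment, offset slightly toward the matching side of the street.
```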

For reverse geocoding, we'd use the record that's there for the street segment and just return the address range as the label (but only if there's no good matching point data). And because the geohash for the polyline would be so much smaller than other geometries, it'd still be cheap to store and return.

Not a bad way to add ~400M addresses and complete coverage for all of North America.

Then, once we have this system working for the US and Canada, we'd have the project of generating address ranges from OSM ourselves, which we could pull together from any addresses tied to a given street name. It'd require more complex pre-processing than we do for any of the existing importers, but once the output is in the same format as the Canada / TIGER data, it could work worldwide.

Steps 1+2: Schema Modifications and Importers

We'll have to make adjustments to the schema to support the new data model. These are best covered by looking at what we'll have to import.

Importers

The importers for interpolation datasets should ...

Importantly, because these government sources only change once per year, we can streamline the import process by running the processing phase just once per vintage, then bulk-importing the processed data into Elasticsearch every time after that (as we've done with the Quattroshapes import).

USA: TIGER/Line

TIGER data is somewhat complex to work with, but it is well suited to our import process.

The best way to pick it up is the Address Feature data from the Census Bureau's FTP server. There are 3220 zipped shapefiles we'll have to work with, but they cover all 50 US states plus the territories (Guam, Puerto Rico, the US Virgin Islands, and others).

There, for both the left and right sides of the road (left/right being determined by the direction from the starting node to the ending node of a street segment), we have:

  • Address ranges (even or odd)

  • Zip 5

  • Zip 9

    All of these attributes can differ between the left and right sides of the road, so we should adjust the data model to accommodate separate values for each side.

Often there will be different zip codes for different sides of a street (e.g. when a street is the dividing line between two towns).

There's also a non-trivial number of addresses with lettered house numbers that follow the same interpolation rules: if the interpolation range starts with a letter, keep that letter throughout the range (e.g. G099-G109 would include G099, G101, G103, G105, G107, G109).
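
A hedged sketch of how those letter-prefixed ranges might be expanded, assuming a single leading letter as in the example above (real TIGER data may need more cases):

```ts
// Expand a letter-prefixed range like "G099"-"G109" into its member house
// numbers, keeping the prefix, zero-padding, and the parity of the range.
function expandPrefixedRange(from: string, to: string): string[] {
  const prefix = /^[A-Z]/.test(from) ? from[0] : '';
  const start = parseInt(from.slice(prefix.length), 10);
  const end = parseInt(to.slice(prefix.length), 10);
  const width = from.length - prefix.length; // preserve zero-padding, e.g. "099"
  const out: string[] = [];
  for (let n = start; n <= end; n += 2) { // same parity as the start of the range
    out.push(prefix + String(n).padStart(width, '0'));
  }
  return out;
}

// expandPrefixedRange('G099', 'G109')
// => ['G099', 'G101', 'G103', 'G105', 'G107', 'G109']
```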

Canada:

The Canadian data is in many ways easier to work with. It comes in a single Shapefile.

There are, however, a few considerations, chiefly around Canada's two official languages. The data comes in two flavors: English and French. We'll likely have to download both, then incorporate the English and French names for each street as separate languages in each record (once we begin to deal with alt-names and multiple languages). Additionally, there are address expansions particular to Canada that we don't currently take into account. These are explained in detail in the Data Dictionary (found in the 2011 vintage of the data).
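
A tiny sketch of how that merge might look, assuming the two editions share a segment identifier we can join on (the field names below are made up for illustration):

```ts
// Merge the English and French editions of the same road segment into one
// record carrying both names. `street` and the join key are assumptions
// about the data, not confirmed field names.
function mergeBilingualSegment(en: { street: string }, fr: { street: string }) {
  return {
    ...en,
    name: {
      default: en.street,
      en: en.street,
      fr: fr.street
    }
  };
}
```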

Step 3: API Changes

Analyzer

Honestly, let's just get LibPostal working here.

Business Logic; Post-Scoring (inclusion)

Once results come back, we'll have to do post-processing on interpolated ranges to make them usable.

We'll figure out which side of the street the matching house number is on for an interpolated record. Then we'll divide the length of the street segment by the number of addresses on that side (with some padding at the ends of the segment) and place the point along the line segment at its approximate location.

We'll generate the label on the fly: the street name comes from the authoritative record, while the house number comes dynamically from the parsed output of the query. We'll also want to flag in the response that it's an interpolated result.
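
A small sketch of that label step (the helper and field names are hypothetical):

```ts
// Build a display label for an interpolated hit: the house number comes from
// the parsed query, everything else from the authoritative segment record.
function interpolatedLabel(
  queryHouseNumber: number,
  segment: { street: string; locality: string; region: string; zip5: string }
): string {
  return `${queryHouseNumber} ${segment.street}, ${segment.locality}, ${segment.region}, ${segment.zip5}`;
}

// The response document would also carry something like `interpolated: true`,
// so consumers know the point is an estimate rather than a rooftop match.
```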

No matter what, when interpolated results come back, we'll have to make a judgement about whether to display them:

  • If there's an exact match for a point, return the point. Discard the interpolated result.
  • If there's no exact match for a point, but there's a matching street and range (or a sufficient street match, if it's an autocomplete result), we use the interpolated result.


Making it work for reverse geocoding

Provided we index the linestrings, this approach should work out of the box with reverse search. Mostly. We only include these results when an address layer is selected (all layers or an explicit address layer).

When a user inputs coordinates near an interpolated linestring, we can generate a label for the range like "10-28 Main Street, Anytown, USA, 12345". Ideally the point will be to one side of the linestring, which lets us determine whether to respond with the left range or the right range. For the coordinate position, we snap the point to the closest point on the linestring.

e.g.

```
       user
        *
        |
--------*-------
      here!
```

This way, the point returned will match on the street, simplifying things for any further services (like routing) that might use it.
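
A sketch of the side-of-street test behind the left/right range choice, using plain 2D math on lon/lat and ignoring projection subtleties:

```ts
// Decide whether the query point falls to the left or right of a segment
// (A -> B), using the sign of the 2D cross product. Left/right here follows
// the same start-node -> end-node convention as the TIGER address ranges.
type Pt = { lon: number; lat: number };

function sideOfSegment(a: Pt, b: Pt, p: Pt): 'left' | 'right' {
  const cross = (b.lon - a.lon) * (p.lat - a.lat) - (b.lat - a.lat) * (p.lon - a.lon);
  return cross > 0 ? 'left' : 'right';
}

// The returned coordinate would then be the nearest point on the segment
// itself, so downstream services (like routing) snap cleanly to the road.
```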

The Future: Making it work for OSM

OSM is going to be a BEAST to do interpolation with. But that doesn't mean we can't, or shouldn't, try.

With OSM data we'll need to derive the interpolation ranges ourselves. It'll only work on road segments where there are already at least two addresses on a particular side of the street, and we'll have to search through the OSM planet file to pre-process the places where we know enough to interpolate from. It'll also change faster than the TIGER/Canada data.
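
A rough sketch of the range derivation, assuming addr:housenumber points have already been matched to one side of a street segment (that matching step is the hard part and isn't shown here):

```ts
// Given house numbers already matched to one side of a street segment, derive
// a synthetic from/to range in the same shape as the TIGER/Canada data.
// Needs at least two numeric house numbers to be worth interpolating.
function deriveRange(houseNumbers: string[]): { from: number; to: number } | null {
  const nums = houseNumbers.map(n => parseInt(n, 10)).filter(n => !isNaN(n));
  if (nums.length < 2) { return null; } // not enough evidence to interpolate
  return { from: Math.min(...nums), to: Math.max(...nums) };
}
```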

But it would also change expectations for OSM users. So far as I know, nobody has built native interpolation into an OSM geocoder from OSM's own data. That could make it tremendously easier for folks to map house numbers in regions that can't currently be geocoded, demonstrating the unreasonable effectiveness of a small amount of open data, and incentivizing more people to collect this kind of data.
