diff --git a/docs/conflation.md b/docs/conflation.md new file mode 100644 index 00000000..eefb29e1 --- /dev/null +++ b/docs/conflation.md @@ -0,0 +1,143 @@ +## Conflation + +Pelias imports address data from many different sources; some contain line geometry while others contain address points or interpolation ranges. + +None of our source data sets currently contain street<>address concordances, grouping data is necessary in order to establish the data refers to the same entity/entities. + +It's important to be able to combine these data in order to: + +- increase our street address coverage +- reduce 'holes' in the data which can cause a loss of precision +- associate point data to road network data +- deduplicate results + +This document outlines the different types of data we import and suggests some grouping techniques we can use to associate them. + +### Conflation for search + +Conflation for search is a similar process as for routing and display, the only major difference is that we aim to reduce the amount of duplicate street names to a minimum. + +Returning a list of osm ways is not acceptable for search as it will result in an experience such as: + +![conflation-problem](http://missinglink.embed.s3.amazonaws.com/search-conflation-problem.png) + +A better user experience would be to provide 'address ranges' of the street: + +![conflation-problem2](http://missinglink.embed.s3.amazonaws.com/search-conflation-problem2.png) + +An ideal experience would be to provide the exact street address: + +![conflation-problem3](http://missinglink.embed.s3.amazonaws.com/search-conflation-problem3.png) + +### Grouping Openstreetmap entities + +Conflation of openstreetmap entities is a well studied domain; both the routing team and the vector tiles team members have extensive experience in this area. 
+ +#### Single line + +The most basic streets in Openstreetmap are an ordered collection of 'nodes' grouped together in a 'way'; the street is given a `name` tag and a `highway:*` tag: + +| geometry | tags | +|:-:|:-:| +| ![osm-street-simple](http://missinglink.embed.s3.amazonaws.com/osm-street-simple.png) | ![osm-street-simple-tags](http://missinglink.embed.s3.amazonaws.com/osm-street-simple-tags.png) | + +These entities are relatively simple to import as they do not contain multiple line segments. Some care is needed when computing the centroid value, which should lie on the line string rather than in the center of the bounding box. + +#### Multiple lines + +More complex roads must be split in to 2 or more 'ways'. In this example the road is split in to 3 different 'ways': two road segments with a roundabout in the middle: + +| geometry | tags | +|:-:|:-:| +|![osm-street-multiple-1](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-1.png) | ![osm-street-multiple-1-tags](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-1-tags.png) | +|![osm-street-multiple-2](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-2.png) | ![osm-street-multiple-2-tags](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-2-tags.png) | +|![osm-street-multiple-3](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-3.png) | ![osm-street-multiple-3-tags](http://missinglink.embed.s3.amazonaws.com/osm-street-multiple-3-tags.png) | + +These entities are usually related by common 'nodes', typically at the extremes of the constituent line segments. + +In some cases the line string may be 'broken': it may not share common nodes due to an obstacle such as an intersection or another feature. + +A spatial search can be performed to find other nodes in close spatial proximity which belong to a way with the same name. + +Some caveats to avoid are: + +1. 
some cities have two or more roads with the same name; joining them would be incorrect. +2. there is potential for different spelling between ways; a street name normalization function should be used to determine if the names 'match' or not. + +Again, some care must be taken when computing a centroid value for the network. There is potential for the joined road network to be very long; it may span several different neighborhoods (or even states!). + +There will be cases where a single centroid value doesn't make sense. It might be best to split these entities in to 2 different road networks in order to provide a more intuitive search to the user. + +#### Disjoined lines + +The most complex case is when two parts of a road are broken up by large spatial gaps. This is fairly common in large cities where building development has divided existing streets. + +A good example is Golden Gate Park in San Francisco, where you can see that all the north/south avenues are completely disjoined by the park: + +![golden-gate](http://missinglink.embed.s3.amazonaws.com/golden-gate-park.png) + +New York City has adopted a convention for these disjoined streets, usually prefixing their names with either 'East' or 'West' in order to disambiguate the two sides of the park. + +Parks are only the 'tip of the iceberg' regarding disjoined road networks. There are very long and complex networks such as major highways to consider, as well as cases where roads change names and then change back again. + +It's also common for road networks to be disjoined multiple times, such as in [this example](https://gist.github.com/missinglink/564835c5465bf83dac9056d77da9c529). + +There is much more complexity to this problem than covered here. Again, great care must be taken when computing a centroid value for these networks as: + +1. a bounding box centroid would result in the centroid being inside the obstacle rather than on the road network. +2. 
the road network can be very large, spanning multiple cities/states or countries! + +![oresund-bridge](http://missinglink.embed.s3.amazonaws.com/oresund-bridge.png) + +#### Irregular geometries + +The world is a weird and wonderful place. It's best not to assume anything about how road networks are constructed; there will always be unusual geometries to be found, such as this: + +![earls-court-sq](http://missinglink.embed.s3.amazonaws.com/earls-court-square.png) + +It's best not to dwell on these unusual geometries as they are the exception rather than the rule. + +### Linking point data to the road network + +None of the address point data we source contains road network concordances, not even Openstreetmap! + +In the image above you can see that the house numbers are not positioned exactly on the road network itself. It's very uncommon to find a house which sits exactly on the street; there is usually a sidewalk or driveway which offsets it from the road network. + +#### Openstreetmap nodes + +For this reason, the OSM entity tagged with `addr:housenumber` rarely shares a common node with the road network; moreover the building way rarely shares a common node with the road network: + +| geometry | tags | +|:-:|:-:| +|![earls-court-51-node](http://missinglink.embed.s3.amazonaws.com/earls-court-51-node.png) | ![earls-court-51-node](http://missinglink.embed.s3.amazonaws.com/earls-court-51-node-tags.png) | +|![earls-court-51-way](http://missinglink.embed.s3.amazonaws.com/earls-court-51-way.png) | ![earls-court-51-way](http://missinglink.embed.s3.amazonaws.com/earls-court-51-way-tags.png) | + +In order to group these street numbers with the road network to which they belong, we must use the same technique as discussed above: + +Using a combination of spatial distance and linguistic similarity we can, with a degree of confidence, establish which road segment the street number belongs to. 
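A rough sketch of combining spatial distance with name similarity is shown here. It is illustrative Python rather than Pelias code: the function names, the tiny `ABBREVIATIONS` table, the `streets` record layout and the 100 m default threshold are all assumptions made for the example, and distance to the nearest vertex is used as a coarse stand-in for distance to the line itself.

```python
import math
import re

# Hypothetical abbreviation table; a real normalizer would be far more complete.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "sq": "square"}


def normalize_street_name(name):
    """Lowercase, strip punctuation and expand a few common abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def match_address_to_street(point, street_name, streets, max_distance_m=100):
    """Return the id of the closest street whose normalized name matches, or None."""
    wanted = normalize_street_name(street_name)
    best_id, best_dist = None, max_distance_m
    for street in streets:
        if normalize_street_name(street["name"]) != wanted:
            continue
        # Nearest-vertex distance: a coarse approximation of distance to the line.
        dist = min(haversine_m(point, vertex) for vertex in street["coords"])
        if dist <= best_dist:
            best_id, best_dist = street["id"], dist
    return best_id
```

The distance threshold is what keeps two same-named roads in different cities from being joined; a production matcher would likely also tolerate spelling errors instead of requiring an exact normalized match.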
+ +#### Openstreetmap ways + +Similar to the above; except in this case the `addr:housenumber` tag has been applied to the way itself rather than a node: + +| geometry | tags | +|:-:|:-:| +|![192-finborough-rd](http://missinglink.embed.s3.amazonaws.com/192-finborough-rd.png) | ![192-finborough-rd](http://missinglink.embed.s3.amazonaws.com/192-finborough-rd-tags.png) | + +#### Openstreetmap interpolation lines + +Openstreetmap contains invisible 'interpolation lines'; these ways group a range of addresses with a guide path which shows where the missing house numbers should lie: + +| geometry | tags | +|:-:|:-:| +|![wetherby-mansions-way](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-way.png) | ![wetherby-mansions-way](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-way-tags.png) | +|![wetherby-mansions-node2](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-node2.png) | ![wetherby-mansions-node2](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-node2-tags.png) | +|![wetherby-mansions-node1](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-node1.png) | ![wetherby-mansions-node1](http://missinglink.embed.s3.amazonaws.com/wetherby-mansions-node1-tags.png) | + +These lines can likely be processed after the nodes, if the nodes have already been associated to a road segment then that information can simply be copied to the interpolated address points. + +#### Openaddresses and other point-only address datasets: + +Other datasets which only contain point data can use the same process to create concordances between their house numbers and the road network: + +![os-extract.png](http://missinglink.embed.s3.amazonaws.com/oa-extract.png) diff --git a/docs/design-doc.md b/docs/design-doc.md new file mode 100644 index 00000000..84c923b3 --- /dev/null +++ b/docs/design-doc.md @@ -0,0 +1,118 @@ + +## Interpolation + +This document outlines a proposal for refactoring how street addresses are stored and retrieved in Pelias. 
+ +> See also: [An introduction to addresses in Pelias](introduction.md) + +The strategic goals of the work are: + +- Ensuring every street in Openstreetmap is indexed and retrievable. +- Supporting address ranges as provided by Openstreetmap, TIGER, et al. +- Combining and de-duplicating distinct address point data sets. +- Designing the system to scale beyond 1B address points. +- Allowing room for future extension / improvements. + +These changes will allow for the following user experience improvements: + +- Provide house number interpolation where address range data exists. +- Fall back to providing a street centroid in lieu of a satisfactory house number. +- Reduce noise by only showing a maximum of one result per street. + +### Source data + +The work requires a conflated `road network` dataset and one or more `house number` datasets in order to function. Additional house number data sets will improve the coverage and accuracy of the system. + +> See also: [The problem of conflation outlined in more detail](conflation.md) + +### Road network + +It is essential to have a pre-processed and conflated road network in order to: + +- Reduce / avoid duplicate street names in results. +- Provide line strings which can act as interpolation guides. +- Ensure that interpolated points do not lie in a driving hazard. + +Three strategies for conflating the OSM road network were considered: + +- Create a new system to conflate the OSM ways *- too time consuming and error prone*. +- Extract data from vector tiles *- not appropriate due to file size optimizations and entity merging*. +- Utilize the routing graph *- similar domain, not concerned with any entities except roads*. + +#### Exporting the Valhalla routing graph + +![generic graph image](http://i.stack.imgur.com/JrBdQ.png) + +Valhalla doesn't store OSM ways; it breaks up the source data in to a graph of 'edges'. 
Each edge is marked up with [meta data such as this](https://gist.github.com/missinglink/b2ac67f51d132b591868a9ef60061c43). + +By [iterating over all the tiles](https://github.com/valhalla/tools/issues/60) we can walk the graph and join adjacent road segments with the same name. + +The process would take around 2 hours and would produce a dump file containing: + +- a single continuous line string representing the geometry of the road +- the street name +- (optionally) meta data about the street such as direction +- (optionally) a centroid value for the street + +**note:** the algorithm will favour the longest contiguous path, in the case of disjoined streets and geometries that cannot be represented using a single line; a second line will be produced with the same street name. + +Future work can be planned in v2 to: + +- Improve the name matching algorithm. +- Reduce the number of duplicate line segments produced. +- Break line strings on geographic / political boundaries. + +### Point data + +![generic house number image](http://wiki.openstreetmap.org/w/images/f/f2/Housenumber_example_kms_2.png) + +Point data from Openstreetmap and OpenAddresses will need to be associated to the correct segment of the road network (a single entry from the dump above). + +Given only a lat/lon pair and a street name, the system must be able to quickly find the appropriate road network segment and retrieve a unique ID representing it. + +As above, the quality of the street normalization and spelling error detection will affect the quality of the matching algorithm. + +The house number point should then be [projected on the line string](http://stackoverflow.com/questions/10301001/perpendicular-on-a-line-segment-from-a-given-point), this will give us a new point which is guaranteed to lie on the line string. + +The projected point data is saved along with the original position, one will be used for exact matches while the other will provide interpolation data. 
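The projection step described above can be sketched as follows. This is a minimal illustration in Python, not Pelias code: the function names are hypothetical, and it assumes the coordinates have already been converted to a planar projection so that plain Euclidean geometry is a reasonable approximation.

```python
def project_onto_segment(p, a, b):
    """Closest point to p on the segment a-b, all given as (x, y) tuples."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return a  # degenerate segment: both ends coincide
    # Fraction along the segment of the perpendicular foot, clamped to the ends.
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def project_onto_linestring(p, line):
    """Closest point to p on a line string given as a list of (x, y) vertices."""
    best, best_dist_sq = None, float("inf")
    for a, b in zip(line, line[1:]):
        q = project_onto_segment(p, a, b)
        dist_sq = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
        if dist_sq < best_dist_sq:
            best, best_dist_sq = q, dist_sq
    return best
```

Clamping `t` to the range 0..1 is what guarantees the result lies on the line string itself and not on the infinite extension of one of its segments.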
+ +### Range data + +![range](http://missinglink.embed.s3.amazonaws.com/osm-interpolation-tag.png) + +Range data from Openstreetmap and TIGER will also need to be associated with the correct segment of the road network. + +> See also: [Information about existing interpolation range standards](existing-standards.md) + +There is a performance vs. index size tradeoff that can be made here: the range data can be 'expanded' either at index time or at query time. It seems that query time expansion of ranges would be preferable as it keeps the index smaller and allows behavioral modifications without a full reindex. + +Judging by the tag statistics in the link above we will get much more value from supporting `addr:interpolation:*` tags than TIGER `from_address_right` etc tags. I would recommend not supporting TIGER tags in the v1 work. + +The OSM interpolation tags simply join two of the points mentioned above, so importing these ranges is very easy: simply associate them to the same road network as the child points; no further projection need be performed at this time. + +### Importing in to Pelias + +Pelias will require an import of one document per line string in the road network; this will be in the tens of millions. + +This data should be imported in to a new layer named `street`, which distinguishes it from `address` and `venue` data. + +Each record will require a centroid; if one is not provided in the source data then it will need to be computed. + +The line strings should not be stored in Elasticsearch as they have the potential to be very large (tens of GB). + +### Query logic + +``` +Can we determine candidate street(s) based on the input text? +[no] Fail + +Did the user provide a house number in the query? +[no] Return the street centroid + +Do we have point data for the requested house number? +[yes] Return the exact position + +Do we have an address range encompassing the requested house number? 
+[yes] Return the interpolated position +[no] Return the street centroid +``` diff --git a/docs/existing-standards.md b/docs/existing-standards.md new file mode 100644 index 00000000..c67cd711 --- /dev/null +++ b/docs/existing-standards.md @@ -0,0 +1,150 @@ +## existing standards + +### TIGER + +![tiger](https://www.census.gov/main/www/img/smalltiger.gif) + +The US Census Bureau has produced an export of street centerline coverage in the United States, Puerto Rico and the Island Areas since 1989; new versions of the data are released periodically on [their website](https://www.census.gov/geo/maps-data/data/tiger-line.html). + +The TIGER/Line files contain a schema for describing address ranges: + +> 5.12.4 Address Ranges + +> Linear address range features and attributes are available in the following layer: +'Address Range Feature County-based Shapefile' + +> The address range feature shapefile contains the geospatial edge geometry and attributes of all +unsuppressed address ranges for a county or county equivalent area. The term "address range" +refers to the collection of all possible structure house numbers between the first structure house +number to the last structure house number of a specified parity along an edge side relative to the +direction in which the edge is coded. All of the TIGER/Line address range files contain potential +address ranges, not individual addresses. Potential ranges include the full range of possible +structure numbers even though the actual structures may not exist. Single-address address ranges +are suppressed to maintain the confidentiality of the addresses they describe. 
+ +The most relevant properties include (but may not be limited to): + +|property|description| +|---|---| +|LFROMHN|From House Number associated with the address range on the left side of the edge; SIDE=L| +|LTOHN|To House Number associated with the address range on the left side of the edge; SIDE=L| +|RFROMHN|From House Number associated with the address range on the right side of the edge; SIDE=R| +|RTOHN|To House Number associated with the address range on the right side of the edge; SIDE=R| +|PARITYL|Left side Address Range Parity| +|PARITYR|Right side Address Range Parity| + +For reference, this is how they define terms in their glossary: + +> 4.1 Edge + +> A linear object (topological primitive) that extends from a designated start node (From node) and continues to an end node (To node). An edge’s geometry can be described by the coordinates of its two nodes, plus possible additional coordinates that are ordered and serve as vertices (or "shape" points) between these nodes. The order of the nodes determines the From-To orientation and left/right sides of the edge. Each edge is uniquely identified by a TLID. A TLID is defined as a permanent edge identifier that never changes. If the edge is split, merged or deleted its TLID is retired. + +> 4.10 Parity + +> Parity is an attribute field in the addrfeat.shp used to indicate whether address house numbers +within an address range are Odd (O), Even (E), or Both (B) (both odd and even). + +note: for a full list of properties see page 79 of the [TIGER technical documentation](https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2013/TGRSHP2013_TechDoc.pdf). + +--- + +### Openstreetmap + +![osm](http://wiki.openstreetmap.org/w/images/thumb/7/7e/Logo_by_hind_128x128.png/120px-Logo_by_hind_128x128.png) + +A failed import of TIGER@2005 was attempted during 2005/2006; it was aborted and the data purged due to [data integrity problems](http://wiki.openstreetmap.org/wiki/Old_TIGER_Import_2005/2006). 
+ +The first [successful import](http://wiki.openstreetmap.org/wiki/TIGER_2005) was made in 2007/2008 using some new Ruby scripts. There is still some confusion around whether the importer, Dave Hansen, imported TIGER@2005 or TIGER@2006. + +The TIGER import still has a [long list of issues](http://wiki.openstreetmap.org/wiki/TIGER_fixup) and the wiki clearly says "It is unlikely that the TIGER data ever will be imported again." + +The TIGER import laid the foundations for the US road network in OSM and, due to its lineage, inherited some of the TIGER/Line attributes mentioned above. A discussion of how those properties were mapped from one dataset to another can be [found in the OSM wiki](http://wiki.openstreetmap.org/wiki/TIGER_to_OSM_Attribute_Map). + +=== + +#### tiger:* + +For a full list of tiger tags in use see: [taginfo](http://taginfo.openstreetmap.org/search?q=tiger%3A). The most common metadata tags are: + +|total|tag|description| +|---|---|---| +|~13M|tiger:cfcc|[Census Feature Classification Code](http://wiki.openstreetmap.org/wiki/TIGER_to_OSM_Attribute_Map#TIGER_CFCC_to_OSM_Attribute_Pair) (CFCC). Replaced by MTFCC in 2007| +|~13M|tiger:county|Name of County followed by the abbreviated state name| +|~12M|tiger:reviewed|`no` was set on all ways in the TIGER import. It is intended as a way of tracking TIGER fixup progress. 
However it does not succeed in this aim and is largely pointless| +|~6M|tiger:tlid|TIGER permanent edge ID| + +Additionally there are tags which provide a more granular representation of street names: + +|OSM Key|TIGER Field|Example| +|---|---|---| +|tiger:name|"#{fedirp} #{fename} #{fetype} #{fedirs}".strip|"NW Chester St S"| +|tiger:name_direction_prefix|fedirp|"NW", "Southwest"| +|tiger:name_base|fename|"Chester"| +|tiger:name_type|fetype|"Street", "Ave"| +|tiger:name_direction_suffix|fedirs|"S"| + +And tags which represent address ranges: + +|OSM Key|TIGER Field| +|---|---| +|from_address_right|fraddr| +|to_address_right|toaddr| +|from_address_left|fraddl| +|to_address_left|toaddl| +|zip_left|zipl| +|zip_right|zipr| + +For each additional address range: + +|OSM Key|TIGER Field| +|---|---| +|from_address_right_1|fraddr| +|to_address_right_1|toaddr| +|from_address_left_1|fraddl| +|to_address_left_1|toaddl| +|zip_left_1|zipl| +|zip_right_1|zipr| + +et cetera + +**note:** there are only ~1600 occurrences of `from_address_right` and `from_address_left` in OSM. see [taginfo](http://taginfo.openstreetmap.org/search?q=from_address_left). + +=== + +#### addr:interpolation:* + +This is a more global/general approach towards defining house number ranges in OSM. + +[The wiki](http://wiki.openstreetmap.org/wiki/Addresses#Using_interpolation) states that the tag value should be a numeric offset which indicates "the value to increment between house numbers" (so generally `2` in western countries which follow the 'zigzag' schema). + +There are also special cases where the tag value is either the string `odd` or `even` (there is also `all` but it is not clear what this means; presumably both). + +In this case `odd` is equivalent to having an offset `2` but also implies that the range starts at an odd number. The inverse is true for `even`. 
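To make the semantics above concrete, here is a small Python sketch that expands a range into the house numbers it implies. It is an illustration only: `expand_interpolation` is a hypothetical name, treating `all` as "every number" follows the presumption stated above rather than a confirmed definition, and non-numeric schemes such as `alphabetic` are deliberately rejected.

```python
def expand_interpolation(start, end, value):
    """List the house numbers implied by an addr:interpolation value."""
    low, high = sorted((start, end))
    if value in ("odd", "even"):
        step = 2
        wanted_parity = 1 if value == "odd" else 0
        if low % 2 != wanted_parity:
            low += 1  # move to the first number of the requested parity
    elif value == "all":
        step = 1  # assumption: 'all' means both odd and even numbers
    elif value.isdigit() and int(value) > 0:
        step = int(value)  # numeric offset between consecutive house numbers
    else:
        raise ValueError(f"unsupported interpolation value: {value!r}")
    return list(range(low, high + 1, step))
```

For example, `expand_interpolation(1, 9, "odd")` yields the five odd numbers from 1 to 9, and the same call with `"2"` produces the same list because the range already starts on an odd number.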
+ +For a full list of `addr:interpolation` tags in use see: [taginfo](http://taginfo.openstreetmap.org/keys/addr:interpolation#values). The most common tag values are: + +|total|tag| +|---|---| +|~1M|addr:interpolation=even| +|~1M|addr:interpolation=odd| +|~18k|addr:interpolation=all| +|~1.6k|addr:interpolation=alphabetic| +|~200|addr:interpolation=4| +|~150|addr:interpolation=1| +|~100|addr:interpolation=2| + +It's natural to assume that the `addr:interpolation` tag is commonly added to ways which also contain a `highway:*` tag, but [taginfo](http://taginfo.openstreetmap.org/keys/addr%3Ainterpolation#combinations) shows us that only `643` of the ~2M interpolation tags actually belong to road network segments. + +The majority of `addr:interpolation` tags are 'invisible ways' which represent the path along which the houses will sit. On the base map they look like this: + +![basemap interpolation](http://missinglink.embed.s3.amazonaws.com/osm-interpolation-tag.png) + +In the example above the mapper has entered the first node and last node of the sequence, tagged each with the `addr:street` and `addr:housenumber` tags and then joined the two with [a way](https://www.openstreetmap.org/way/251485113) which contains a single tag `addr:interpolation=odd`. The way is represented by a dashed line. + +This seems to be the most common use of the tag. In some cases the mapper is simply trying to cover a large amount of ground quickly; maybe they got tired of drawing the individual building outline for each house or maybe the area is still under construction. The wiki contains some additional tags for defining the [confidence of the survey](http://wiki.openstreetmap.org/wiki/Addresses#Using_Address_Interpolation_for_partial_surveys). 
+ +In some areas the interpolation values have been replaced by individual building outlines, while in other areas interpolation is still common, such as: + +![basemap interpolation](http://missinglink.embed.s3.amazonaws.com/osm-interpolation-coverage.png) + +I would be interested in doing more research in to the number of house numbers available in OSM using `addr:street` and `addr:housenumber` vs. using `addr:interpolation:*`. It would take some time to compute but would make a great blog post. diff --git a/docs/introduction.md b/docs/introduction.md new file mode 100644 index 00000000..3ebe164e --- /dev/null +++ b/docs/introduction.md @@ -0,0 +1,118 @@ + +# Improved address discovery + +This document discusses address-range geocoding with Pelias in order to facilitate better address matching. + +## Our values + +Pelias is a modular, open-source geocoder built on top of ElasticSearch for fast geocoding. + +In order to encourage external contribution *and* to ease the burden of installing and maintaining Pelias we made some early design decisions: + +- It should be easy to install and require few external dependencies. +- It should be written in as few different programming languages as possible. + +These decisions have kept the software modular and easy to install, maintain and contribute to over the years. + +One of the compromises we made was not including a relational database in our stack. If we required a PostgreSQL database containing an OpenStreetMap import, this would increase the complexity of installing, hosting and developing the software. + +This would also not be in line with our vision of commoditizing geocoding: it would add significant financial costs to host and require developer time to maintain and develop. Additionally it would likely focus the product around OpenStreetMap and would likely result in the domain logic being handled by a variety of different tools in different languages. 
+ +Javascript is the lingua franca of programming languages and we use it as much as possible. There are however one or two cases where we required another language to perform a specific task. In both cases, [pbf2json](https://github.com/pelias/pbf2json) and [libpostal](https://github.com/openvenues/libpostal), the tools have been wrapped in javascript bindings which require no external dependencies or compilers. + +We also rely on data exports provided by external parties, for example: openstreetmap, openaddresses and geonames provide 'data dumps' which contributors may utilize in their installation. libpostal likewise provides 'builds' which we can pull down and utilize. + +This separation of responsibility is core to our values; we consider it much more valuable to the community to have functionality externalized rather than internalized within a monolithic geocoding library. + +### Address coverage + +At time of writing we have ~285 million address points available to search from two different data sources: + +| source | total address points | +|---|---| +| openstreetmap | ~45M | +| openaddresses | ~240M | + +#### Openstreetmap + +Openstreetmap is an excellent source of road network data. The coverage of street addresses is fairly complete in some countries such as Germany and the USA, however it is sparse in other countries. + +There is an opportunity to increase the address points coverage provided by Openstreetmap by ~20M (estimate) using interpolation tags; this is covered later in this document. However it's clear that we will not be able to source a significant amount of global addresses from OSM. + +#### Openaddresses + +The amount of data being imported from `openaddresses` is still growing; it can increase by tens of millions of addresses per month. 
+ +### Known issues + +#### Street fallback + +In the case where no matching address is found for a search, we should fall back to the road network in order to return more general information about the street in lieu of the actual address point. + +We do not currently do this. If an exact address is not found, we simply attempt to return another address on the same street; if no addresses exist on the street we return nothing. + +An example is [this road](http://www.openstreetmap.org/way/34243335) which has street geometry but does not have any address points available. + +In this situation we cannot infer any numbering scheme or logically guess a number range so we will never be able to provide an exact house number. + +Pelias does not currently import street geometry from Openstreetmap due to the large amount of duplicate/invalid names. + +There is a [ticket](https://github.com/pelias/openstreetmap/issues/19) and a [repository](https://github.com/pelias/osm-featurelist-evaluation) dedicated to exploring the OSM features we bring in to Pelias. Some extracts from that repo illustrate the issues with the data: + +> Note: this data was taken from an NYC metro extract of records containing the highway:* tag + +``` +node 2708075964 Lamp Post +node 2708075965 Lamp Post +node 2708075966 Lamp Post +node 2708075967 Lamp Post +node 2708075970 Lamp Post +node 2708075974 Lamp Post +node 2708075976 Lamp Post +node 2708075977 Lamp Post +node 2708075979 Lamp Post +node 2708075981 Lamp Post +node 2708075982 Lamp Post +node 2708075983 Lamp Post +node 2708075984 Lamp Post +node 2708075988 Lamp Post +``` + +``` +way 5709922 78th Street +way 5709923 78th Street +way 5709924 78th Street +way 5709925 78th Street +way 5709927 78th Street +way 5709928 78th Street +way 5709930 78th Street +way 5709931 78th Street +way 5709933 78th Street +way 5709934 78th Street +way 5709935 78th Street +way 5709936 78th Street +way 5709937 78th Street +way 5709938 78th Street +``` + +More examples: 
https://raw.githubusercontent.com/pelias/osm-featurelist-evaluation/master/cuts/highway.text + +Importing this duplicate content would bloat the index and cause user experience problems. + +The options for solving this issue are discussed later in this document in a section called 'clustering'. + +#### Interpolation + +In the case where we cannot find the exact house number *but* we know the street, we can attempt to 'infer' the position of the missing house number. + +Interpolation is based on numeric 'ranges' of house numbers; it's not as precise as exact locations but can greatly increase the address coverage in the search engine. + +![interpolation](http://missinglink.embed.s3.amazonaws.com/tiger-interpolation-basics.png) + +Interpolation can be complex due to the shape of the geometry, the offset from the road network, 'holes' in building ranges etc. This will be covered in more detail later in the document. + +When discussing interpolation it's important to remember that although the position returned is inexact, the value to an automobile driver is still high. In most cases guiding a driver to within ~20m of their target will allow them to visually guide themselves the rest of the way to the destination. + +Care must be taken in order to ensure that the destination point provided does not lie in a hazard. A naive linear interpolation between a group of points might center the destination point inside a building or water hazard; 'snapping' data to a road network should mitigate these issues. + +This issue is discussed in more detail later in this document in a section called 'interpolation'.
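As a rough illustration of interpolating along the road itself rather than in a straight line between two points, the Python sketch below walks a line string to estimate where a missing house number would sit. It is not Pelias code: the function names are hypothetical, the coordinates are assumed to be planar, and the two known house numbers are assumed to have already been snapped to the line and expressed as distances along it.

```python
import math


def point_at_distance(line, distance):
    """Point found by walking `distance` units along a list of (x, y) vertices."""
    remaining = max(0.0, distance)
    for (ax, ay), (bx, by) in zip(line, line[1:]):
        seg_len = math.hypot(bx - ax, by - ay)
        if seg_len > 0 and remaining <= seg_len:
            t = remaining / seg_len
            return (ax + t * (bx - ax), ay + t * (by - ay))
        remaining -= seg_len
    return line[-1]  # past the end: clamp to the final vertex


def interpolate_position(line, known_low, known_high, target):
    """Estimate where `target` sits between two known (house_number, distance) pairs."""
    (low_no, low_dist), (high_no, high_dist) = known_low, known_high
    if not low_no <= target <= high_no or low_no == high_no:
        return None  # target outside the known range, nothing to interpolate
    fraction = (target - low_no) / (high_no - low_no)
    return point_at_distance(line, low_dist + fraction * (high_dist - low_dist))
```

Because the estimate is produced by walking the line string, it always lies on the road geometry, which is the property that keeps the result out of buildings and water.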