Wednesday, 24 August 2011

Matching IP addresses, cities, geolocations


On xEvents, users search for events in part based on location. Event locations are specified as postal addresses by the users who submit them. In general, all we know about our visitors is their IP address or the city where they live. Given this basic data, how do we enable users to easily find events taking place near them?

First we have to make sure that the cities of users and events can be cross-referenced. To achieve this we require users to pick their cities or the cities of their events from a list of cities derived from the database available at geonames.org. This is a free database of geographic landmarks (including towns of all sizes) which comes with latitudes and longitudes. Given these constraints, we're able to calculate distances between all our events and all the users who have specified their city in their profiles.
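
To make the distance step concrete, here is a minimal Python sketch of a great-circle (haversine) distance between two latitude/longitude pairs. It is an illustration only, not the code xEvents runs: the function name and the spherical-Earth radius are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Calling `haversine_km(51.5074, -0.1278, 48.8566, 2.3522)` gives the approximate London to Paris distance in kilometres.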

Things get more complicated when we want to offer a good default city for a newly registered user, or just to guess the city of an unregistered user to provide a good search default. For that we have to rely on a database which matches IP ranges with city names and latitudes/longitudes. After some research we've settled on the IPLigence database because it was the cheapest that seemed to have all the features we needed.
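
As a rough illustration of what an IP-range lookup involves, the Python sketch below finds the range containing an address with a binary search. The row layout, the sample ranges (reserved documentation addresses) and the function name are all hypothetical; the real IPLigence schema may differ, and the sketch assumes IPv4 ranges that are sorted and do not overlap.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


# Hypothetical rows: (range_start, range_end, city_label, latitude, longitude),
# sorted by range_start and assumed not to overlap.
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "LONDON", 51.51, -0.13),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "PARIS", 48.86, 2.35),
]
STARTS = [row[0] for row in RANGES]


def lookup_ip(ip):
    """Return the row whose range contains ip, or None if no range does."""
    n = ip_to_int(ip)
    # Index of the last range starting at or before n.
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i]
    return None
```

With this sample data, `lookup_ip("192.0.2.17")` returns the "LONDON" row, while an address outside every range returns `None`.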

Unfortunately, getting the latitude and longitude of an anonymous user's IP address is not enough: we need to know which city they are in, because some of our searches have criteria like “same city”. So we've had to align the city labels provided by IPLigence with those we use for events and registered users. (We couldn't use the IPLigence labels everywhere because city names are poorly formatted in their data. They also count as cities many administrative regions that are not cities; for example, many London boroughs are represented as cities in this database. Geonames is much better organised in this regard.)

So we've had to match the city labels in the IPLigence database with the city labels in the Geonames database. This turned out to be more complicated than we initially expected. The general algorithm is to take every IPLigence city label and find the best match in the Geonames database. Here's how we find the best match. First we look for Geonames entries with the same country code and city name. If there is exactly one, problem solved: that entry is our best match. If there is none, we abandon that city label (this is very rare). The common and problematic case is multiple matches. Here we pick the candidate nearest to the IPLigence entry, using the coordinates that both databases provide.
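
The matching rules just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our production code: the dictionary fields (`country`, `name`, `admin1`, `lat`, `lon`), the function names and the sample rows are hypothetical, and comparing names after trimming and lower-casing is an assumption about what "same city name" means.

```python
import math
from collections import defaultdict


def distance_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km (spherical-Earth approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def city_key(row):
    # Assumption: names match after trimming whitespace and lower-casing.
    return (row["country"], row["name"].strip().lower())


def build_index(geonames_rows):
    """Group Geonames rows by (country code, normalised city name)."""
    index = defaultdict(list)
    for row in geonames_rows:
        index[city_key(row)].append(row)
    return index


def best_match(ip_city, index):
    """Return the best Geonames row for an IP-database city, or None."""
    candidates = index.get(city_key(ip_city), [])
    if not candidates:
        return None  # no match: abandon this label
    if len(candidates) == 1:
        return candidates[0]  # unique match
    # Several matches: keep the one closest to the IP database's coordinates.
    return min(
        candidates,
        key=lambda c: distance_km(ip_city["lat"], ip_city["lon"], c["lat"], c["lon"]),
    )


# Hypothetical sample data, for illustration only.
GEONAMES = [
    {"country": "US", "name": "Springfield", "admin1": "IL", "lat": 39.80, "lon": -89.64},
    {"country": "US", "name": "Springfield", "admin1": "MA", "lat": 42.10, "lon": -72.59},
    {"country": "FR", "name": "Paris", "admin1": "IDF", "lat": 48.85, "lon": 2.35},
]
INDEX = build_index(GEONAMES)
```

With this sample data, an IP-database entry `{"country": "US", "name": "SPRINGFIELD", "lat": 42.0, "lon": -72.5}` resolves to the Massachusetts row, because it is the nearer of the two same-named candidates.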

We will follow up soon with another post about how to do geographic distance calculations with MySQL and Hibernate efficiently.  
 
Copyright David Bourget and University of London, 2011. This blog's content is licensed under the Attribution-ShareAlike license.