Finding coarse highways in OpenStreetMap

Quality control in OpenStreetMap

OpenStreetMap is a rich project with diverse users and entries. As such, the database is rife with insufficient, contradictory or plain ugly entries. Some people would rather have no entry at all than one that might be an embarrassment. Others applaud the fact that there is an entry at all, and wait for crowd wisdom to set things straight. And there are others who build tools such as the OSM Inspector that hint at possible problems so that helpful gnomes may take care of them.

One of the issues that came up on the mailing list talk-de was roads drawn as coarse polygons, looking like a race track from the Disney movie Tron. Maybe somebody took very few trackpoints, maybe somebody was driving too fast, maybe somebody copied a road from a coarse satellite image or maybe somebody was just a bit lazy.

How to extract coarse highways

General features

Coarse highways are highways that have kinks, serpentines that zigzag instead of bending, and long stretches of straight road. They look ugly on the map and are unlikely to reflect the real world. As such, they are characterised by triples of nodes showing

  1. sharp angles
  2. long sides.

Extracting those features

Looking at the tools people have written to extract features from the OSM database, I feel ten years younger: many of them are written in Perl. My first experiences with XSLT and Saxon suggest that this may even be a wise choice, as Perl is relatively fast at extracting information. In particular, a user known as Gary68 has written a Perl module along with small sample applications that use it. Building on that, I have written a tool that extracts all nodes belonging to motorway (link), trunk (link), primary, secondary, and tertiary highways and calculates the angle and the lengths of the sides of each triple of subsequent nodes. The Perl script I concocted does just that. To proceed, do this:

  • Download the Perl module and save it under the name under an appropriate module path in a directory named OSM. On my Linux box, it sits under /usr/local/lib/perl/5.8.7/OSM/
  • Download the aforementioned Perl script.
  • Change the two variables in the lines below so that they point to an appropriate .osm file. In the example, I took the OSM files I got from unzipping the contents of mecklenburg-vorpommern.osm.bz2 from the Geofabrik server.
    my $osmPath="$ENV{HOME}/osm/data";
    my $fileName="mecklenburg-vorpommern.osm";
  • Run the script and redirect the standard output to a new text file. The file I obtained based on the current download has 145642 lines.
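The extraction step can be illustrated independently of the original Perl script. What follows is a minimal Python sketch, not the script itself: the function name node_triples, the set HIGHWAY_CLASSES and the tag values in it are assumptions made for this illustration. It also assumes that all nodes precede the ways in the .osm file and that the node table fits into memory, which should hold for a regional extract.

```python
import io
import xml.etree.ElementTree as ET

# Highway classes named in the text; the exact tag values are an assumption.
HIGHWAY_CLASSES = {
    "motorway", "motorway_link", "trunk", "trunk_link",
    "primary", "secondary", "tertiary",
}

def node_triples(osm_file):
    """Yield (way_id, (a, b, c)) for each triple of subsequent nodes on a
    selected highway; a, b and c are (node_id, lat, lon) tuples."""
    nodes = {}
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag == "node":
            nodes[elem.get("id")] = (float(elem.get("lat")),
                                     float(elem.get("lon")))
            elem.clear()  # free memory, the coordinates are stored already
        elif elem.tag == "way":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if tags.get("highway") in HIGHWAY_CLASSES:
                refs = [nd.get("ref") for nd in elem.findall("nd")]
                # Skip references to nodes that are missing from the extract.
                pts = [(r, *nodes[r]) for r in refs if r in nodes]
                for a, b, c in zip(pts, pts[1:], pts[2:]):
                    yield elem.get("id"), (a, b, c)
            elem.clear()

# Tiny made-up sample: one primary road with four nodes, one residential road.
SAMPLE = b"""<osm version="0.6">
  <node id="1" lat="54.00" lon="12.00"/>
  <node id="2" lat="54.01" lon="12.00"/>
  <node id="3" lat="54.02" lon="12.01"/>
  <node id="4" lat="54.03" lon="12.01"/>
  <way id="10">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="4"/>
    <tag k="highway" v="primary"/>
  </way>
  <way id="11">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""

triples = list(node_triples(io.BytesIO(SAMPLE)))
```

In the sample, only the primary road contributes triples; the residential road is filtered out. For a real file, pass an opened .osm file instead of the in-memory sample.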

How to look at the coarse highways

General idea

We don't have a good concept of what is wheat and what is chaff yet. Therefore we want to look at the representations and sift through some examples to get a general idea. This is the domain of exploratory data analysis, or dynamic visualisation.

What was exported and why

The script does not exactly compute the angle and the length of the sides of each node triple. Instead, it computes the cosine of the angle and the product of the lengths. I decided to do this because I wanted the information on coarseness represented by two variables, and the sign of the angle (+12° or -12°) does not matter. Having the information concentrated in two variables allows me to scatterplot it. A scatterplot is capable of presenting a lot of visual information to the observer.
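To make the two exported quantities concrete, here is a minimal Python sketch of how they could be computed for one node triple. It is an illustration under assumptions, not the original Perl code: the angle is taken to be the change of direction at the middle node, so that a straight continuation gives a cosine of 1, and distances are approximated on a local plane. The names local_xy, triple_measures and EARTH_RADIUS_KM are made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def local_xy(lat, lon, lat0):
    """Project a point to kilometres on a local plane around latitude lat0
    (equirectangular approximation, adequate for short road segments)."""
    x = math.radians(lon) * math.cos(math.radians(lat0)) * EARTH_RADIUS_KM
    y = math.radians(lat) * EARTH_RADIUS_KM
    return x, y

def triple_measures(a, b, c):
    """Return (cosine of the turn at b, product of the side lengths in km^2)
    for three (lat, lon) points, or None if two of them coincide."""
    lat0 = b[0]
    ax, ay = local_xy(a[0], a[1], lat0)
    bx, by = local_xy(b[0], b[1], lat0)
    cx, cy = local_xy(c[0], c[1], lat0)
    ux, uy = bx - ax, by - ay  # incoming segment a -> b
    vx, vy = cx - bx, cy - by  # outgoing segment b -> c
    product = math.hypot(ux, uy) * math.hypot(vx, vy)
    if product == 0.0:
        return None  # duplicate nodes: the angle is undefined
    cosine = (ux * vx + uy * vy) / product
    return cosine, product

# Straight continuation due north, and a turn to the left and to the right.
straight = triple_measures((54.00, 12.00), (54.01, 12.00), (54.02, 12.00))
left = triple_measures((54.00, 12.00), (54.01, 12.00), (54.02, 11.99))
right = triple_measures((54.00, 12.00), (54.01, 12.00), (54.02, 12.01))
```

The left and right turns yield the same cosine, which is why the sign of the angle can be dropped. The straight triple has a cosine of 1 and a product of roughly 1.24 km^2, as each side is about 1.11 km long.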

Starting the recommended program for exploratory data analysis

I downloaded Mondrian by Martin Theus, a program which is designed for visually analysing large data sets. It seems to be written as a refutation of the prejudice that Java programs have to be slow. The program is simply a *.jar file which can be started by something like java -jar path/to/Mondrian.jar. The program expects the statistical software package R to be running with the library Rserve. If R is not running, well, Mondrian will cough a bit and take a bit longer to start. Don't worry about the warnings issued. Then, load the data by using File->Open.

Exploring the data

Firstly, we see that the data type of the IDs has mistakenly been assumed to be numeric, while we'd rather see it as a categorical variable. To change this, highlight the respective variable name and select Options->Switch variable mode. You needn't do that for the nodeID as it takes considerable time to update the mode (did I mention that Mondrian is fast? Wait and see) but it makes a lot of sense to do that for the wayID.

Then, we want to scatterplot the nodes by angle and lengths. Highlight the angle and lengths variables and select Plot->Scatterplot. The well-behaving node triples now sit in the bottom right corner. To get an idea of the concentration of nodes you can press <Left> repeatedly in order to decrease the alpha channel of the dots, but you should then really press <Right> to increase it as we want to single out the offending node triples.

Along the bottom, the sharp angles are seen towards the left. As the cosine has a low gradient for small angles, even small deviations from the right edge deserve attention. Along the right side you see triples with longer sides. In my example, the top node is placed at 11.2 km^2, which means a node whose two adjacent edges have a (geometric) mean length of more than three kilometers (Ouch!).
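The two numerical remarks in this paragraph can be checked with a few lines of Python. This is only a small worked example; the chosen angles and the variable names are picked for illustration.

```python
import math

# Cosine for a few turn angles: it changes very little for small angles,
# so even a modest kink sits only slightly left of the right edge.
cosines = {deg: math.cos(math.radians(deg)) for deg in (5, 12, 30, 90)}
for deg, value in cosines.items():
    print(f"{deg:3d} degrees -> cosine {value:.4f}")

# A product of 11.2 km^2 corresponds to this geometric mean side length.
mean_side_km = math.sqrt(11.2)
print(f"geometric mean side length: {mean_side_km:.2f} km")
```

A turn of 12° still has a cosine of about 0.978, and the 11.2 km^2 point corresponds to sides of about 3.35 km each if both were equally long.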

In order to get an idea of where the nodes are on the map, a scatterplot of longitude vs latitude might be a good idea. So we select Plot->Scatterplot on the variables lat and lon. The resulting plot does not resemble Mecklenburg-Vorpommern very well, though. So we have to right-click the plot (or do the equivalent with a one-buttoned mouse) and select "fixed aspect ratio" and "flip axes". Now we can select offending points on the QC scatterplot and see roughly where they are on the map.

But which ways and nodes are affected? Let's highlight the variable "name" and select Plot->Barchart. Do the same thing with the variable "wayID". The resulting screen is quite crowded, but the windows can be placed against each other to allow a good overview even on my non-widescreen monitor. Mark some offending points on the scatterplot and select Sort by->absolute selected. You will get the road names or wayIDs which contain most of the offending points.


For example, the way represented by the ID 28724722 seems to have a problem. Let's select the way in the barchart and zoom in to the highlighted points on the map (using middle mouse button and rubber band).


This looks coarse indeed, but before we write an OpenStreetBugs entry, we look at the way in the database itself.


Apparently, the mapper was aware that the granularity is not up to OSM standards and has attached a FIXME tag already, so we don't really need to jump to OpenStreetBugs now.


Quality control can be fun! Can it be done algorithmically instead? One would want to look for several adjacent node triples on a way with long sides and sharp angles. This can be put into an algorithm, but a bit more experience and playing around is required first. For this, tools like Mondrian (maybe also Spotfire or ggobi) are perfectly suited. Being able to sort by frequency of selected nodes quickly gets us the less than perfect ways at the top. Unfortunately, I haven't succeeded in making nodes directly accessible by a browser yet, so some copying, pasting and typing is required. But then we should look at all cases separately anyway. After all, in the above example a coarse highway is not the most immediate problem on the map as long as some villages there aren't mapped at all. Some minor gripes remain; for example, the IDs are represented in scientific notation even when they are chosen to be on a categorical scale. But all in all, Mondrian may be considered a graphical quality control tool for several other OSM applications.
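As a starting point for such an algorithm, here is a minimal Python sketch of the idea of looking for several adjacent offending triples on one way. All thresholds (MAX_COSINE, MIN_PRODUCT_KM2, MIN_RUN) and the function name coarse_ways are hypothetical and would need tuning against the kind of exploration described above. The sketch assumes the rows are grouped by way and listed in node order.

```python
from itertools import groupby

# Hypothetical thresholds; the text does not fix any values.
MAX_COSINE = 0.95       # turns sharper than roughly 18 degrees
MIN_PRODUCT_KM2 = 0.25  # geometric mean side length of at least 0.5 km
MIN_RUN = 3             # how many adjacent triples count as "several"

def coarse_ways(rows):
    """rows: iterable of (way_id, cosine, product_km2), grouped by way and
    in node order. Return {way_id: longest run of consecutive offending
    triples} for the ways whose longest run reaches MIN_RUN."""
    result = {}
    for way_id, way_rows in groupby(rows, key=lambda row: row[0]):
        longest = current = 0
        for _, cosine, product in way_rows:
            if cosine <= MAX_COSINE and product >= MIN_PRODUCT_KM2:
                current += 1
                longest = max(longest, current)
            else:
                current = 0  # a well-behaved triple ends the run
        if longest >= MIN_RUN:
            result[way_id] = longest
    return result

# Made-up rows: way "A" has three offending triples in a row, way "B"
# has two offending triples separated by a well-behaved one.
rows = [
    ("A", 0.90, 1.0), ("A", 0.80, 0.5), ("A", 0.70, 0.3), ("A", 0.99, 0.3),
    ("B", 0.90, 1.0), ("B", 0.99, 1.0), ("B", 0.90, 1.0),
]
flagged = coarse_ways(rows)
```

With these made-up rows only way "A" is flagged. Requiring a run rather than a single offending triple is a design choice meant to spare real junctions and hairpins, where one sharp angle between long sides can be perfectly legitimate.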

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License