Thursday, November 11, 2010

Poor man's geocoder

Many addresses can be interpolated from the Census Bureau's TIGER/Line data.  Unfortunately, the TIGER database is woefully incomplete, right now, though the LUCA program may add a bunch of data for us.  Expect to see that release (the 2010 shapefiles) in March, 2011.  In my testing, I was able to assign tlids to anywhere from 60% to 85% of my input of DPV-confirmed addresses, so if you're looking for better coverage...  this solution is not for you.

The rough steps to geocode an address are these:

1)  Get the appropriate TIGER/Line shapefile from the Census.  You need to get the county All Lines file, as this is the one which has points with latitude and longitude.
2)  Convert the shapefile for easier access (optional); rather than deal with learning a GIS system, such as GRASS, I opted to load the shapefile data into MySQL.  I used PERL and the CPAN library called Shape to parse the shapefiles into flat files that I could load into MySQL.  I went all relational and created two tables - the first is called shape_record, and is essentially unique to TIGER/Line ID (tlid).  This table contains the ZIP Codes, street names, and address range information.  The second table is shape_detail, is related to the shape_record on tlid.  shape_detail contains each of the vertices of the TIGER/Line along with their latitude and longitude.
3)  Fairly important step, here, is to standardize the data on each side of your operation.  The field with the street name in it has some irregularities, and your input address may also need some cleaning up.
4)  Enter the shape_record table looking for your street name and ZIP code.  Narrow the result set to those tlids which have addresses which surround yours.  Note that you may have more than one tlid for a single input address.
5)  Enter the shape_detail table placing these records (generally at least two) into an array.  These are the points of your street.  Interpolate as appropriate - this part is easier expressed in code, so here's my hack-job in PHP:

        $res = mysql_query( $sql, $conn );
        if ( $res )
        {
                while ( $row = mysql_fetch_assoc( $res ) )
                {
                        $verts[$row['vertid']] = array( 'lat' => $row['lat'], 'lng' => $row['lng'] );
                }
                mysql_free_result( $res );

                // get the line segments;
                //calc total line length;
                $lines = array();
                for ( $i = 1; $i < sizeof( $verts ); $i++ )
                {
                        $lines[] = array(
                                'length' => sqrt(
                                        ( abs( $verts[$i]["lng"] - $verts[$i-1]["lng"] ) * abs( $verts[$i]["lng"] - $verts[$i-1]["lng"] ) )
                                        +
                                        ( abs( $verts[$i]["lat"] - $verts[$i-1]["lat"] ) * abs( $verts[$i]["lat"] - $verts[$i-1]["lat"] ) )
                                ),
                                'segments' => 0,
                                'verta' => $i - 1,
                                'vertb' => $i
                        );
                }
                $streetlen = 0;
                for ( $i = 0; $i < sizeof( $lines ); $i++ )
                {
                        $streetlen += $lines[$i]['length'];
                }

                // calculate the number of house-points ( tohn - fromhn )
                $segments = abs( $_REQUEST['fromhn'] - $_REQUEST['tohn'] );
                $seglen = $streetlen / $segments;
                $togo = abs( $_REQUEST['fromhn'] - $_REQUEST['number'] );

                for ( $i = 0; $i < sizeof( $lines ); $i++ )
                {
                        $lines[$i]['segments'] = $lines[$i]['length'] / $seglen;
                        $togo -= $lines[$i]['segments'];
                        if ( $togo < 0 )
                        {
                                $togo += $lines[$i]['segments'];
                                $selected_line = $i;
                                break;
                        }
                }

                //now we have a two points, establishing a line, and a third value which represents a vector along the line from the first point
                //calculate the point at the end of the vector.
                $longdiff = abs( $verts[ $lines[$i]['verta'] ]['lng'] - $verts[ $lines[$i]['vertb'] ]['lng'] );
                $latdiff = abs( $verts[ $lines[$i]['verta'] ]['lat'] - $verts[ $lines[$i]['vertb'] ]['lat'] );

                if ( $verts[ $lines[$i]['verta'] ]['lng'] > $verts[ $lines[$i]['vertb'] ]['lng'] )
                {
                        //going west
                        $longoper = -1;
                }
                else
                {
                        //going east or nowhere
                        $longoper = 1;
                }

                if( $verts[ $lines[$i]['verta'] ]['lat'] > $verts[ $lines[$i]['vertb'] ]['lat'] )
                {
                        //going south
                        $latoper = -1;
                }
                else
                {
                        //going north or nowhere
                        $latoper = 1;
                }


                if ( 0 == $longdiff * $latdiff )
                        $angle_a = 0;
                else
                        $angle_a = atan( $longdiff / $latdiff );
                $angle_b = M_PI - ( ( .5*M_PI ) + $angle_a );
                $angle_c = .5 * M_PI;
                $togo *= $seglen; //this is the hypotenuse
                $side_b = $togo * sin( $angle_b ); //this is our new latitude offset
                $side_a = $togo * sin( $angle_a ); //this is our new longitude offset

                printf( "maps.google.com/maps?z=17&q=loc:%f+%f\n",
                        $verts[ $lines[$i]['verta'] ]['lat'] + ( $side_b * $latoper ),
                        $verts[ $lines[$i]['verta'] ]['lng'] + ( $side_a * $longoper )
                );

       } 

Yes, I used straight trig, not spheroid geometry - we're talking about distances of hundreds of feet, here, so I decided that the curvature of the earth wasn't relevant.  In my test cases, I was generally within 100 feet of the actual location, and this is mostly due to the inaccuracies within the TIGER/Line data.  For example, a tlid might have a start address of 8500 and an end address of 8799, but the first house number is actually 8601.  This particular error might geocode the house a football field away from where it really sits.

No comments: