Sunday, February 18, 2007

import zcta zip data to postgres postgis

Often you will receive data in a non-spatial form such as comma delimited data with latitude and longitude fields. To take full advantage of PostGIS spatial abilities, you will want to create geometry fields in your new table and update that field using the longitude latitude fields you have available.

General Note: All the command statements that follow should be run from the PgAdminIII Tools - Query Tool or any other PostGreSQL Administrative tool you have available. If you are a command line freak - you can use the psql command line tool packaged with PostGreSQL.

Getting the data

For this exercise, we will use US zip code tabulation areas instead of just Boston data. The techniques here will apply to any data you get actually.

First step is to download the data from US Census. http://www.census.gov/geo/www/gazetteer/places2k.html

Importing the Data into PostGreSQL

PostGreSQL comes with a COPY function that allows you to import data from a delimited text file. Since the ZCTAs data is provided in fixed-width format, we can't import it easily without first converting it to a delimited such as the default tab-delimited format that COPY works with. Similarly for data in other formats such as DBF, you'll either want to convert it to delimited using tools such as excel, use a third party tool that will import from one format to another, or one of my favorite tools Microsoft Access that allows you to link any tables or do a straight import and export to any ODBC compliant database such as PostGreSQL.

Create the table to import to

First you will need to create the table in Postgres. You want to make sure the order of the fields is in the same order as the data you are importing.


CREATE TABLE zctas
(
state char(2),
zcta char(5),
junk varchar(100),
population_tot int8,
housing_tot int8,
water_area_meter float8,
land_area_meter float8,
water_area_mile float8,
land_area_mile float8,
latitude float8,
longitude float8
)
WITHOUT OIDS;

Convert from Fixed-width to Tab-Delimited

For this part of the exercise, I'm going to use Microsoft Excel because it has a nice wizard for dealing with fixed-width and a lot of windows users have it already. If you open the zcta file in Excel, it should launch the Text Import Wizard. MS Access has a similarly nice wizard and can deal with files larger than excels 65000 some odd limitation. Note there are trillions of ways to do this step so I'm not going to bother going over the other ways. For non-MS Office users other office suites such as Open-Office probably have similar functionality.

  1. Open the file in Excel.
  2. Import Text Wizard should launch automatically and have Fixed-Width as an option
  3. Look at the ZCTA table layout spec http://www.census.gov/geo/www/gazetteer/places2k.html#zcta and set your breakouts the same as specified. For the above I broke out the Name field further into first 5 for zcta and the rest for a junk field.
  4. Next File->Save As ->Text (Tab delimited)(*.txt) -give it name of zcta5.tab
  5. Copy the file to somewhere on your PostGreSQL server.
  6. The COPY command

    Now copy the data into the table using the COPY command. Note the Copy command works using the PostGreSQL service so the file location must be specified relative to the Server.


    COPY zctas FROM 'C:/Downloads/GISData/zcta5.tab';

    Creating and Populating the Geometry Field

    Create the Geometry Field

    To create the Geometry field, use the AddGeometryColumn opengis function. This will add a geometry field to the specified table as well as adding a record to the geometry_columns meta table and creating useful constraints on the new field. A summary of the function can be found here http://postgis.refractions.net/docs/ch06.html#id2526109.

    SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_lonlat', 4269, 'POINT', 2 );

    The above code will create a geometry column named thepoint_longlat in the table zctas that validates to make sure the inputs are 2-dimensional points in SRID 4269 (NAD83 longlat).

    Populate the Geometry Field using the Longitude and Latitude fields


    UPDATE zctas
    SET thepoint_lonlat = PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269)

    The above code will generate a Text representation of a point and convert this representation to a PostGis geometry object of spatial reference SRID 4269.

    There are a couple of things I would like to point out that may not be apparently clear to people not familiar with PostGreSQL or PostGis

    • || is a string concatenator. It is actually the ANSI-standard way of concatenating strings together. In MySQL you would do this using the CONCAT function and in Microsoft SQL Server you would use +. Oracle also uses ||. So what the inner part of the code would do is to generate something that looks like POINT(-97.014256 38.959448).
    • You can't just put any arbitrary SRID in there and expect the system to magically transform to that. The SRID you specify has to be the reference system that your text representation is in.

    Transforming to Another spatial reference system

    The above is great if you want your geometry in longlat spatial reference system. In many cases, longlat is not terribly useful. For example if you want to do distance queries with your data, you don't want your distance returned back in longlat. You want it in a metric that you normally measure things in.

    In the code below, we will create a new geometry field that holds points in the WGS 84 North Meter reference system and then updates that field accordingly.


    SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_meter', 32661, 'POINT', 2 );

    UPDATE zctas
    SET thepoint_meter = transform(PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269),32661) ;

    Index your spatial fields

    One of the number one reasons for poor query performance is lack of attention to indexes. Putting in an index can make as much as a 100 fold difference in query speed depending on how many records you have in the table. For large updates and imports, you should put your indexes in after the load, because while indexes help query speed, updates against indexed fields can be very slow because they need to create index records for the updated/inserted data. In the below, we will be putting in GIST indexes against our spatial fields.


    CREATE INDEX idx_zctas_thepoint_lonlat ON zctas
    USING GIST (thepoint_lonlat);

    CREATE INDEX idx_zctas_thepoint_meter ON zctas
    USING GIST (thepoint_meter);

    ALTER TABLE zctas ALTER COLUMN thepoint_meter SET NOT NULL;
    CLUSTER idx_zctas_thepoint_meter ON zctas;

    VACUUM ANALYZE zctas;

    In the above after we create the indexes, we put in a constraint to not allow nulls in the thepoint_meter field. The not null constraint is required for clustering since as of now, clustering is not allowed on gist indexes that have null values. Next we cluster on this index. Clustering basically physically reorders the table in the order of the index. In general spatial queries are much slower than attribute based queries, so if you do a fair amount of spatial queries, you get a huge gain.

    In the above we vacuum analyze the table to insure that index statistics are updated for our table.

No comments: