Sunday, July 15, 2012

PostGIS 2.0 - Intersect Raster and Polygon

I've just taken up the task to experiment with PostGIS 2.0 raster support. At this point in time, it seems that there is little documentation (by other folks) regarding raster analysis using PostGIS, so I figured I'd write down some of what I'm experiencing. First, raster support in PostGIS is HUGE!  I'm eternally thankful that ESRI is will no longer be my sole raster analysis tool (I know, there's more out there than just ESRI...).

The Data:
-One 10x10km grid cell from the NREL/Perez Direct Normal Irradiance grid
-90m Landscan Raster detailing population for greater Denver area from Oak Ridge National Lab

The Analysis Intent:
-Summarize population within grid cell, first using ESRI and then PostGIS and compare. Easy, no?

1. Load Some Data
First, loading the raster into PostGIS. This is pretty straight forward with one exception, tile size. What tile size do you use?  I would love to see theoretical or empirical analysis detailing the optimal tile size per given cell size and extent. Anyone? Until then, I'll guess.

-C : applies contraints, -M : vacuum analyze, -Y : use copy rather than insert, -t : tile size, -s : SRID, -U : username, -d : database
raster2pgsql -s 4326 -C -M -Y landscanden -t 100x100 ls_den | psql -U postgres -d postgis2

2. Get a Base Estimate
Get a base estimate using ESRI. Using Zonal Statistics I get the following:
Population: 207,578
Cell Count: 14,400

3. How Much Can We Explain?
If my results are different between ESRI and PostGIS, how much could be explained? Assume one program summarizes all border cells while the other does not. This will give us a percent difference we can easily explain. So, if I have a 10x10km grid cell as the zone, then I could fit 111 90m cells on each border for a total of 444 border cells. Square 111.9 to get 12,321 90m cells that could fit inside the 10x10km grid cell. Divide the border cells by the total 444/12,321*100 = 3.6%. We can explain 3.6% of the difference (if there is any) by just the border cells.

4. First Attempts at Raster Analysis in PostGIS
I thought I was thinking logically when I wrote this...should have just looked at the documentation first!
--Create a vectorized result of the geometric intersection between my raster layer and vector polygon
CREATE TABLE sum_pop AS
SELECT gid
,(gv).val AS population
,(gv).geom AS geom
FROM (
SELECT  a.gid
,ST_Intersection(b.rast, a.geom) gv
FROM perez_grid a, ls_den b
WHERE ST_Intersects(b.rast, a.geom) ) foo;

Spatial results looked good. Just needed to summarize
SELECT SUM(population), COUNT(*) AS cell_count FROM sum_pop;

Results:
Population: 193,492
Cell Count: 9,811
Population Percent Difference: ~7%

 Well...that doesn't look right...how do I explain the additional 3.4%? First things first, what is the deal with the count? Well, that one is pretty easy, when the raster is vectorized, it groups cells of like values. So count doesn't really matter.

5. A Hunch or Swing in the Dark?
Ok, not sure why I decided to try this, but I loaded in the exact same raster dataset except for this time I used 200x200 tile size. Shouldn't make a difference, right? WRONG!!! This still vexes me, if anyone can explain I would be grateful.

Results:
Population: 193,446
Count: 9,787

6. The Correct Method
After reading the documentation, I have a plan of attack. I'll follow their example and see what I get.
CREATE TABLE sum_pop2 AS
WITH 
   feat AS (SELECT gid, geom FROM perez_grid AS b ),
   b_stats AS
(SELECT  gid, (stats).*
FROM (
SELECT gid, ST_SummaryStats(ST_Clip(rast,1,geom)) AS stats
FROM ls_den
INNER JOIN feat
ON ST_Intersects(feat.geom,rast) ) AS foo )
SELECT gid, SUM(count) AS cell_count
  ,SUM(sum) AS population
FROM b_stats
 WHERE count > 0
GROUP BY gid
ORDER BY gid;

Results:
Population: 207,578
Cell Count: 14,400
Population Percent Difference: 0%

SUCCESS, however, I'm not sure why examples must be so verbose!  It is a beautiful query don't get me wrong...

7. Discussion
So whats difference between this amd my first attempt?
The WITH clause simply allows you to join queries, you could right this as a query with a subquery and get the same results. Using ST_Clip(), the raster is never converted to vector, the vector is used to clip the raster than a simple SummaryStats() is performed. That's the bulk of it. The GROUP BY clause aggregates all the rids (aka tiles).
The BIG question is, why does the first attempt produce the wrong answer? The intersection produces geomval pairs for only those cells that intersect, so why the difference? I would have guessed that it would have resulted in more population because it grabs even the slightest sliver that intersects (its vector on vector) while ESRI grabs only cells whose centroid falls within the intersect, but PostGIS resulted in less! 

8. A Simple Alternative to the Correct Method
For those that find the above query verbose and hard to understand, I'll provide an alternative query that produces the same result in similar return time.

CREATE TABLE sum_pop3 AS 
SELECT gid, SUM((ST_SummaryStats(ST_Clip(rast,1,geom))).sum)
FROM perez_grid, ls_den
WHERE ST_Intersects(geom,rast) 
GROUP BY gid

7 comments:

  1. Anthony, can you tell me which was faster? Also, is there a difference in storage size (stored natively for ArcGIS v. stored in the DB)?

    Thanks,
    -Tripp Lowe

    ReplyDelete
  2. Tripp, I'll try to dig up my notes on the speed but off the cuff I can tell you that Arc is much faster. I'll check into the storage as well.

    ReplyDelete
  3. This update is incredibly late...nevertheless, see my recent blogpost for speed and size comparison with ESRI: http://movingspatial.blogspot.com/2012/11/postgis-raster-tabulate-area.html

    ReplyDelete
  4. Anthony-

    I think the answer to your question -- "why does the first attempt produce the wrong [and smaller] answer" -- lies in how ST_Intersection() works. The way you used it, it first runs ST_DumpAsPolygon() to vectorize the raster, then runs ST_Intersection(geom,geom). ST_DumpAsPolygon() works by looking at each tile in your raster independently. It then lumps all of the cells in that tile with the same value into a single polygon (a PG "union").

    This means that if you have five cells in a tile with population of 100, they will turn into a single polygon, spatially identical, but attributed with a pop of 100 instead of 500. Thus, the total population can be underrepresented with this approach. The tile size also comes into effect because ST_DumpASPolygon() operates independently on each tile -- so your answers vary slightly.

    I think the missing piece from the PostGIS documentaion is that ST_DumpASPolygon() is really only suited for nominal/categorical (e.g., land cover) or ordinal rasters, but not other numerical data types

    ReplyDelete
  5. Oh, and another undocumented nugget: apparently ST_DumpAsPolygon() also rounds decimal values to the nearest integer...

    ReplyDelete
  6. Just the answers I was looking for. Thanks, Mike!

    ReplyDelete
  7. I was just using your query under Step 8 (on different data), and found that I was getting different results from ArcGIS Zonal Statistics for a ~10% of my zones. It seems that the PostGIS query can grab cells along polygon edges -- even with the ST_Clip in place, it still doesn't operate on cell centroids. My workaround was to throw in a negative ST_Buffer of a nominally small distance (e.g., -0.000001 decimal degrees) within the ST_Clip, and the results now match Zonal Stats.

    ReplyDelete