Friday, June 14, 2013

PostgreSQL - Copy Table to Another Server AND Different Schema

Here I present a method for dumping a table from one server/database and loading into another server/database with a different schema.

This is a spinoff of a previous post (PostgreSQL - Copy Table to Another Server) which shows you how to dump a table from one server to another. The major downfall of that was the inability to specify a different schema - the schema of the table being dumped had to exist on the new server and database.

  • schema.table: this is your table to be transferred
  • -O: do not save ownership
  • from_db: from database
  • from_schema: from schema
  • to_schema: to schema
  • to_db: to database

pg_dump -h serverA -t schema.table -O from_db | sed -e '/^SET search_path = /s/from_schema/to_schema/g' | psql -h serverB to_db

Thursday, January 31, 2013

PostGIS 1.5 - Batch Populate Geometry Columns

When creating new spatial tables – say as the result of an analysis – you should register and add constraints to your table. At least this is the proper thing to do… If you are like me, I get a little lazy and the next thing I know I have 50 tables without proper registration. This is fine if you do not plan on using the tables again or simply plan to dump out as a shapefile, but if you plan on keeping them around and using again, you should register and add constraints.

You can do this on one table simply by running: 
SELECT Populate_Geometry_Columns('schema.table'::regclass);

However, if you have many tables, you can run the following on an entire schema:
AS (SELECT table_schematable_name
                FROM information_schema.tables
                WHERE table_schema 'foo' AND table_type 'BASE TABLE')

SELECT Populate_Geometry_Columns((table_schema::text || '.' || table_name::text)::regclass)FROM t;

Tuesday, January 29, 2013

PostgreSQL - Determining Table and Index Size

This is somewhat of a follow up to my previous post. If you are interested in knowing the size of your postgres table and associated indexes, sequences, etc., you can run the following SQL to find out. Enjoy.

SELECT pg_size_pretty(pg_relation_size(pg_class.oid)), pg_class.relname, pg_namespace.nspname,
CASE pg_class.relkind WHEN 'r' THEN 'table' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 'v' THEN 'view' WHEN 't' THEN 'TOAST'
ELSE pg_class.relkind::text END
LEFT OUTER JOIN pg_namespace ON (pg_namespace.oid = pg_class.relnamespace)
ORDER BY pg_relation_size(pg_class.oid) DESC;

Wednesday, November 21, 2012

PostGIS - Raster Tabulate Area

This post compares - in a very basic sense - tabulating raster area in ESRI and PostGIS.

1. Data
- Raster: 90 meter resolution covering most of the contiguous United States. Represents available land for solar development.
- Vector: Native American lands from the Bureau of Indian Affairs. Roughly 300 distinct boundaries.

2. Storage
-ESRI: 113 MB
-PostGIS: Loaded in using 100x100 tiles. Table = 137 MB, spatial index = 19 MB

3. Tabulate Area - Time Trial
-ESRI: tabulate area tool finished in roughly 1 minute.
-PostGIS: Using the SQL below, finished in roughly 13 minutes. Confirmed spatial index was used.

CREATE TABLE tabulate_area AS

       ,SUM(ST_Count(ST_Clip(b.rast, 1, a.the_geom)) pow(90,2)) AS square_meters
FROM tribal_boundaries a
INNER JOIN available_land b ON ST_Intersects(a.the_geom, b.rast)

4. Compare Results
I found that PostGIS likes to automatically snap to the raster whereas ESRI defaults to not snapping. If not snapping in ESRI, results between ESRI and PostGIS are different (as would be expected). If snapping in ESRI, then results are exactly the same. Loaded in the ESRI results and compared the two with the following SQL:

,post.sqm AS post_sqm
,esri.sqm AS esri_sqm
,post.sqm - esri.sqm AS tabulate_diff

FROM tabulate_area AS post
LEFT OUTER JOIN esri ON post.tribe_name = esri.tribe_name;

5. Why is PostGIS so slow? It's not! Read on...
Well...12 minutes slower is simply not what is the reason? I decided to look at the tile size on the raster. My initial thought was, the tighter the tiles, the better the spatial index, so I set it at 100x100 or roughly 9 kilometers ((90cell sizex100tile size)/1000meters). Thinking on it, sure my ST_Intersects() was fast, but my clip was choking on all the individual tiles. So I loaded in the raster again, this time at 1000x1000tile size. Ran the query again and que bueno! New runtime for PostGIS is 1 1/2 minutes.

Lesson learned: think about tile size and the analysis you'll be performing before you load your raster.

Thursday, November 1, 2012

PostGIS - Majority Point Value in Polygon

Here I present a simple method for determining the majority point value within each polygon in PostGIS. This method utilizes the postgres Window function to rank order the count of point values within each polygon.  

--Majority Count of Point Values

,COUNT(*) AS point_count
,RANK() OVER(PARTITION BY polygons.gid ORDER BY COUNT(*) DESC) AS point_value_rank

FROM polygons, points
WHERE ST_Intersects(polygons.geom, points.geom)
GROUP BY polygons.gid, points.value ) i
WHERE i.point_value_rank = 1;  

Oh, and by the way. My SQL has always look terrible because I usually get lazy halfway through edited the color etc (apologies). But now I have come across SQL "Prettify". Neat.

PostgreSQL - Copy Table to Another Server

My previous workflow for copying tables from one server to another was (1) pg_dump on server "A" then (2) SCP to server "B" then (3) restore via psql. Not a terrible workflow, but my new workflow is much more eloquent. 

First, I login to server B (my new home for the data), then pg_dump with connection params set to serverA (current location of data). Then pipe directly to psql with connection params set to serverB a.k.a. localhost. 

pg_dump -t schema.table -h serverA -U username databaseA | psql -h localhost -U username databaseB;

An important side note:

If your postgres databases are of different vintages, you should always pg_dump using the latest vintage i.e. If serverB is 9.1 and serverA is 8.4, dump using serverB. pg_* are usually backwards compatible, but not always forwards compatible. 

Saturday, October 27, 2012

Split CSV into multiple files

Receiving files in a format that is...less than helpful for your project needs is almost an everyday occurrence. On this occurrence I received roughly 30,000 files of hourly solar irradiance data. The files had multiple years all crammed together i.e. file 1 had hourly data from years 2002 - 2008.  In my project I needed to separate the individual years of hourly data - Bash to the rescue.

#Get a list of all CSV files
FILES=$(find -name *.csv)
#loop through the files
for file in $FILES
      #Get the basename of the current file
      #get the path of the current file
      #Split file into yearly data files.
      split -l $file 8760

      #Rename files to appropriate years. Split automatically names your output xaa, xab..and so
      # forth depending on how many files result from your split. If you know how many files
      #will result from your split, its easy to rename appropriately.
      mv xaa  $path"/"$base"_2002.csv"
      mv xab  $path"/"$base"_2003.csv"
      mv xac  $path"/"$base"_2004.csv"
      mv xad  $path"/"$base"_2005.csv"
      mv xae  $path"/"$base"_2006.csv"
      mv xaf  $path"/"$base"_2007.csv"
      mv xag  $path"/"$base"_2008.csv"
  echo wrote $file