Using PostGIS raster to hold WorldClim data

Although raster support for PostGIS has been around for a while, PostGIS 2.0 was the first version to include it by default. There have always been workarounds that allow data from raster layers to be used within PostGIS without importing native raster layers as such, and there are many very effective ways of processing raster data outside PostGIS. So, although the idea of integrating raster data within a database in this way is attractive, I was not sure whether raster support in PostGIS would actually prove to be useful for our own research.

Several years ago I looked at using PostGIS to manipulate WorldClim data:
https://duncanjg.wordpress.com/2010/08/09/import-worldclim-data-into-grass/
https://duncanjg.wordpress.com/2008/07/05/raster-data-in-postgis-2/

These data are commonly used for species distribution modelling. The WorldClim data are now very well integrated into a modelling workflow using the R packages raster and dismo, all thanks to the impressive work of Robert Hijmans.

So, maybe there is no need at all to place WorldClim layers in a database anymore. I have been using Hijmans' raster package for most of the species distribution modelling I do these days. However, although the raster package is fast and convenient, allowing WorldClim data to be kept on disk rather than loaded into memory in R, there can still be some issues with extremely large data sets at fine resolutions. So today I looked at PostGIS raster for the routine task of running a point overlay against multiple fine-resolution raster layers for a large set of records.

The first step was to export the data I have held in GRASS back into multiband GeoTIFFs.


i.group group=bio subgroup=bio input=alt,bio1,bio2,bio3,bio4,bio5,bio6,bio7,bio8,bio9,bio10,bio12,bio13,bio14,bio15,bio16,bio17,bio18,bio19

r.out.gdal -c input=bio output=bio.gtiff

i.group group=clima subgroup=clima input=tmin1,tmin2,tmin3,tmin4,tmin5,tmin6,tmin7,tmin8,tmin9,tmin10,tmin11,tmin12,tmax1,tmax2,tmax3,tmax4,tmax5,tmax6,tmax7,tmax8,tmax9,tmax10,tmax11,tmax12,prec1,prec2,prec3,prec4,prec5,prec6,prec7,prec8,prec9,prec10,prec11,prec12

r.out.gdal -c input=clima output=mintmaxtprec.gtiff

PostGIS raster holds tiles of raster data as rows in a table. A whole raster layer can be held in a single row, or it can be split up into many small tiles. I assumed that a very fine grid of tiles would speed up processing when using a spatial index. The tile size refers to the number of pixels in each tile.
Provisionally the following import seemed to work well using tiles of just 5 x 10 pixels. Larger tile sizes slowed down the point overlay.


raster2pgsql -s 4326 -d -I -C -M bio.gtiff -F -t 5x10 bioclim|psql -d latam
raster2pgsql -s 4326 -d -I -C -M mintmaxtprec.gtiff -F -t 5x10 mintmaxt|psql -d latam
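The trade-off behind the tile size choice is simple to sketch: each tile becomes one table row, so small tiles mean the spatial index narrows a point lookup to a row holding very few pixels, at the cost of many more rows. As a rough illustration (the grid dimensions below are made up for the example, not taken from my actual layers):

```python
import math

def tile_count(cols, rows, tile_w, tile_h):
    """Number of tiles (table rows) a cols x rows raster produces
    when cut into tile_w x tile_h pixel tiles, as raster2pgsql -t does.
    Edge tiles are partial, hence the ceiling division."""
    return math.ceil(cols / tile_w) * math.ceil(rows / tile_h)

# Hypothetical 43200 x 18000 cell grid (global coverage at 30 arc-seconds)
print(tile_count(43200, 18000, 5, 10))     # 5x10 tiles: millions of rows
print(tile_count(43200, 18000, 100, 100))  # 100x100 tiles: far fewer rows
```

With tiny tiles the index does most of the work and each matched row is cheap to read; with large tiles there are fewer rows but each point query has to pull a much bigger pixel block off disk.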

The data are held in a very compact form in PostGIS. The whole database is under 1 GB, while the combined GeoTIFFs are over 3.5 GB.


SELECT pg_size_pretty( pg_database_size( current_database() ) ) As human_size
, pg_database_size( current_database() ) As raw_size;

Then, once in PostGIS, I added a layer with two million species occurrence records. The size of the combined database remained below 1 GB.

Queries using so many bands at once can be rather verbose, as each band must be mentioned explicitly in the query. But this is not a problem as the queries can be recycled once written. It is just a bit tedious the first time.

Running a query for all 2,000,000 or so records at once is not a good idea when testing the concept. However, I found that queries should run quickly enough to be incorporated into a workflow that makes use of all the results, such as species distribution modelling. Even with this extensive coverage at 1 km resolution the overlay returned over 3,000 results per second, so extracting even large subsets of the results is feasible in real time.
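To put that rate in perspective, a quick back-of-the-envelope calculation (assuming the ~3,000 rows per second observed on subsets held up over the whole table; the genus count below is purely hypothetical):

```python
records = 2_000_000  # approximate size of the occurrence table
rate = 3_000         # rows per second observed in the point overlay

# Overlaying every record in one go is a batch job, not interactive
print(f"full overlay: ~{records / rate / 60:.0f} minutes")

# A single genus is a small subset, so it comes back interactively
quercus = 8_000      # hypothetical number of Quercus records
print(f"genus subset: ~{quercus / rate:.1f} seconds")
```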
For example, adding some key bioclim variables to a table for all the species of oaks took under 3 seconds.


select treestab.*,
       st_value(rast, 1, geom) elevation,
       st_value(rast, 2, geom)/10 AnTemp,
       st_value(rast, 7, geom)/10 Coldest,
       st_value(rast, 6, geom)/10 Warmest,
       st_value(rast, 12, geom) AnPrec
from (select * from trees where genus like 'Quercus' and species like '%') treestab,
     bioclim
where st_intersects(rast, geom);

This can be made into a handy function to use in the QGIS database manager, by taking advantage of the fact that building either a table or a view registers a record type that a function can return.

 

create view getbioclim as
select treestab.*,
       st_value(rast, 1, geom) elevation,
       st_value(rast, 2, geom)/10 AnTemp,
       st_value(rast, 7, geom)/10 Coldest,
       st_value(rast, 6, geom)/10 Warmest,
       st_value(rast, 12, geom) AnPrec
from (select * from trees where genus like 'Quercus' and species like 'segoviensis') treestab,
     bioclim
where st_intersects(rast, geom);

CREATE FUNCTION fgetbioclim (char, char) returns setof getbioclim as
'select treestab.*,
        st_value(rast, 1, geom) elevation,
        st_value(rast, 2, geom)/10 AnTemp,
        st_value(rast, 7, geom)/10 Coldest,
        st_value(rast, 6, geom)/10 Warmest,
        st_value(rast, 12, geom) AnPrec
 from (select * from trees where genus like $1 and species like $2) treestab,
      bioclim
 where st_intersects(rast, geom)'
LANGUAGE SQL;

Then just typing this

select * from fgetbioclim('Quercus', 'ol%');

will return basic climate records for all the oak species whose epithet starts with 'ol' (i.e. in this region, Quercus oleoides).

Producing these results in QGIS.

So, a point overlay is very fast. It compares extremely well with equivalents in R and GRASS in terms of speed, with the added advantages of scalability and convenience once the queries have been written and the base data loaded into PostGIS. Notice that no downscaling of the raster using nearest neighbour is necessary in order to run queries in real time on a subset of the data, even with this very large raster layer. The overhead lies in setting up a PostGIS database and learning the basics of spatial SQL.

Finding aggregated stats from a vector overlay is another matter. I have not yet found the best way of optimising queries for this important class of operations, and the official documentation on this is not yet fully consolidated. Hopefully I can provide an example of good practice using PostGIS raster on this important data set. Pierre Racine contacted me in response to this post and has already provided some useful tips.
