Effect of tile size and data storage on PostGIS raster query times

Many PostGIS raster queries now run much faster than previously due to optimisation of the underlying code. However, speed is still greatly affected by decisions made regarding tile size and data storage. This is an update to a post published around a year ago.

Sensitivity tests are easy to run from R using the system.time function.
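A trivial illustration of what system.time returns: the third element, "elapsed", is the wall clock time used for all of the comparisons below.

t<-system.time(Sys.sleep(0.5))
t[3] # elapsed time in seconds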

The RStudio document version of the following post is available here.

http://rpubs.com/dgolicher/8809

The markdown version.

Testing the effect of tile size on query speed

The files needed to run these tests can be downloaded from here.

https://dl.dropboxusercontent.com/u/2703650/Courses/geostats/geostats.zip

Set the working directory to where you placed the files.

setwd("/home/duncan/Dropbox/Public/Courses/geostats")
rm(list=ls())
library(RODBC)
library(rgdal)
library(raster)

Create a temporary database on the fly in R

When using PostGIS as a desktop GIS from R I quite often create a temporary database for the session. This avoids clutter when the only purpose of the database is to provide some functions for data processing.

system("dropdb temp") #If already exists
system("createdb temp")

In order to use an ODBC connection to the temp database you need to add an entry to odbc.ini.

sudo gedit /etc/odbc.ini

Paste the following text into this file and then save it (the driver path below is for 32-bit machines).

[temp]
Driver = /usr/lib/i386-linux-gnu/odbc/psqlodbcw.so
Database = temp
Servername = localhost
Username = postgres
Password = postgres
Protocol = 8.2.5
ReadOnly = 0
Now we can connect to the database from R

con <- odbcConnect("temp")

Add in the PostGIS and PL/R extensions

odbcQuery(con,"CREATE EXTENSION POSTGIS")
odbcQuery(con,"CREATE EXTENSION POSTGIS_topology")
odbcQuery(con,"CREATE EXTENSION PLR")

Load the shapefiles

This is quick for smallish shapefiles. If files are larger and form part of a common workflow it would clearly be better to set up a permanent database and load them once.

com<-"shp2pgsql -d -s 4326 -I shapefiles/oaks.shp oaks| psql -d temp;"
system(com)
com<-"shp2pgsql -d -W LATIN1 -s 4326 -I shapefiles/mex_ecoregions.shp eco| psql -d temp;"
system(com)

Returning a subset of the data to R

We will now pull two spatial objects back into R so that we know that they are exactly equivalent to those that will be used in the subqueries within PostGIS. One convenient way of doing this is to first set up a “getquery” function as was shown in the geostat tutorial. This uses a temporary view and rgdal to return any query as a spatial object.

getquery<-function(query){
 query<-paste("create view temp_view as ",query,sep="")
 odbcQuery(con,query)
 result<-readOGR("PG:dbname=temp", "temp_view")
 odbcQuery(con,"drop view temp_view")
 return(result)
}

polys<-getquery("select * from eco limit 200")
pnts<-getquery("select * from oaks limit 200")

Timing a point overlay in R

If the raster layer is small enough to fit into memory, R will easily beat PostGIS for speed on a point on raster overlay. The raster layer can be placed in memory by turning it into a brick.

r<-raster("/home/duncan/geostats/elevation.tif")
rbrick<-brick(r)
system.time(extract(rbrick,pnts))


## user system elapsed
## 0.008 0.000 0.008

So, that is going to be hard to beat.
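Incidentally, a quick way to check whether a layer is small enough for brick to hold it entirely in RAM is the raster package's canProcessInMemory function:

# TRUE if the raster package estimates the full layer will fit in memory
canProcessInMemory(r)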

A fairer comparison, appropriate for large rasters that could not fit into memory, is to measure the time R takes while keeping the data on disk.

RTimePoints<-system.time(extract(r,pnts))[3]
RTimePoints

## elapsed
## 0.147

Timing a polygon overlay in R

This is much slower, so to give R any chance at all we will use the brick.

RTimePolys<-system.time(extract(rbrick,polys))[3]
RTimePolys

## [1] 50.91

This should be easy to beat.

Add some PL/R functions for aggregation

One of the very nice elements of using PostGIS for polygon on raster overlays is that any R function can be run on the results by PL/R in order to summarise them before they are sent back to the R session.

query<-"CREATE OR REPLACE FUNCTION rmedian (float[]) RETURNS float AS 'xx<-na.omit(x)
median(x)'
LANGUAGE 'plr' STRICT;"
odbcQuery(con,query)

query<-"CREATE OR REPLACE FUNCTION rmean (float[]) RETURNS float AS 'xx<-na.omit(x)
mean(x)'
LANGUAGE 'plr' STRICT;"
odbcQuery(con,query)
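As a quick sanity check (not part of the original workflow) the new function can be called directly on a toy array; the result should be 2.5.

# Call the PL/R median function from SQL on a small test array
sqlQuery(con,"select rmedian(array[1,2,3,10]::float8[])")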


Timed overlays when data is held outside the database

Now run through some timed overlays. For this exercise we can load the raster layer several times, changing the tile size each time.
Note that the -R flag is used. This results in the file being registered in the database while the data themselves remain outside it.

tm_point<-numeric()
tm_poly<-numeric()

tiles<-c(20,50,100,200,400,800,1200)
for (i in tiles){
  # Load the raster at i x i tiles; -R registers the file but keeps the data outside the database
  command<-paste("raster2pgsql -s 4326 -d -M -R /home/duncan/geostats/elevation.tif -F -t ",i,"x",i," elev|psql -d temp",sep="")
  system(command)

  # Point on raster overlay
  query<-"select o.gid,genus,species,st_value(rast,geom) from elev, (select * from oaks limit 100) o where st_intersects(geom,rast)"
  s<-system.time(d<-sqlQuery(con,query))
  tm_point<-c(tm_point,s[3])

  # Polygon on raster overlay summarised with the PL/R median function
  query<-"select gid,rmedian((st_dumpvalues(st_union(st_clip(rast,geom)))).valarray) FROM elev,(select * from eco limit 100) foo WHERE ST_Intersects(rast, geom) GROUP BY gid order by gid"
  s<-system.time(d<-sqlQuery(con,query))
  tm_poly<-c(tm_poly,s[3])
}
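Before plotting, it can be handy to glance at the raw elapsed times side by side (an optional step):

# Collect the timings into a single data frame for inspection
data.frame(tile=tiles,point=tm_point,poly=tm_poly)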

Results

TITLE<-paste("Polygon overlay time \n","R time = ",round(RTimePolys,2),sep="")
plot(tiles,tm_poly,type="b",main=TITLE,xlab="Tile size (pixels)",ylab="Time (seconds)")


TITLE<-paste("Point overlay time \n","R time = ",round(RTimePoints,2),sep="")
plot(tiles,tm_point,type="b",main=TITLE,xlab="Tile size (pixels)",ylab="Time (seconds)")
abline(h=RTimePoints,lty=2,col=2)

[Figures: polygon overlay time and point overlay time against tile size, out-of-db storage]

So, PostGIS clearly wins by around an order of magnitude on the polygon overlays. PostGIS can also just beat R for a point on raster overlay from a file, providing an optimum tile size (between 200 and 400 in this case) is used. This optimum may be sensitive to the file size and other factors.

Timed overlays when data is held within the database

Intuitively I would have thought that queries would run faster when the data values are loaded into the database. However, this is not the case, apparently because the TOAST system used for data storage carries a significant overhead for data retrieval.
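One way to see this overhead directly (an optional check, run after the loop below has loaded elev without the -R flag) is to ask PostgreSQL for the total size of the table, which includes its TOAST storage:

# Total on-disk size of the elev table, TOAST included
sqlQuery(con,"select pg_size_pretty(pg_total_relation_size('elev'))")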


tm_point<-numeric()
tm_poly<-numeric()

for (i in tiles){
  # Load the raster at i x i tiles; without -R the pixel values are stored inside the database
  command<-paste("raster2pgsql -s 4326 -d -M /home/duncan/geostats/elevation.tif -F -t ",i,"x",i," elev|psql -d temp",sep="")
  system(command)

  # Point on raster overlay
  query<-"select o.gid,genus,species,st_value(rast,geom) from elev, (select * from oaks limit 100) o where st_intersects(geom,rast)"
  s<-system.time(d<-sqlQuery(con,query))
  tm_point<-c(tm_point,s[3])

  # Polygon on raster overlay summarised with the PL/R median function
  query<-"select gid,rmedian((st_dumpvalues(st_union(st_clip(rast,geom)))).valarray) FROM elev,(select * from eco limit 100) foo WHERE ST_Intersects(rast, geom) GROUP BY gid order by gid"
  s<-system.time(d<-sqlQuery(con,query))
  tm_poly<-c(tm_poly,s[3])
}

Results

 TITLE<-paste("Polygon overlay time \n","R time = ",round(RTimePolys,2),sep="")
 plot(tiles,tm_poly,type="b",main=TITLE,xlab="Tile size (pixels)",ylab="Time (seconds)")
 TITLE<-paste("Point overlay time \n","R time = ",round(RTimePoints,2),sep="")
 plot(tiles,tm_point,type="b",main=TITLE,xlab="Tile size (pixels)",ylab="Time (seconds)")
 abline(h=RTimePoints,lty=2,col=2)

[Figures: polygon overlay time and point overlay time against tile size, in-db storage]

Polygon on raster overlays take up to twice as long. Point on raster overlays are not only much slower, but the time declines almost linearly with tile size.

Clean up

 odbcClose(con)
 system("dropdb temp")
