Tile size for raster vector overlays in PostGIS
We are looking at PostGIS raster in the context of what is mainly desktop GIS analysis. The intention is to speed up routine operations on large data sets. However we are hoping that elegant data handling will allow us to build some useful data extractor functions that can be shared with other users.
So running efficient vector raster overlays is the key to adopting PostGIS raster instead of using GRASS or even Arc.
The issue of optimal tile size when importing the data needed looking at carefully as it was apparent from the start that it would be critical for speed. There are two types of operations that we are likely to use. The simplest involves extracting raster values at points. The PostGIS query to do that is quite simple. The second involves extracting statistics for all the pixels that fall into a polygon. These queries are rather more complex and the operation is clearly harder to optimise, as efficiency will depend on both the size and complexity of the polygons.
I tested the speed of the query format provided by Pierre Racine by calculating the mean elevation for ecoregion polygons in Mexico using pixels of 0.01 degrees (WorldClim resolution).
First I imported worldclim data using tile resolutions of -t 2×2,10×10,50×50,100×100,200×200 and 400×400. I tried 1×1 but the import hung. All the other import were very quick and smooth and could easily be repeated if necessary.
There are 694 polygons in the ecoregions layer, of varying sizes and complexities. As I was impatient I ran the query first on a subset of 10.
gid,(ST_SummaryStatsAgg(ST_Clip(rast,1, geom, true))).mean
FROM biotiled50,(select * from ecoregions where gid < 11) foo
WHERE ST_Intersects(rast, geom)
GROUP BY gid
order by gid
This was the result.
I gave up on the smallest tiles. The optimum seemed to be around 100 x 100. This gave me the result for all the 694 polygons in just over half a minute.
I thought this was quite impressive given the number and complexity of the polygons. Adding calculations on a second band to the query more or less doubled the time, so some of the time seems to be taken up on the aggregation itself.
I have already shown point overlays in a previous post. My first experiments suggested that very small tiles should speed up the operation through efficient use of the spatial index.
Running this query for the 6700 collections of oaks in Mexico
st_value(rast,1,geom) elevation ,
st_value(rast,7,geom)/10 Coldest ,
st_value(rast,6,geom)/10 Warmest ,
(select * from gbif where genus like 'Quercus') treestab,
biotiled100 where st_intersects(rast,geom)
Produced this relationship between tile size and speed.
Simple enough. The smaller the better. But there doesn’t seem to be much to be gained by going right down to one pixel per tile (which does seem to cause import problems).
So it looks as if at least two versions of raster layers are needed in order to optimise for speed on both operations. Optimum tile size for polygon overlays is also likely to be quite specific to the layer used. However the relationship seemed fairly flat over quite a large range, so I will go for an import at 100×100 and 2×2.
Again, I am impressed by how compact the layers are held in the database, so keeping two versions to hand is not a huge overhead in terms of space.