Loading raster data in PostGIS and visualising the results in QGIS

I have been using PostGIS (on and off) for several years now, so I am fairly comfortable with the basics of spatial SQL. However I only began looking at PostGIS raster a couple of weeks ago. My current knowledge of PostGIS raster is summarised in the figure below.

I am hardly in a position to provide authoritative advice on how best to use PostGIS raster. However, it is always easier to help others with problems while you are still in the process of solving them yourself. Once things become routine it is tempting to give terse “RTFM” answers to beginners’ questions, having forgotten that things are not really obvious at all.

In the case of PostGIS raster I think it is fair to say that the “FM” is still a work in progress. Those who write with the most authority on the subject are still very busy working on the code itself. So, in order to help the postgrads and postdocs using PostGIS with me, I have written up an explanation of how to load data into PostGIS raster, in the hope that it may also be of some use to others at the base of a rather steep learning curve.

I will use an example that may be helpful for one of the students working with me. The task is to load the GLOBCOVER2008 land classification for the Andes into PostGIS and then extract percentages of each land cover for each polygon representing the extent of occurrence of some species of interest.

The layer of interest is restricted to the Andes.

The layer I received had already been filtered in some way in order to include only pixels over 1500m. However, the colour shown when I added it to a QGIS project flagged up an issue. A quick investigation using the value tool in Quantum GIS showed that this particular layer had been reclassified to give pixels below 1500m a value of zero, rather than clipped out so that they were treated as no data values. This would cause problems, as the zeros would then be treated as an additional (non-existent) land cover type, given that a different no data value (not zero) was already formally registered for the layer in the header.

So, it is always a very good idea to use gdalinfo before importing any layer in order to check the details. The syntax is simple.
In a terminal, type gdalinfo followed by the name of the file. (For newcomers to Linux: you can open a terminal in Ubuntu 12.04 by pressing Ctrl + Alt + T. You can then move to the folder containing the data using cd, to avoid having to type the full path.)

gdalinfo /home/duncan/rasters/andes/altglob2.tif

So, just as I suspected, there is a conflict: the registered no data value is 255, not zero. I have the full global coverage, so I could simply rerun the correct masking procedure on that. However, an easy way to correct the layer as sent is to use gdal_translate to produce a new layer with the correct no data value. This can be done directly on the command line, or through the GUI by choosing the relevant option from the raster menu in Quantum GIS.


gdal_translate -a_nodata 0 -of GTiff /home/duncan/rasters/andes/altglob2.tif /home/duncan/rasters/andes/altglob2a.tif

This sets the intended no data values to real no data values. The contrast between the pixels inside the rectangle and those outside now vanishes when viewed in QGIS.
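It is worth running gdalinfo again on the new file to confirm the change (the grep simply filters the output down to the no data line):

gdalinfo /home/duncan/rasters/andes/altglob2a.tif | grep -i nodata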

Now that this issue has been fixed, the corrected layer can be loaded into PostGIS using a command typed (or pasted) into the terminal.

raster2pgsql -s 4326 -d -I -C -M -R -l 4 /home/duncan/rasters/andes/altglob2a.tif -F -t 1000x1000 globcov1000x1000|psql -d latam2

On my laptop this command registers the raster layer that I have placed in my /home/duncan/rasters/andes directory in a database called latam2 that I have already created in PostGIS. It also loads an overview layer for quicker, but less accurate, processing.

This long and complex looking command has to be broken down into bits in order to explain what is going on.

Each flag (the letters that start with a -) determines something about the way the data is loaded. In the very near future there will be a GUI to do this, but at present the only way of reliably loading a raster is through the command line. The official documentation is actually very clear on this and I found it very easy to follow. It is worth looking at in detail:

http://postgis.refractions.net/docs/using_raster.xml.html#RT_Loading_Rasters

The flags I usually include are

-d

This drops an existing table with the same name if there is one. As the data is all still there in the GeoTIFF, nothing is lost, so this is not a dangerous option at all, and it makes it easy to load the data again using a different tile size. I always use it by default.

-I

Add a spatial index. This is very important for the speed of the queries. There are some special circumstances when you don’t want an index, but usually you do, so keep this.

-C

Enforces constraints, i.e. checks that everything is properly registered. A very good idea.

-M

Vacuum analyses the table, i.e. tidies up as you go. Another good idea.

-R

This is an important flag. If you include it then the actual values of the raster layer are not placed in the database itself. They are “registered” and then read from the original file on disk when required. This saves space in the database and also (counter-intuitively) may actually speed up queries. However, it means that the table cannot be transferred without the original file. Furthermore, the file must be in the same absolute path if the table is moved from one computer to another using a backup. You must always give the full path of the file when using this option. If you later move the file then the connection is lost. So be careful.

-s 4326

This is the srid (coordinate reference system), which is usually going to be an EPSG code. It is a good idea to be explicit about this, although you must check carefully that you are not overriding the srid defined for the layer. Again, use gdalinfo and check carefully.

-t 1000x1000

Critical. This is the number of pixels per tile (so 10x10 is ten pixels by ten for each tile, while 1000x1000 produces very large tiles). Understanding this is the key to everything. Provided the tile size is not exceptionally small (< 5x5), loading a raster with the -R flag will run through in seconds rather than minutes, even for large layers, so the process can very easily be repeated (up arrow in the terminal) if adjustments to the tile size are needed. In some cases different tile sizes are needed to optimise the speed of queries. In very general terms, the smaller the vector element that you are interested in overlaying on the raster, the smaller the tile size should be in order to optimise query speed. It is a little more complex than that, but this is a reasonable starting point. In this case we are interested in a lot of long thin convex hulls that tend to be over 1000 pixels in length. There is no reason for the tiles to be square, and in this case rectangles may prove to be more efficient (see the sketch after this list of flags), but I'll leave that for now.

-l 4

This produces an overview layer at a coarser resolution using nearest neighbour resampling. The main intention behind this option is to allow faster visualisation of the layer. You can build multiple overviews in the same way that you build pyramids for a GeoTIFF. However, at the time of writing there is no simple way of actually visualising the results of this in Quantum GIS.
The option can still come in very handy, however, as it allows you to run queries quickly on a lower resolution version of the data. These can be used to test queries, or (if the results are good enough for the purpose in hand) they can be used as approximations. Even if the -R flag was used to register the original data outside the database, the layers produced using the -l flag are held within the database.
Overviews will appear in the database as raster layers named with a prefix consisting of o_ plus the thinning factor (in this case 4) plus the table name, i.e. here it is going to be o_4_globcov1000x1000.
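As a sketch of the rectangular tiling mentioned under -t above, a command along these lines would reload the layer with rectangular tiles (the 2000x500 size and the table name are purely illustrative, not recommendations):

raster2pgsql -s 4326 -d -I -C -M -R -l 4 /home/duncan/rasters/andes/altglob2a.tif -F -t 2000x500 globcov2000x500 | psql -d latam2

Because of the -d flag, each such run simply replaces any previous table with the same name, so experimenting with tile sizes is cheap.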

The command specifies the name of the table that is going to be added to the database. I have used globcov1000x1000 in order to document the tile size, but you may have other ideas. It doesn’t matter what you call the table, as long as you remember its name.

So, the first part of this command (prior to the |) produces a long SQL query that prepares all the data for loading.

The shortcut that avoids running the query as an explicit second step is to pipe the results (using |) straight into a program (psql) that runs queries against the database from the command line.

So, the second part of the command looks like this.

|psql -d latam2

This short version will work fine provided that you already have permission to connect to the database and run queries. A longer, more generic version would include explicit details of the connection, such as the host, port and user name.
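For example, a more explicit version of the second part might look like this (the host and user name here are illustrative; psql will prompt for the password or read it from a .pgpass file):

|psql -h localhost -p 5432 -U duncan -d latam2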

Although this may all seem extremely complex and esoteric, in practice once you have successfully loaded one layer it is very quick and easy to load another, just by pressing the up arrow while in the terminal and changing any options that you need to.

If this has run through without errors then the layer is now registered in your PostGIS database.
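One quick way to confirm the registration is to query the standard raster_columns view in a query window; something like this should show the srid, tile size, and whether the values are held outside the database (the -R flag shows up in the out_db column):

select r_table_name, srid, blocksize_x, blocksize_y, num_bands, out_db from raster_columns;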

As with all command line stuff it is easy to use a previous command as a model for a later one. So things are not really as complex as they appear, providing you can sort out any initial issues.

But… I expect you to have some issues at this stage.

If the command won’t run (which is more than likely if you have no experience at all with PostGIS) then the issue is likely to be one, or both, of the following.

If you have just installed your first PostGIS database and are still not sure how it all works, then it is almost inevitable that your Linux username has not been registered in the list of database users. You do not have any permission to use or change the database until you register your name as a superuser.

In this case you first need to add your name to the list of superusers in PostgreSQL. This can be done directly from the command line, but it is probably much clearer if you use PgAdmin. When first setting up PostGIS you must have used the default postgres superuser. Log in with that name and password and then add your own name to the roles.
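If you do prefer the command line, a one-line sketch would be the following (replace duncan with your own Linux username):

sudo -u postgres createuser --superuser duncan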

Once you are registered as a superuser yourself you can then run any command that alters the database using PgAdmin, the command line directly, or even the Quantum GIS database manager, without having to change to the default postgres user.

Another problem that is very likely to occur at this stage arises if you have not properly loaded all the raster functions into your database. This may have happened if you missed a step during installation. Unless you are aware of the structure of PostGIS 2.+, it is a very easy mistake to make.

So, if you keep getting “transaction rolled back” type errors even after your name is registered as a superuser, this is the most probable explanation. Don’t panic, it is easily and quickly fixed.

In order to make the logic behind the fix more transparent and easy to follow I suggest that you use PgAdmin rather than the command line. Open a query window with the database that you are using active.

Then browse to /usr/share/postgresql/9.1/contrib/postgis-2.1

You should see something like this.

Now if you load any of these sql files and run the queries through (green arrow button) you will add all the functions that they contain to your database. Although you do need to be a little bit careful if this is a vital production database (i.e. back up first if the database contains all the project’s data!), in general you can’t really damage anything at all this way. If the functions are already there you will just get messages to that effect.

If not, then the query will run through and will update your database with all the additional functions that you need and didn’t have previously.
So, make sure that all the raster functions are loaded and updated and then go back to the terminal and run the raster2pgsql line again. It should now work fine.
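If you would rather load the sql files from the terminal than through PgAdmin, psql can run them directly; for example, the raster functions live in rtpostgis.sql under the path given above:

psql -d latam2 -f /usr/share/postgresql/9.1/contrib/postgis-2.1/rtpostgis.sql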

Notice that you can also use exactly the same procedure to spatially enable a non-spatial database through PgAdmin, or to update to a newer version of PostGIS, provided that you have also installed the relevant compiled binaries, as happens automatically when you update in Linux.

So, now, assuming that your layer has loaded, what can you actually do with it?

Unfortunately the GDAL driver that the Quantum GIS PostGIS raster plugin uses is not (yet) working properly, so you cannot load the PostGIS layer itself in Quantum GIS. However, don’t worry at all about visualising the raster layer through the database; there is no need to do this. You have a perfectly good GeoTIFF that you can use in your QGIS project to look at the layer. The reason for having the layer within the database is to interrogate it and feed the results of queries into an analysis.

That said, just out of interest, it is worth looking at how the tiling works in order to understand the format. So, open the QGIS database manager and type this query.


select rid,rast::geometry from globcov1000x1000

This will (instantaneously) produce the geometry for each tile.

If you load the layer you can label the tiles by rid in order to get an idea of how it is all working. Each tile is a row in the database. The raster data themselves are held as a two dimensional matrix within each tile. This format is obviously useful for managing very large raster layers when using a map server. Individual tiles can be selected and served up to users without having to load the whole layer.

The importance of selecting the right tile size when running computationally intensive operations such as polygon on raster overlays should also become apparent with an example. Let’s select one of the clipped convex hulls of interest.

The multipolygon only intersects 9 out of the 160 tiles. A correctly written query will therefore restrict its attention to these tiles, thus saving a lot of unnecessary computation.
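The number of intersecting tiles can be checked with a quick query along these lines (reusing the same tables and species filter as the full query below):

select count(*)
from globcov1000x1000 r,
(select geom from redlist_species where binomial like 'Baccharis_lati%') v
where st_intersects(r.rast, v.geom)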

In this case obtaining a count of all the pixels for each land use class that fall within the multi-polygon still takes over a minute, even though the operation is restricted to a few tiles.

NOTE: See comment on speed issues at the foot of the post. Installing the latest version speeds up this query by a factor of 10!


select
binomial, (pvc).value covclass, sum((pvc).count) pcount
FROM
(SELECT
binomial,rid,
ST_ValueCount(ST_Clip(rast, geom, true)) pvc
from
(select binomial,geom from
redlist_species where binomial like 'Baccharis_lati%') v,
globcov1000x1000 r
where st_intersects(rast,geom)
group by binomial,rid,geom
)foo
group by binomial,covclass
order by binomial,covclass

Why is such a simple task apparently so slow? PostGIS indexing should speed up queries! The reason is apparent on closer inspection. The clip in this case leads to many very tiny polygons that need to be aggregated, so the computer is doing much more work than you might at first expect. … But with the new ST_Clip it now runs in a reasonably short time.
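If you want to see where the time is actually going, prefixing a query with EXPLAIN ANALYZE in a query window will run it and report the timing of each step; this is standard PostgreSQL rather than anything specific to PostGIS. For example, on the expensive inner part of the query above:

EXPLAIN ANALYZE
select binomial, rid, ST_ValueCount(ST_Clip(rast, geom, true)) pvc
from
(select binomial, geom from
redlist_species where binomial like 'Baccharis_lati%') v,
globcov1000x1000 r
where st_intersects(rast, geom)
group by binomial, rid, geom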

When PostGIS takes longer than you expect to complete a query it is always because you have given it more to do than you first realised (although in this particular case there were also some underlying performance issues with ST_Clip that have now been fixed!). Given the large extent of the polygons in comparison to the raster resolution, you could simply take advantage of the nearest neighbour overview layer in order to use a sample of the pixels and run the query in a fraction of the time.


select
binomial, (pvc).value covclass, sum((pvc).count) pcount
FROM
(SELECT
binomial,rid,
ST_ValueCount(ST_Clip(rast, geom, true)) pvc
from
(select binomial,geom from
redlist_species where binomial like 'Baccharis_lati%') v,
o_4_globcov1000x1000 r
where st_intersects(rast,geom)
group by binomial,rid,geom
)foo
group by binomial,covclass
order by binomial,covclass

Now it completes in just 2 seconds, with only 1 in 16 pixels being used. This would make little difference to the overall result in this particular case; there are so many other sources of uncertainty that attempts at greater precision would be misleading anyway.

Comments

    • Thanks Bborie. Pierre told me that there is a C version of ST_Clip already out there but I don’t have it on the build I use for Ubuntu. (ppa:sharpie/postgis-nightly). I am looking forward to trying it when I can upgrade easily. I don’t have time to build from source myself.

      • I decided that I really had to try downloading the SVN version and compiling from source, as I knew that some of the speed issues were already fixed. The build went very smoothly and took only a few minutes once I had found a few dev libraries that were missing. Now the same query that took over a minute runs in 4.5 seconds. Using the overview it completes in under half a second! So the C code for ST_Clip has fixed the performance issue.
