PostGIS raster statistics with PLR
Although it is simple to move data from PostGIS in and out of R using ODBC there are some moments when it would be nice to use an R function within the database. This is easy enough in Ubuntu. First install the postgresql-plr package and then run the query through your data base to add the language (plr.sql goes into /usr/share/postgresql/9.1/extension in Ubuntu 12.04)
sudo apt-get install postgresql-9.1-plr psql -d YourDataBase -f /usr/share/postgresql/9.1/extension/plr.sql
Now you can use R code (with a few additional complications) directly within PostGIS. This is useful when a query produces multiple values. In PostGIS raster this can occur in two ways. A query can produce an array of raster values, or a query may use a group by clause. The latter requires an aggregate function, but the first is very simple as it produces an array that R will treat as a vector.
So, for example, if we want to extract all the NDVI values for a Mexican Ecoregion using the Modis data imported in the previous post we could write.
select desecon3,mon,yr,(st_dumpvalues(ST_Clip(rast,geom, true))).valarray from (select extract(month from fdate) mon,extract(year from fdate) yr,* from modis.ndvi) m, (select * from ecoregions where desecon3 like '%Depresion Central de Chiapas%') S where st_intersects(rast,geom)
This produces an array of all the NDVI values for each combination of polygon, month and year. The data type is 16 bit signed integer, which has a theoretical range of values from -32,768 to +32,768. The documented data range is from -2000 to +10000. What that means in practice is that all negative values can be discounted as being produced by water or no data. Very low positive values (< 1000) are derived from urban areas or totally barren desert. So, to look at NDVI dynamics for vegetated areas we may want to impose a cutoff at around 1000 to ignore water and built up areas (taking the raw 16 bit values).
In R we would just write
x<-x[x>1000] mean(x)
Using PLR we can do exactly the same thing.
CREATE OR REPLACE FUNCTION ndvi_mean (float[]) RETURNS float AS ' x<-arg1 x<-x[x>1000] mean(x)' LANGUAGE 'plr' STRICT; select desecon3,mon,yr,ndvi_mean((st_dumpvalues(ST_Clip(rast,geom, true))).valarray) from (select extract(month from fdate) mon,extract(year from fdate) yr,* from modis.ndvi) m, (select * from ecoregions where desecon3 like '%Depresion Central de Chiapas%') S where st_intersects(rast,geom)
Of course you could use any other R functions in the same way, providing they return a useful value.
True aggregate functions (that work on many rows rather than an array) are just a bit trickier to set up. The classic example is a function that returns the median (http://www.joeconway.com/plr/doc/plr-aggregate-funcs.html)
This requires two steps. First the r function that does the work is declared in exactly the same way as above.
create or replace function r_median(_float8) returns float as ' median(arg1) ' language 'plr';
Then an aggregate function is added.
CREATE AGGREGATE median ( sfunc = plr_array_accum, basetype = float8, stype = _float8, finalfunc = r_median );
So if (for demonstration sake) we want to take the median value of the mean_ndvis for each month and year in the data set this will do the job.
create or replace function r_median(_float8) returns float as ' median(arg1) ' language 'plr'; CREATE AGGREGATE median ( sfunc = plr_array_accum, basetype = float8, stype = _float8, finalfunc = r_median ); CREATE OR REPLACE FUNCTION ndvi_mean (float[]) RETURNS float AS ' x<-arg1 x<-x[x>1000] mean(x)' LANGUAGE 'plr' STRICT; select yr,mon,median(ndvi_mean) from (select desecon3,mon,yr,ndvi_mean((st_dumpvalues(ST_Clip(rast,geom, true))).valarray) from (select extract(month from fdate) mon,extract(year from fdate) yr,* from modis.ndvi) m, (select * from ecoregions where desecon3 like '%Depresion Central de Chiapas%') S where st_intersects(rast,geom)) foo group by yr,mon order by yr,mon;
If this is turned into view (let’s call it test) then we can import the results into R proper using RODBC
library(RODBC)
con<-odbcConnect("modelling")
d<-sqlQuery(con,"select * from test")
ndvi<-ts(d$median,start=c(2002,7),frequency=12)
plot(stl(ndvi,s.window=12,t.window=30))
This suggests that there is no overall trend in NDVI values in the Central Depression once seasonal fluctuations are accounted for. The yearly changes are minor and the final upturn is an artefact of incomplete data for 2012.
Loading Modis NDVI time series into PostGIS raster
One of the motivations behind the development of PostGIS raster is the need to effectively handle time series of large raster layers. This has always represented something of a challenge for desktop GIS, particularly once the layers become too large to fit into memory. Modis NDVI data is a good example.
The data can be downloaded from the NASA reverb site
http://reverb.echo.nasa.gov/reverb
Using the reverb “shopping cart” provides a list of files that can be downloaded using wget with the -i option. I.e
wget -i data_url_script_2012-12-08_112521.txt
Nasa provides time series of NDVI at a range of temporal and spatial resolutions For work over a large region the 1km resolution is good enough.
The easiest way to handle the files is to use Nasa’s own MRT tool (Modis Resampling Tool), downloaded from the same site. This is Java based. It requires a system restart on Linux to register the path for using commands after installing.
Once all the files have been downloaded I filtered out the HDFs using the following shell command.
mkdir hdfs
ls|grep "hdf$"|xargs -I '{}' mv {} hdfs
After entering the hdfs directory I used R to direct the mosaicing and resampling. I wanted to extract a list of unique dates from the files first. This involved splitting the file names and extracting parts of the string.
setwd("/media/extdrive/Modis/hdfs")
f<-function(x)unlist(strsplit(x,"\\."))[2]
fnames<-dir(pattern="*hdf*")
dts<-unlist(lapply(fnames,f))
yrs<-substr(dts,2,5)
datestring<-strptime(paste(yrs, substr(dts,6,8)), format="%Y %j")
# f2<-function(x)substr(unlist(strsplit(x,"\\."))[5],1,7)
# dts<-unlist(lapply(fnames,f2))
lst<-unique(dts)
date_lst<-unique(datestring)
Now, that done, carefully pasting into the MRT commands within a loop results in reprojected Geotiffs for each month. It does take a little effort to get right as MRT is quite sensitive to the format of it’s prm project files so cat has to be used with just the right spacing.
finaldest<-"/home/duncan/rasters/modis2/"
for (i in 1:length(lst)){
fls<-paste(fnames[dts%in%lst[i]],"\n",sep="")
outfile<-"temp.hdf"
cat(fls,file="temp.txt")
command<-paste("mrtmosaic -i",getwd(),"/temp.txt -s 1 -o ",outfile,sep="")
system(command)
outfile<-paste(finaldest,date_lst[i],".tif",sep="")
cat("
INPUT_FILENAME = temp.hdf
OUTPUT_FILENAME = ",outfile,"
RESAMPLING_TYPE = NEAREST_NEIGHBOR
OUTPUT_PROJECTION_TYPE = GEO
DATUM = WGS84
",file="temp.prm")
system("resample -p temp.prm")
}
Now comes the more interesting bit.
Bborie very helpfully explained Postgresql partitioned tables to me. This is a great way to handle large amounts of identical data that you may want to query as a whole without ending up with a single unfeasibly large table. The Postgresql manual provides very clear guidance based around a similar example.
http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
So to use partitioning you first set up a parent table from which each subsequent table inherits its characteristics. This can be done directly from R if you want to.
library(RODBC)
con<-odbcConnect("modelling")
query<-"create schema modis2;"
odbcQuery(con,query)
query<-"CREATE TABLE modis2.ndvi (
id serial NOT NULL PRIMARY KEY,
rid int not null,
fdate date not null,
rast raster
);"
odbcQuery(con,query)
Now each sub table needs to be set up as inheriting from the parent table with constraints. I used a table for each year whose values were set from a temporary table used during import. Again careful pasting in R is needed in order to add the appropriate elements to the commands and queries within a loop. The dates are first loaded to the temp file using the -F flag, which adds the file names. The previous code had given the files correctly formatted dates, so it was then just a case of stripping out the first characters and coercing them to dates.
The index is built on the sub-tables.
setwd(finaldest)
fls<-dir()
yrs<-substring(fls,1,4)
for(i in unique(yrs)){
query<-paste("create table modis2.ndvi_",i,"(CHECK (fdate >= DATE '",i,"-01-01' AND fdate <= DATE '",i,"-12-31' )) inherits (modis2.ndvi);",sep="")
print(query)
odbcQuery(con,query)
subfls<-fls[grep(i,fls)]
for (f in subfls){
command<-paste("raster2pgsql -s 4326 -d -M -R /home/duncan/rasters/modis2/",f," -F -t 500x500 temp|psql -d models",sep="")
print(command)
system(command)
query<-"alter table temp add column fdate date; update temp set fdate = substring(filename,1, 10)::date;"
print(query)
odbcQuery(con,query)
query<-paste("insert into modis2.ndvi_",i," (rid,fdate,rast) select rid,fdate,rast from temp;",sep="")
print(query)
odbcQuery(con,query)
}
query<-paste("CREATE INDEX modis_ndvi-",i,"_spindex ON modis2.ndvi_",i," using gist(st_convexhull(rast));",sep="")
print(query)
odbcQuery(con,query)
}
In the next post I will show how these layers can be queried in PostGIS. Although it is still not possible to visualise PostGIS raster layers clearly in QGIS, the QGIS value tool can be a useful way of looking quickly at the original Geotiffs.
Ten fold speed up using PostGIS2.1.SVN
One of the tasks that is commonly used in most spatial analysis at some point involves obtaining zonal statistics from a vector on raster overlay. This has always been rather inelegant using GRASS or Arc as it involves several steps. A PostGIS raster query simplifies the work flow.
However in some cases my queries were taking much longer to run than expected using PostGIS2.0. The issue concerned ST_Clip. The function was originally written in pgsql. There is now a C version, but I had installed PostGIS on Ubuntu 12.04 using the ppa:sharpie/postgis-nightly repository. That package has not been rebuilt since October.
So I decided that I just had to compile the SVN version from source. This was very quick and easy after tracking down a handful of missing libraries reported on the first failed make. In case it is useful to others who don’t want to try compiling, here is the deb package that I built last night on Ubuntu 12.04 precise 32 bit.
And one today on the 64 bit machine.
The difference is impressive. The following query, shown on my post a couple of days ago, took over a minute to run, which was forgiveable due to the complexity of the multipolygon geometry that covered several tiles, but rather disappointing.
select binomial, (pvc).value covclass, sum((pvc).count) pcount FROM (SELECT binomial,rid, ST_ValueCount(ST_Clip(rast, geom, true)) pvc from (select binomial,geom from redlist_species where binomial like 'Baccharis_lati%') v, globcov1000x1000 r where st_intersects(rast,geom) group by binomial,rid,geom )foo group by binomial,covclass order by binomial,covclass
Now it takes just under 5 seconds!
So I know what I have been missing out on! Many thanks to Pierre and Bborie for pointing this out to me, and of course for all the work they put into programming PostGIS raster. Now all that I need to really show it off is that stable gdal driver.
An easy way of updating is to install the package using Ubuntu software centre (I apologise for any notice you may get that the package is of “bad quality”, this is due to poor documentation not the quality of the build). and then run the SQL files found in /usr/share/postgresql/9.1/contrib/postgis-2.1 in the data base you want to upgrade from PGAdmin as shown on this post.
The 140 or so complex multipolyons overalain on the original resolution raster now runs through in three minutes.
Building more informative climate layers for species distribution modelling
The Worldclim data set is a fantastic resource for species distribution modelling. Not only does it include quality controlled monthly temperatures (minimum and maximum) and precipitation data but it also provides 16 derived bioclimatic layers that are designed to be universally applicable and interpretable. These layers are quite simple functions of the original data and can be easily recalculated from the raw monthly variables. I have previously used an additional derived variable for modelling aimed at representing the length of the growing season in the tropics. This was the number of months with more than 100mm rainfall. However I was never very happy with the arbitrary nature of this variable. The growing season really should be based on soil water balance in some way.
Mario Gonzalez, who was trained in agronomy has always insisted that actual evapo-transpiration is a key variable that determines both tropical productivity and biodiversity (Gonzalez et al 2004). I implemented a regional “bucket model” for Chiapas some time ago but I did not use it in publications as I was slightly uncertain as to whether the necessary simplifications could be justified. However experience using machine learning to derive SDMs has convinced me that more mechanistic approaches to deriving variables are absolutely necessary in order to avoid finding “pseudo-niches” with no biological justification from combinations of variables that just happen to coincide with points where species are found.
So I looked again at my implementation of a regional water balance model in order to derive layers for input for SDMs.
A visual comparison seems to produce a very convincing match to regional patterns of plant growth as shown by the NASA Blue marble images. I am now fully convinced that with a little additional work verifying output against Modis derived NDVI its use in modelling can be robustly justified.
Here is an animation of a 12 month bucket model.
The units are (very approximately) soil moisture in mm in a profile that assumes 500mm to be saturated and 200mm to be permanent wilting point (i.e no growth possible).
And a visual comparison can be made with the Modis derived Blue Marble layer (more formal analysis with NDVI to come in a later post).
Side by side (note that September and November Blue Marble are missing from the current Nasa WMS.
The visual match is quite compelling, with the exception that a full (soil moisture) bucket in the temperate winter (Dec to Feb) of the Eastern US does not coincide with high NDVI (blue marble greenness) due to cold temperatures at these latitudes. This could be built into a model as well. However the main purpose of the bucket model taken alone is to provide input for modelling species distribution in Mexico where winters are less pronounced in terms of temperature differences.
The R code for the model is available here.
http://rpubs.com/dgolicher/2964
And the low res resampled WorldClim layer for input to this code from here.
https://dl.dropbox.com/u/2703650/Analyses/BucketModel/mintmantprec.tif
There is still a lot of work needed to justify this approach formally, but it certainly looks very promising. A simplified informal argument goes like this.
A key determinant for many plant species is length of the growing season. In the tropics the growing season is limited mainly by water availability rather than temperature.
A conventional rule is that precipitation in the tropics exceeds evapotranspiration during months with over 100mm of rainfall. We have used this variable as an addition to the bioclim variables in previous published modelling studies. However this variable does not take into account the true water balance. It introduces serious bias at higher latitudes and elevations. It will also produce unrealistically sharp boundaries, as does any variable based on an arbitrary categorisation of what is in fact a continuum. Several of the other Bioclim variables suffer from the same defect.
Evapotranspiration is a complex process that integrates all climatic variables. Empirical measurements of pan evaporation produce an estimate of potential evapotranspiration (PET). This can also be calculated from first principles using the Penman Montieth equation and its derivatives. However, actual evapotranspiration (AET) is often much lower than PET as it is affected by soil water availability. Eco-physiologists and micro-meteorologists can measure (and model), AET quite accurately for relatively restricted areas such as fields and some forest canopies using methods such as eddy flux covariance. However, scaling up to larger spatial extents is challenging.
Precise quantitative measures of AET are impossible to produce without detailed knowledge of soil depth, soil texture, cloudiness, albedo among other factors. However, for species distribution modelling we need to include a layer that accurately depicts relative (rather than absolute) differences in the process over appropriate spatial and temporal scales in a realistic manner. It is the pattern that is vital to model accurately, not than the absolute pixel values.
One possibility would be to use Modis derived NDVI. The NDVI is correlated with productivity, as is AET, and areas with high NDVI have higher transpiraation rates than areas with low NDVI. The problem with this is that NDVI is also affected by land use. So it integrates two processes at once. There is an additional issue in irrigated areas. So while NDVI can, and should be considered in models at some point it is not the ideal input to a climate only based model, although taking the highest values provides a useful baseline for validation.
Empirical measures of PET are available from some climate stations. Unfortunately measures of pan evapotranspiration are notoriously inconsistent. This explains the absence of an evapo-transpiration layer in WorldClim. It would be unreliable if based on empirical data alone.
The Priestley-Taylor equation is often used to estimate PET over large areas. Although the simple form takes temperature and latitude as input, the equation does work with an estimate of net radiation so takes into account the seasonal variation of incoming radiation. It is implemented in the EcoHydRology package in R and can be adapted to run over Wordlclim layers.
## Input data
The input to the model is a 36 band Geotiff containing the Worldclim variables minimum temperatures (months 1:12), maximum temperatures (1:12) and precipitation.
library("dismo")
library("EcoHydRology")
clim<-brick("mintmantprec.tif")
names(clim)<-c(paste("mint",1:12,sep=""),paste("maxt",1:12,sep=""),paste("prec",1:12,sep=""))
## Calculating latitude in radians
The latitude of each pixel can be obtained from the coordinates. The function takes latitude in radians as input.
lat_rad<-coordinates(clim)[,2]*pi/180
## Calculating monthly PET
The following code takes the function in EcoHydRology and applies it to a day at the midpoint of each month (assuming 30 day months for simplicity as there is no need for false precision) .
for(i in 1:12){
evap<-raster(clim,1)
Tmax<-values(subset(clim,i+12))/10
Tmin<-values(subset(clim,i))/10
d<-data.frame(day=(30*i)-15,Tmin,Tmax,lat_rad)
d[is.na(d)]<-0
Es_PET<-PET_fromTemp(Jday=d$day,Tmax_C=d$Tmax,Tmin_C=d$Tmin,lat_radians=d$lat_rad)*1000
values(evap)<-Es_PET
if(i==1){PET<1){PET<<-addLayer(PET, evap)}
}
## Pattern of PET
months<-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
names(PET)<-months
plot(PET)
## Estimating AET using a simple bucket model
Actual evapotranspiration is always lower than potential evapotranspiration. It can be much lower when the soil profile is well below field capacity. As soil moisture falls to close to the permanent wilting point AET will tend to zero, as no water can be extracted from the soil. The biggest challenge involved in calculating AET for large areas is that soil moisture is unknown. It can never be known accurately as so many factors are involved. However, after a run of days in which rainfall exceeds evapotranspiration soil moisture should rise towards field capacity. As this happens AET approaches PET. When rainfall ceases AET should drop towards zero as the soil profile gradually dries out. This is the basis of watershed models such as SWAT. A simple but effective way to represent this uses a “bucket model”. Once the bucket is full further rain leads to run off, and once the bucket is empty no more water can be lost. Watershed models are concerned with run off, so it is important to simulate pulses of input due to heavy rain. However a regional model can ignore this for the purpose at hand and simply add mean daily rainfall and subtract evapotranspiration to the bucket. AET can then be adjusted by an Alpha factor that mutiplies PET by 1 when the bucket is full and zero when empty.
In order to initialise the model a “burnin” period is needed in order to allow the bucket to find its level before monitoring the contents. The model is run at daily time steps and the results recorded at the mid point of each month in order to produce 12 layers to be added to WOrldclim.
Bucket<-raster(PET,1)
png("bucket%00d.png",height=800,width=900)
for (n in 1:2){
for (i in 1:359){
mn<-1+i%/%30
NewAET<-raster(PET,1)
NewBucket<-values(Bucket)
rain<-values(subset(clim,24+mn))/30
alpha<-(NewBucket-200)/300
evap<-values(subset(PET,mn))*alpha*0.8 ##A fudge factor for stomatal control.
NewBucket500]<-500
NewBucket[NewBucket<200]<-200
values(Bucket)<-NewBucket
values(NewAET)200)
if(n>1 && (i%%30)-15==0){
print(plot(Bucket,main=months[mn]))
plot(cont,add=T)
if(mn==1){AET<1){AET<<-addLayer(AET, NewAET)}
}
}
}
dev.off()
names(AET)<-months plot(AET)
And that very simple model seems to more or less reproduce a useful pattern that has integrated ALL the Worldclim variables in one simulation plus derived radiation patterns that result from latitude.
Importing Modis fire pixels into PostGIS
Modis data are “big data” in every respect. Both the temporal and spatial extent of the information obtained from Modis makes data processing rather challenging.
A quick overview of the Modis QuickFire data can be obtained by running the following code in R.
library(date)
for (yr in 2005:2012){
startday<-as.numeric(mdy.date(1,1,yr))
for (i in seq(1,370,by=10)){
URLBase<-paste("http://rapidfire.sci.gsfc.nasa.gov/firemaps/",sep="")
flnm<-paste("firemap.",yr,sprintf("%03.0f", i) ,"-",yr,sprintf("%03.0f", i+9), ".8192x4096.jpg", sep="")
URL<-paste(URLBase,flnm,sep="")
cat(URL)
datlab<-date.ddmmmyy(startday+i)
system(paste("wget",URL))
label<-paste("convert ",flnm," -background orange -pointsize 500 -gravity Center label:'",datlab,"' -append ",flnm,sep="")
system(label)
}
}
A complete archive of all pixels recorded as having possible fire activity (quickfire data set) can be downloaded using.
wget - r ftp://fuoco.geog.umd.edu/modis/C5/mcd14ml/* --ftp-user=fire --ftp-password=burnt
The files are strait ascii with the format given in the manual written by Louis Giglio that is available from the site.
The wget line will fill a folder with 562 files representing 3 GB of information representing suspected fires detected by Modis at a global scale dating from November 2000 to August 2012.
The data columns are
This can be simplified for many uses to the essential columns. Date (day) lat, lon, fire intensity and confidence.
So create a table to hold the key data.
CREATE TABLE modis_fire ( day date, lat double precision, lon double precision, frp double precision, conf double precision ) WITH ( OIDS=FALSE );
Now to import it all through R Assuming that you are in the directory where the data is stored and the download has all been unzipped (gunzip *).
Start R and add the following.
library(Date)
library(RODBC)
con<-odbcConnect("modelling")## Change this for your ODBC connection
f<-function(x){
d<-read.table(x,head=T)
day<-as.Date(as.character(d$YYYYMMDD),format="%Y%m%d")
dd<-data.frame(day,lat=d$lat,lon=d$lon,FRP=d$FRP,conf=d$conf)
write.table(dd,"temp.csv",,row.names=F,sep=",",col.names=FALSE)
fl<-paste(getwd(),"/","temp.csv",sep="")
a <-paste ("COPY modis_fire FROM \'",fl,"\' DELIMITERS \',\' CSV;",sep="")
odbcQuery(con,a)
}
Now the function should run through and load all the data into your table by just “lappylying” it to a list of files in the directory.
aa<-dir() lapply(aa,f)
However this table is very large (> 2GB). We can reduce it to the continent we are interested in and index it for North and South America.
create table modis_fire2 as select * from modis_fire where lon - 160 and conf > 70; SELECT AddGeometryColumn( 'modis_fire2', 'geom', 4326, 'point', 2); update modis_fire2 set geom = ST_SetSRID(st_point(lon,lat),4326); CREATE INDEX modfire_ind ON modis_fire2 USING GIST (geom); ALTER TABLE modis_fire2 ADD COLUMN id serial NOT NULL PRIMARY KEY; CREATE INDEX modfiredate_ind ON modis_fire2 (day); drop table modis_fire;
Now to look at all the fires in Chiapas since 2000.
There is clearly a lot more work to be done on this data, but it is now in a format that is more easily interrogated.
Loading raster data in PostGIS and visualising the results in QGIS
I have been using PostGIS (on and off) for several years now, so I am fairly comfortable with the basics of spatial SQL. However I only began looking at PostGIS raster a couple of weeks ago. My current knowledge of PostGIS raster is summarised in the figure below.
I am hardly in the position to provide authoritative advice on how best to use PostGIS raster. However, it is always easier to help others with problems while you are still in the process of solving them yourself. Once things become routine it is tempting to provide terse “RTFM” answers to beginners questions when you have forgotten that things are not really obvious at all.
In the case of PostGIS raster I think that it is far to say that the “FM” is still a work in progress. Those who write with the most authority on the subject are still very busy working on the code itself. So, in order to help postgrads and postdocs using PostGIS with me I have provided an explanation of how to load data into PostGIS raster, in the hope that it may also be of some use to others at the base of a rather steep learning curve.
I will use an example that may be helpful for one of the students working with me. The task is to load the GLOBCOVER2008 land classification for the Andes into PostGIS and then extract percentages of each land cover for each polygon representing the extent of occurrence of some species of interest.
The layer of interest is restricted to the Andes.
The layer I received had already been filtered in some way in order to include only pixels over 1500m. However the colour shown when I added it to a QGIS project flagged up an issue. A quick investigation using the value tool in Quantum GIS showed that this particular layer had been reclassified to give pixels below 1500m a value of zero, rather than clipped out so that they were treated as no data values. This would cause problems, as the zeros will then be treated as an additional (non existent) land use type if there is already a no data value (not zero) that is formally registered for the layer in the header.
So, it is always a very good idea to use gdalinfo before importing any layer in order to check the details. The syntax is simple.
In a terminal type gdalinfo followed by the name of the file. ( For newcomers to Linux you open a terminal in Ubuntu 12.04 by pressing control + alt + T. Then you can move to the folder where the data is using cd to avoid having to type the full path.)
gdalinfo /home/duncan/rasters/andes/altglob2.tif
So, just as I suspected there is a conflict. I was right to be suspicious. The registered no data value is 255, not zero. I have the full global coverage so I could just run the correct masking procedure. An easy way to correct the layer as sent is simply to use gdal_translate to produce a new layer with the correct no data value. This can be done directly on the command line, or in the GUI by choosing the relevant option in Quantum GIS from the raster menu.
gdal_translate -a_nodata 0 -of GTiff /home/duncan/rasters/andes/altglob2.tif /home/duncan/rasters/andes/altglob2a.tif
This sets the intended no data values to real no data values. The contrast between the pixels inside the rectangle and those outside now vanishes when viewed in QGIS.
Now that this issue has been fixed, the corrected layer can be loaded into PostGIS using a command typed, (or pasted), into the terminal.
raster2pgsql -s 4326 -d -I -C -M -R -l 4 /home/duncan/rasters/andes/altglob2a.tif -F -t 1000x1000 globcov1000x1000|psql -d latam2
In my laptop the command registers the raster layer that I have placed in my /home/duncan/rasters/andes directory into a database that I already have created in PostGIS called latam2. It also loads an overview layer for quicker, but less accurate, processing.
This long and complex looking command has to be broken down into bits in order to explain what is going on.
Each flag (the letters that start with a -) determines something about the way the data is loaded. In the very near future there will be a GUI to do this, but at present the only way of reliably loading a raster is through the command line. The official documentation is actually very clear on this and I found it very easy to follow. It is worth looking at in detail,
http://postgis.refractions.net/docs/using_raster.xml.html#RT_Loading_Rasters
The flags I usually include are
-d
This drops an existing table with the same name if there is one. As the data is all there in the Geotiff and is not lost this is not a dangerous option at all and it allows you to load the data again using a different tile size easily. So I always use it by default.
-I
Add a spatial index. This is very important for the speed of the queries. There are some special circumstances when you don’t want an index, but usually you do, so keep this.
-C
Enforce constraints. I.e check that everything is properly registered. A very good idea.
-M
Vacuum analyse the table. I.e tidy up as you go. Another good idea.
-R
This is an important flag. If you include it then the actually values of the raster layer are not placed in the database itself. They are “registered” and then read from the original file on disk when requires. This saves space in the database and also (counter intuitively) may actually speed up queries. However it means that the table cannot be transferred without the original file. Furthermore the file must be in the same absolute path if the table is moved from one computer to another using a backup. You must always give the full path of the file when using this option. If you later move the file then the connection is lost. So be careful.
-s 4326
This is the srid (coordinate reference system), which is usually going to be an EPSG code. It is a good idea to be explicit about this, although you must check carefully that you are not over riding the srid defined for the layer. Again use gdalinfo and check carefully.
-t 1000×1000
Critical This is the number of pixels per tile (so 10×10 is ten pixels by ten for each tile, 1000×1000 produces very large tiles) Understanding this is the key to everything. Providing the tile size is not exceptionally small (< 5 x 5) loading a raster with the -R flag will run through in seconds rather than minutes, even for large layers, so the process can be very easily repeated (up arrow in the terminal) if adjustments to the tile size are needed. In some cases different tile sizes are needed to optimise the speed of queries. In very general terms, the smaller the vector element that you are interested in overlaying on the raster, the smaller the tile size should be in order to optimise query speed. It is a little more complex than that, but this is a reasonable starting point. In this case we are interested in a lot of long thin convex hulls that tend to be over 1000 pixels in length. There is no reason for the tiles to be square, and in this case rectangles may prove to be more efficient, but I’ll leave that.
-l 4
This produces an overview layer at a coarser resolution using nearest neighbour resampling. The main intention behind this option is to allow faster visualisation of the layer. You can build many overviews in the same way that you build pyramids for a Geotiff. However, at the time of writing there is no simple way of actually visualise the results from this in Quantum GIS.
The option can still come in very handy however, as it allows you to run queries quickly on a lower resolution version of the data These can be used to test queries quickly or (if the results are good enough for the purpose in hand) they can be used as approximations. Even if the -R flag was used to register the original data outside the data, the layers produced using the -l flag are held within the database.
Overviews will appear in the database as raster layers with a prefix consisting of o_ plus the thinning factor (in this case 4) plus the table name. I.e. here it is going to be o_4_globcov1000x1000
The command specifies the name of the table that is going to be added to the database. I have used globcov1000x1000 in order to document the tile size, but you may have other ideas. It doesn’t matter what you call the table, as long as you remember its name.
So, the first part of this command (prior to the |) produces a query. This prepares all the data for loading.
The short cut that avoids loading the query as a an explicit second step is to pipe the results (using |) strait into a program (psql) that runs queries in the database from the command line.
So, the second part of the command that looks like this.
|psql -d latam2
This short version will work fine providing that you already have permissions to connect to the database and run queries. A longer, more generic, version would include explicit detail regarding the data connection, user name and password.
Although this may all seem extremely complex and esoteric, in practice once you have successfully loaded one layer it is very quick and easy to load another just be pressing the up arrow while in the terminal and changing any options that you need to.
If this has run through without errors then the layer is in now registered in your PostGIS database.
As with all command line stuff it is easy to use a previous command as a model for a later one. So things are not really as complex as they appear, providing you can sort out any initial issues.
But .. I expect you to have some issues at this stage.
If the command won’t run (which is more than likely if you have no experience at all with PostGIS) then the issue is likely to be one, or both of the following.
If you have just installed your first PostGIS data base and still are not sure how it all works then it is almost inevitable that your Linux username has not been registered in the list of database users. You still do not have any permission to use or change the data base until you register your name as a super user.
In this case you first need to add your name to the list of superusers of Postgresql. This can be done directly from the command line but it is probably much clearer if you use PgAdmin. When first setting up PostGIS you must have used the default postgres super user. Log in with that name and password and then add your name to the roles.
Once you are registered as a super user yourself you can then run any command that alters the database using pgadmin, the command line directly or even in Quantum GIS database manager without having to change to the default postgres user.
Another problem is very likely to occur at this stage arises if you have not properly loaded all the raster functions into your database. This may have happened if you missed out a step during installation. Unless you are aware of the structure of Postgis2.+ it is very easy to do.
So, if you keep getting “transaction rolled backed” type errors even after your name is registered as a super user this is the most probable explanation. Don’t panic, it is easily and quickly fixed.
In order to make the logic behind the fix more transparent and easy to follow I suggest that you use PgAdmin rather than the commandline. Open a query window with the database that you are using active.
Then browse to /usr/share/postgresql/9.1/contrib/postgis-2.1
You should see something like this.
Now if you load any of these sql files and run the queries through (green arrow button) you will add all the functions that they contain into your database. Although you do need to be a little bit careful if this a vital production database (i.e. backup first if the database contains all the projects data!) in general you can’t really damage anything at all this way. If the functions are already there you will just get messages to that effect.
If not, then the query will run through and will update your database with all the additional functions that you need and didn’t have previously.
So, make sure that all the raster functions are loaded and updated and then go back to the terminal and run the raster2pgsql line again. It should now work fine.
Notice that you can also use exactly the same procedure to spatially enable a non spatial database through PGadmin or to update to a newer version of PostGIS, providing that you have also installed the relevant compiled binaries, as will happen automatically when you update in Linux.
So, now, assuming that your layer has loaded, what can you actually do with it?
Unfortunately the gdal driver that the Quantum GIS PostGISraster plugin uses is not (yet) working properly so you cannot load the PostGIS layer itself in Quantum GIS. However don’t worry at all about visualising the raster layer through the database. There is no need to do this. You have a perfectly good GeoTIFF that you can use in your QGIS project to look at the layer. The reason for having the layer within the database is to interrogate it and add the results of queries into an analysis.
That said, just out of interest, it is worth looking at how the tiling works in order to understand the format. SO, open QGIS database manager and type this query.
select rid,rast::geometry from globcov1000x1000
This will (instantaneously) produce the geometry for each tile.
If you load the layer you can label the tiles by rid in order to get an idea of how it is all working. Each tile is a row in the database. The raster data themselves are held as a two dimensional matrix within each tile. This format is obviously useful for managing very large raster layers when using a mapserver Individual tiles can be selected and served up to users without having to load the whole layer.
The importance of selecting the right tile size when running computationally intensive operations such as polygon on raster overlays should also be apparent with an example. Let’s select one of the clipped convex hulls of interest.
The multipolygon only intersects 9 out of the 160 tiles. A correctly written query will therefore restrict its attention to these tiles, thus saving a lot of unnecessary computation.
In this case obtaining a count of all the pixels for each land use class that fall within the multi-polygon still takes over a minute, even though the operation is restricted to a few tiles.
NOTE: See comment on speed issues at the foot of the post. Installing the latest version speeds up this query by a factor of 10!
select
binomial, (pvc).value covclass, sum((pvc).count) pcount
FROM
(SELECT
binomial,rid,
ST_ValueCount(ST_Clip(rast, geom, true)) pvc
from
(select binomial,geom from
redlist_species where binomial like 'Baccharis_lati%') v,
globcov1000x1000 r
where st_intersects(rast,geom)
group by binomial,rid,geom
)foo
group by binomial,covclass
order by binomial,covclass
Why is such a simple task apparently so slow? PostGIS indexing should speed up queries! The reason is apparent on closer inspection. The clip in this case leads to many very tiny polygons that need to be aggregated, so the computer is doing much more work than you might at first expect. … But with the new ST_Clip it still runs in a reasonably short time
When PostGIS takes longer than you expect to complete a query it is always because you have given it more to do than you first realised (although in this particular case there are may also be some underlying performance issues with St_Clip that are being fixed HAVE been fixed!). Given the large extent of the polygons in comparison to the raster resolution you could simply take advantage of the nearest neighbour overview layer in order to use a sample of the pixels and run the query in a fraction of the time.
select
binomial, (pvc).value covclass, sum((pvc).count) pcount
FROM
(SELECT
binomial,rid,
ST_ValueCount(ST_Clip(rast, geom, true)) pvc
from
(select binomial,geom from
redlist_species where binomial like 'Baccharis_lati%') v,
o_4_globcov1000x1000 r
where st_intersects(rast,geom)
group by binomial,rid,geom
)foo
group by binomial,covclass
order by binomial,covclass
Now it completes in just 2 seconds, with 1 in 16 pixels being used. This would make little difference to the overall result in this particular case. There are so many other causes of uncertainty that make attempts at precision misleading.
PostGIS en la “nube”
Los que asitieron en el curso de bases de datos especiales en Ecosur vieron unos ejemplos de analisis con capas espaciales que estan basicamente libre y en el dominio publico (con la sola restricción de registrarte como usuario con el Conabio y el GBIF. HAY QUE HACERLO ANTES DE USAR EL VINCULO A LOS DATOS ABAJO).
EL mensaje del curso fue que se puede aplicar los metodos a tus datos particulares. Tambien se reforzo la idea de que en el mundo contemporaneo ya no es aceso a la informacion lo que es poder. Es mas bien saber que hacer con la información.. Entonces el analisis, mas que el acceso a datos, es la clave y es esto lo que hace la diferencia entre investigación original que produce conocimiento nuevo y simplemente curiosidad.
La diferencia mayor entre el uso de bases de datos espaciales en linea (o en una red local) y el procesamiento de datos al nivel local es la idea de compartir tanto datos y funcionalidad entre usuarios. Durante el curso repite la frase “no debes reiventar la rueda” constantemente. Igual puedo haber dicho “no debes pasar tiempo bajando y reformateando datos“.
La conectividad por el internet y el uso del “cloud” es lo que hace estos conceptos ya parte de la vida cotidiana. Asi que he puesto las capas basicas que usamos en el curso dentro de un servidor publico en la nube usando el plugin QGIS cloud de Sourcepole. Los resultados son visibles en tu browser siguiendo el vinculo abajo.
https://qgiscloud.com/ecosur/chiapas2
Atras de este mapa existe una base de datos en PostGIS puesto en el servidor de Sourcepole en Suiza. Se puede entrar la misma base (con algunos restricciones de uso) con los siguientes detalles.
dbname=’qadcox_tsnsjt’ host=spacialdb.com port=9999
user=’qadcox_tsnsjt’ password=’Pide los detalles y te los doy’
No se permite hacer funciones en la base, pero se puede usarlo en una forma libre para probar unos de los ejemplos en el curso si no tienes acceso a la base de datos institucional en Ecosur.
Espero que este sea util para los que tengan curiosidad de seguir aprendiendo SQL espacial en otros unidades.
























