Duncan Golicher’s weblog

Research, scripts and life in Chiapas

Connecting to a Postgresql data base from R (Ubuntu Natty)

leave a comment »

Just three quick steps in Ubuntu.

1. Install unixodbc and odbc-postgresql

sudo apt-get install  unixodbc

sudo apt-get install odbc-postgresql

2. Edit the blank odbc.ini file. You can just paste in the text below. The first entry is for my own local database. Change the details to match your own if you have one.  If you want access to the online Biotree.net database you need to substitute the real password for the ***** entry.

sudo gedit /etc/odbc.ini

[ODBC Data Sources]
mydb = Database description

[mydb]
Driver = /usr/lib/odbc/psqlodbcw.so
Database = gisdb
Servername = localhost
Username = postgres
Password = postgres
Protocol = 8.2.5
ReadOnly = 0

[ODBC Data Sources]
biotree = Database description

[biotree]
Driver = /usr/lib/odbc/psqlodbcw.so
Database = biotree_development
Servername = apps.iecolab.es
Username = biotree_adv
Password = *******
Protocol = 8.2.5
ReadOnly = 0

[ODBC]
InstallDir = /usr/lib

Step 3. Then in R, assuming that the RODBC library is installed

library(RODBC)

con<-odbcConnect(“biotree”)

You now design a query and run it to get your data.

d<-sqlQuery(con, “select * from gis.biotree_db where genus like ‘Quercus’”)
fix(d)

Note that if you want spatial layers from a PostGIS data base in R you can use rgdal to connect to the data base. The problem is that you can’t run queries from within R using rgdal. You can only import the whole layer. The work around is to use system (ogr2ogr …) with a temp file, but for most purposes it is easier to visualise and edit the layer in Qgis and save as shapefile for importing into R.

What is the equivalent in Windows?

It is a similar process but not quite as simple. You first need to install the odbc driver from http://postgresql.org . The easiest way is to download the msi file that installs itself.

 

Once the driver is installed you just need to set up the connector from the control panel, administrative tools,

You can use ODBC with R in the dame way, and you can also  load Postgresql queries into Excel if you find any reason to do so.

 

Written by Duncan Golicher

October 4, 2011 at 5:17 pm

Nearest neighbour in POSTGIS

leave a comment »

The issue to solve is how to find the name of the nearest blue point (settlements with a population over 100) from the red points (plant collections).

Some interesting ways of solving the problem are provided here.

http://www.bostongis.com/?content_name=postgis_nearest_neighbor_generic

There are some slightly easier way using the current version of PostGIS (1.8).

Breaking down the queries by parts. First find all the towns with a population over 100.

We will call this query “w” and we wrap it up as subquery.

(select * from mexico.conteo2005 where pobtot>100) w

Now we choose all the elements we want to see from the herbarium table together with a calculation of the distance between the two points. The original tables are in epsg:4326 so distances are in degrees. This doesn’t matter for our purposes, but it would be easy to include a CRS transform in PostGIS if we do want distances in km.

select h.gid, nom_loc,genus,species, h.the_geom, ST_distance(w.the_geom,h.the_geom) as dist from mexico.oldherbario h,

The search can be restricted to a distance of 0.1 degrees (approximately 10km) using

ST_DWithin(w.the_geom,h.the_geom,0.1)

And we want to order the results first by the herbarium collection ID and then by the distance.

order by h.gid,dist

Roll this all together and we get

select h.gid, nom_loc,genus,species, h.the_geom, ST_distance(w.the_geom,h.the_geom) as dist from mexico.oldherbario h,
(select * from mexico.conteo2005 where pobtot>100) w
where ST_DWithin(w.the_geom,h.the_geom,0.1)

order by h.gid,dist

So now we just need to get the first entry from each group and we have the nearest neighbour (with a population over 100 and within 10 km … the later criteria could be changed easily if needed, but the query would be slower). We can do that using select distinct. taking advantage of the fact that the function will return the first distinct id for each group, and the data is ordered by distance, so this is the nearest neighbour.

Less than 7 seconds is not bad. The query now looks like this

select distinct on(d.gid) d.* from
(select h.gid, nom_loc,genus,species, h.the_geom, ST_distance(w.the_geom,h.the_geom) as dist from mexico.oldherbario h,
(select * from mexico.conteo2005 where pobtot>100) w
where ST_DWithin(w.the_geom,h.the_geom,0.1)
order by h.gid,dist) as d

Finally if we wish to visualise the results we can form a new table or a new view.

create table chiapas.herblocalities as

select distinct on(d.gid) d.* from
(select h.gid, nom_loc,genus,species, h.the_geom, ST_distance(w.the_geom,h.the_geom) as dist from mexico.oldherbario h,
(select * from mexico.conteo2005 where pobtot>100) w
where ST_DWithin(w.the_geom,h.the_geom,0.1)
order by h.gid,dist) as d

If you are going to keep the table in the data base don’t forget to add entries in the geometry_columns table.

INSERT INTO geometry_columns(f_table_catalog, f_table_schema, f_table_name, f_geometry_column, coord_dimension, srid, “type”)
SELECT ”, ‘chiapas’, ‘herblocalities’, ‘the_geom’, ST_CoordDim(the_geom), ST_SRID(the_geom), GeometryType(the_geom)
FROM chiapas.herblocalities LIMIT 1;

Written by Duncan Golicher

September 25, 2011 at 5:17 pm

Installing PostGIS in Ubuntu Natty

with 2 comments

This will install PostGIS and add the R language through PLR in the current version of Ubuntu Natty 11.04

sudo apt-get install qgis
sudo apt-get install postgresql
sudo apt-get install postgresql-8.4-postgis
sudo apt-get install postgresql-8.4-plr
sudo apt-get install pgadmin3

One small step that is necessary is to change the user password for postgres. You can do this with psql. Care needed here. This has to be done correctly. The following line gets you into psql.

sudo -u postgres psql -d template1

Type this (being very careful with quotation marks and the semicolon).

alter user postgres with password ‘postgres’;

If successul you get a message saying ALTER ROLE (If there is any problem here then retype the single quotation marks to make sure they are simple. WordPress keeps changing them for some reason if they are not in an HTML box).

Now become the postgres user and create a postgistemplate. This is a blank data base into which you will load all the functions and tables needed for POSTGIS. All new data bases created using this template will have these functions loaded.

sudo su postgres
createdb postgistemplate
createlang plpgsql postgistemplate

The PostGIS functions and spatial_ref system are loaded from here on Ubuntu

psql -d postgistemplate -f /usr/share/postgresql/8.4/contrib/postgis-1.5/postgis.sql
psql -d postgistemplate -f /usr/share/postgresql/8.4/contrib/postgis-1.5/spatial_ref_sys.sql

This cpmmand loads in the R language (to use this you must have R installed)
 psql -d postgistemplate -f /usr/share/postgresql/8.4/plr.sql

To create a database under this template
createdb -T postgistemplate -O postgres gisdb

You will also have to alter the following config file

sudo gedit /etc/postgresql/8.4/main/postgresql.conf

Find the line with listen_addresses and uncomment it. If you want the database to be open to other users use’*’ instead of localhost
listen_addresses = ‘*’        # what IP address(es) to listen on;

Download a small test database with the countries of the word from this site with wget. Again the file  is disguised as a word doc in order to go into the wordpress site.

wget http://duncanjg.files.wordpress.com/2008/09/paises.doc

Restore this database. The database was built using an old version of Postgis, so you will see many errors which you can ignore (See http://blog.cleverelephant.ca/2010/09/postgis-back-up-restore.html

sudo -u postgres psql gisdb<paises.doc

Now run qgis and connect to your new PostGIS data base.

 

The trial database with a single countries of the world table can also be downloaded without wget by clicking here.

paises

Written by Duncan Golicher

September 24, 2011 at 4:37 pm

Posted in POSTGIS

Tagged with , , , ,

Updating to Ubuntu 11.04: Solving the sticking cursor

leave a comment »

I now expect Ubuntu to work strait out of the box so I was disappointed by a severe bug in the kernel that the current version of Natty uses.

uname -a
Linux duncan-HP 2.6.39-02063904-generic #201108040905 SMP Thu Aug 4 11:04:48 UTC 2011 i686 i686 i386 GNU/Linux

The issue makes the cursor periodically freeze on the screen for up to a second. The system can still be used, but the experience is extremely frustrating. It turns out that a process called kworker is using 100% of the CPU.  This can be fixed by following the instructions here, http://ubuntuforums.org/showthread.php?t=1796873.

Typing

sudo echo “options drm_kms_helper poll=N”>/etc/modprobe.d/local.conf

and rebooting solved the problem for me.

Notice that if you install this version you get a new “look and feel”. Not to my taste, but it is easy to go back to the old classic at log in.

Written by Duncan Golicher

September 24, 2011 at 2:14 pm

Accuracy of an Android cell phone GPS in the UK

with 4 comments

Even cheap smart phones in the UK now have GPS capability. The rapid expansion in spatially aware devices has been driven by the potential for commercial use of location data.  Location to nearest postcode is good enough for many of these purposes, but more accurate fixes are needed for Sat Nav. Contemporary phones therefore use a combination of network based methods and satellite GPS to obtain a fix. Accuracy, once a GPS fix has been obtained, is assumed to be within 10m.

Students carrying out field research projects could therefore potentially use their own mobile phones to log positions. However are phones suitable for scientific use? Do they provide reliable data? In order to be used for research is vitally important to have some measure of the reliability of positional information. How accurate are cell phone GPS fixes? How does accuracy vary?

I recently acquired two new, but low end Android phones (Hyundai Pulse Mini) for family use. These cost just thirty pounds (600 pesos) each, plus ten pounds phone credit. Almost unbelievably cheap. These phones are clearly bottom end devices. Nevertheless I have been pleasantly surprised by how useful they are.  They provide me with all the usual smart phone services I need, including perfectly acceptable Sat Nav and  good general mapping capability including the recording of apparently reliable track logs and waypoints in gpx format.  The Android operating system is very intuitive to use. The only real problem with the devices themselves is short battery life. But, could they be used for georeferencing research sites?

This was not an easy question to answer as I could find no detailed information online regarding the specifications of the GPS chipset that they contain. It is not even clear whether such phones use EGNOS (the European equivalent of WAAS). All cell phones have “enhanced GPS”, but this refers to the method used for improving the speed of the fix by using network signals in built up areas. EGPS does not improve accuracy as such. So the main issue for research applications is lack of knowledge regarding how precision may vary and what factors influence variability.

The android Market has a large number of useful GPS tools, including the default Google maps,a useful app called “GPS essentials” and the Locus app. A promising little tool for analysing positional error and correcting it is GPS averaging.This tool logs the position at regular intervals and calculates the mean position over a time period. If errors were random over time then the mean should provide a very accurate fix. I tested this by running the app for several ten minute periods.

The good news was that the overall accuracy of the GPS was always within 4 metres of the target. The bad news was that positional errors were correlated over relatively short time periods, so taking a mean did not increase accuracy by much, due to the overall bias over all the measurements.

First run. The mean distance from the known position was 3.28 metres.

Second run. The mean distance was 2.24 metres, but note that the bias has now shifted to the South East.


I tried this test several times over the space of a few days. The pattern was always similar. The GPS fixes tended to have a correlated bias over a period of up to an hour or more. Overall accuracy was always within 5m, but apart from removing the effect of odd extreme outlier, averaging did not improve the precision by much.  A more systematic study over a longer period would provide more quantitative details and even reveal patterns in the bias that may be associated with the satellite positions.

However, I had another idea. I have two very similar phones. It occurred to me that if the positional error turns out to be the same for both I could build my own differential GPS by connecting them using Blue Tooth and sending the position of one to the other. I couldn’t find a pre-built app for this so I tried out the Python scripting environment for Android. This can be downloaded and installed here.

I had never programmed in Python, so the code is very scrappy. However it was not difficult to adapt some of the code given in the example scripts in order to cobble together a very rough and ready proof of concept script.

It ran fine, but the results were  not as I expected. I found that each phone registered quite different positions when placed together (something I could in fact have easily tested before by taking a tracklog or using GPS averaging without  the need to write the Python script!). The error turned out to be as much as 6m, but tended to fall over time to around 50 cm over the space of twenty minutes under a clear sky. So the two phones could not really be used as a differential GPS, although the overall accuracy of either seems to be acceptable for many general types of field work that do not involve site surveying.

Anyway the Python code I came up with is provided below in case it is of use to anyone. The script can be downloaded by clicking here — differentialgps.py

There are some known issues with the code. If these were tackled it might be possible to improve the results I got. The most important defect was that the readLocation() method did not always return a result, so I used getLastKnownLocation() instead, in order to get the script working. This meant that the two fixes may not have been calculated at identical times. Even taking this into account the readings from the two phones had much larger differences than I expected. Another obvious issue with the general approach I used here is the limited range of blue tooth and the fact that I did not add any way of handling exceptions when the Bluetooth signal was lost.

The script must be placed on both phones and run in a console in SL4A. First run the script on one phone to make it a server then connect the second as a client. Close the server first for a clean exit. The results are written to two files on the client phone.

import android
import time
import math

droid = android.Android()
droid.startLocating()

fclient = open('GpsClientLog.txt', 'w')
fserver = open('GpsServerLog.txt', 'w')

droid.toggleBluetoothState(True)
droid.dialogCreateAlert('Be a server or client?')
droid.dialogSetPositiveButtonText('Server')
droid.dialogSetNegativeButtonText('Client')
droid.dialogShow()
result = droid.dialogGetResponse()
is_server = result.result['which'] == 'positive'
if is_server:
 droid.bluetoothMakeDiscoverable()
 droid.bluetoothAccept()
 timelapse = droid.getInput('GPS server', 'Enter number of seconds between readings').result
 while True:
  time.sleep(int(timelapse))
  event=droid.readLocation()
  location = droid.getLastKnownLocation().result
  if location['gps'] is not None:
   location = location['gps']
   method="gps"
  else:
   location = location['network']
   method="network"
  lat=location['latitude']
  lon=location['longitude']
  print method+','+str(lon)+','+str(lat)
  fserver.write(str(lon)+','+str(lat)+'\n')
  droid.bluetoothWrite(method+','+str(lon) +','+str(lat)+ '\n')
 fserver.close()

else:
 droid.toggleBluetoothState(True)
 droid.bluetoothConnect()
 while True:
  ServerPosition = droid.bluetoothReadLine().result
  print ServerPosition
  event=droid.readLocation()
  location = droid.getLastKnownLocation().result
  if location['gps'] is not None:
   location = location['gps']
   method="gps"
  else:
   location = location['network']
   method="network"
  lat=location['latitude']
  lon=location['longitude']
  slon=float(ServerPosition.split(',')[1])
  slat=float(ServerPosition.split(',')[2])
  dist=round(math.sqrt(math.pow((slon-lon)*69930,2)+math.pow((slat-lat)*111122,2)),2)
  print "Approximate distance="+str(dist)+"metres"
  print method+','+str(lon)+','+str(lat)
  fclient.write(time.asctime()+','+method+','+str(lon)+','+str(lat)+'\n')
  fserver.write(time.asctime()+','+ServerPosition+'\n')
 fserver.close()
 fclient.close()

Written by Duncan Golicher

May 8, 2011 at 6:47 am

Import Worldclim data into GRASS

with one comment

Importing wordlclim layers (http://www.worldclim.org/) into GRASS should be straitforward using r.in.gdal. However the data have to be unzipped and  a line has to be added to the headers.

Assuming that the file to be imported is mean temperature at 2.5m resolution

This line unzips your file

unzip tmean_2-5m_bil.zip

Make sure the region is set.

g.region n=89N s=89S e=180E w=180W res=00:02:30

This line fixes the problem with the missing header

for i in $(seq 1 12); do echo “PIXELTYPE SIGNEDINT” >>tmean$i.hdr; done

This line will then import the data into GRASS: Notice that you need to change the name from tmean to the appropriate name in two places (for example input=tmin$i.bil,output=tmin$i)

for i in $(seq 1 12); do r.in.gdal -o input=tmean$i.bil output=tmean$i; done

Finally remove the unzipped files

rm tmean*.bil
rm tmean*.hdr

Click here to download a PDF showing screenshots of the steps.

Click here for a flash presentation.

Written by Duncan Golicher

August 9, 2010 at 4:56 pm

Interpolation of a Digimap DTM using GRASS

leave a comment »

First of all I apologise for the lack of new posts to this web log after moving to the UK. It has been hard to find the time.

I am currently preparing some tutorial material for QGIS/GRASS. One of the first tasks when using data from the DIGIMAP resources is to interpolate a DTM from vector data.

Clicking here will run the tutorial as a flash video (YOU SHOULD USE FULL SCREEN MODE FOR BEST RESOLUTION):

Clicking here will provide a PDF of the tutorial. Again it should be opened in full screen mode. This is probably the best way to follow the steps.

Link to data for the tutorial

Written by Duncan Golicher

July 24, 2010 at 6:07 am

June 2009

leave a comment »

Click on the thumbnails below to go to albums of the family in Wareham.

The barn

thebarn

June2009
corfe
WeymouthMarch
From WeymouthMarch

Written by Duncan Golicher

June 6, 2009 at 4:56 pm

Posted in Uncategorized

La familia Fortolís

leave a comment »

Written by Duncan Golicher

December 28, 2008 at 11:57 am

Christmas with Mickey and Nicole

leave a comment »

Two short videos.

Written by Duncan Golicher

December 21, 2008 at 12:41 pm

Christmas play

leave a comment »

Here is Nicole as a bee in the Christmas play in Pequeño Sol.

Written by Duncan Golicher

December 21, 2008 at 12:35 pm

Posted in Uncategorized

How Lyx works with sweave

with one comment

Below is an animated Gif that should show the way R code can be embedded in a Lyx document and compiled to form the document on reshaping data mentioned in a previous post. Click on the image to run the animation. Click here to see the final document reshapingdata

reshape
The Lyx document can be downloaded here. Change the extension from .doc to lyx.
reshapingdata-copylyx

Written by Duncan Golicher

December 19, 2008 at 10:03 am

Posted in R scripts

Tagged with ,

Blanca navidad

leave a comment »

Written by Duncan Golicher

December 13, 2008 at 7:15 pm

Posted in Uncategorized

La casa

leave a comment »

Written by Duncan Golicher

December 9, 2008 at 3:20 pm

Posted in Uncategorized

Reshaping data in R

with 5 comments

The text below lacks graphics and tables, However just click on the blue first paragraph to download it all as a PDF .

Duncan Golicher

One of the most frustrating and time consuming parts of statistical analysis is shuffling data into a format for analysis. No one enjoys changing data formats. Researchers want to get results, finish the task, move on. Routine reformatting of data is made difficult by complications that could have been avoided. Students who are not instructed in data management, or those who ignore instruction, use spreadsheets to hold data. This practice leads to confusion. The reshape package in R not only help to rescue data lost in overly complex Excel spreadsheets, it changes what could be a boring cut and paste operation into an interesting intellectual exercise in its own right.

Students should be taught to collect and hold data in a standardized “long” form in which there is only one variable for each form of measurement from the start. This resolves most problems and saves hours of tutors time. However even of they do, they still need help in reshaping this data to their own needs. This is especially true if they use SPSS. Data reformatting in SPSS is very complex and confusing and SPSS does not have powerful generic routines for handling very complex operations.

There are many tricks for handling data in R using tapply, lapply, stack, aggregate, by and a range of other functions. However R now has an efficient and powerful syntax for data handling provided by Hadley Wickham’s elegant reshape package. I have only just discovered reshape and I am very impressed. Reshaping operations were always far easier in R than in a spreadsheet, but that wasn’t to say that they were ever that easy. With reshape they are a lot simpler. I now think I can safely trust my students who use R to be able to handle most of the cases they come accross with a little thought and guidance. As an example here is some data extracted from tables of goals scored in the 2006 -2007 and 2007-2008 football season.

http://www.soccerstats.com/adv_scoring_times.asp?league=england

The number of goals scored by each team are identified by the fifteen minute period of the match (0-15 = m15, 15-30 =m30 etc) and the year when the season started. This is not raw data. That would consist of the exact times the goals are scored identified by team. The data tables available on the web site are in a wide format. There was an issue caused by the fact the original tables hold for and against goals together. This had to be resolved using strsplit. However this is a different matter that I don’t go into here. The reshaped data looks like this. Note how xtable can provide nicely formatted tables for reports.

Visualizing data in long format using R

If data is in this standard “long format” everything is just so easy. For example after loading lattice a box plot is produced simply by using the standard model formula.

Or the boxplots can be placed the other way around.

The plotmeans function in gplots is very useful for plotting confidence intervals. Notice this is not Hadley Wickham’s ggplot package, it is Greg Warnes gplots. This has a very convenient function for plotting confidence intervals for means, one of the more difficult common operations in R for students to achieve.

Modelling data in long format

It is particularly nice to have data in long format because the way we visualize the data using formulae corresponds directly to a simple linear model for the data

There is clearly one significant main effect. This is the time in the match. Neither season nor interaction is significant.

The effects package can be used to form a very quick graphical picture of the model that is also an effective alternative to using plotmeans.

Notice that the main message is that more goals are scored before half time and at the end of the match. Some of this may be attributable to a few minutes injury time being added on, but there is a upward trend anyway in both halves.

Reshaping data

The standard “long” format unifies data analysis and visualization in R. Unfortunately we do often need to switch between this convenient long format and wider formats for presentation of data as tables and some forms of analysis.

How does Hadley Wickham’s package reshape help us to do this?

Full technical details of the package are available here

http://www.jstatsoft.org/v21/i12/paper

To kick start the process of using reshape I will provide practical examples as templates, although to understand its full power I recommend the original formal documentation.

The logic of the package is first to “melt” the data into a form that is consistent for any type of data. This is similar to the long form of the data except that one column is used to store the name of the variable and one for the value. You thus always have a number of columns that correspond to the identifying (grouping) variables + 2.

To melt the data you need to specify which are the variables used for identifying cases and which are the measurements. These can be specified as the names of the columns or their numbers. It is often quicker to use numbers.

As Hadley notes, getting data into this form is not always easy. However any properly designed “long” data set will melt like butter. Wide data sets are sometimes more challenging. Once you have the data in this standard “molten” form Hadley Wickham’s formula based approach takes over and is used to reshape it to almost any form you could want. There is something quite magical about seeing data turn itself into exactly what you need with just two lines of code, especially if you have had sufficient previous experience with Excel to understand just how time consuming it can be without reshape to help. Even with tapply, by and aggregate in R things can often be tricky. Under Windows data can be moved from and to Excel through the clipboard, so an R window can easily do the work that pivot tables previously managed in a less elegant way.

The logic of reshape is rather challenging at first. There is no free lunch. It does take a little bit of getting used to, but once mastered it is just so powerful that you will never look back. Not only can you turn long tables into wide tables, but you can easily apply aggregations using the same logic. The trick is that the variables you want as the column names are placed in the second part of a formula. Understanding this and the how to use two special variables “.” and “…” is the key to using reshape. The “…”syntax means “all the variables” and “.” means none. This if all the variables are used to form column names a “wide” format results. Look at the example on the next pages.

Moving to a wide format

Reshape the whole table to a wide format using all the variables. Time will form the columns holding the measurements (number of goals). Notice the way … is being used in the formula.

Aggregating

Sum the goals for each team over the two seasons with Time as column header. In this case note that some teams only played in the premier league for one season. Reshape can quite easily produce a simple count for this by further aggregation and this can be used to subset the data.

Including margins

Margins can also be calculated using reshape. This is a nice feature in some cases when you need to display the data. A bad part of student’s use of Excel is their frequent inclusion of marginal totals within the columns in which the data is held. This breaks a fundamental rule. Everything in a column of data should be the product of the same measurement protocol. However marginal totals are often useful when the data is presented in a final form. This is the right time to calculate them and it is easy with reshape.

Moving from wide to long format

This is one of the most complex operations to get right using the traditional stack. There is often a need for renaming column headings. However it works fine with reshape.

A simple summary of forest inventory data using reshape

A common task is calculating summed basal areas and counts of individuals from forest inventory data. First lets make up a very short simulated data set with just four species in ten plots. The procedure for reshaping data on 400 species in 10,000 plots is exactly the same and takes no more time, although the resulting table would make a very long appendix! Notice that the species names are first kept as a separate table then merged to form the data. It is very important always to do this to avoid mistakes. I often seen students typing species names multiple times, which is clearly a recipe for disaster.

Counts of individuals are simple as the function “length” counts the number of observations in each cell. Notice that you must not include margins=T if you are going to work with the data as a matrix for multivariate analysis as this produces the row and column totals.

More interestingly we can apply a function that calculates and sums basal areas directly to the melted data in the same way.

reshapingdata

Written by Duncan Golicher

December 9, 2008 at 9:57 am

Nicole sings the Madagascar song

leave a comment »

Written by Duncan Golicher

December 7, 2008 at 11:38 am

Testing for normality 2

leave a comment »

As a follow up to the previous post on testing for normality I have put together my take on the solutions to the issues as a set of guidelines available here.

This document clearly needs revision and expansion but I hope it may be useful

guidelines

Written by Duncan Golicher

December 4, 2008 at 9:06 pm

Naturalización como ciudadano Mexicano

with 3 comments

El viernes por fin pasé mi examen de naturalización como ciudadano Mexicano. Estoy muy orgulloso del logro. He querido tomar la nacionalidad del país desde 2005 cuando apliqué. Desde entonces, la Secretaria de Relaciones Exteriores ha perdido mi expendiente en multiples ocasiones y todavía estoy esperando la carta que formalice mi estatus.

Por favor ¡vota! No te cuesta mas de un segundo.

Otro detalle sorprendente fueron las 101 preguntas que tenía que memorizar para asegurar éxito en el examen. La guia de estudio cambió desde que estaba invitado a tomar el examen, asi que tuve que hacerlo dos veces. La segunda guia fue bastante menos lógica que la primera.

Aqui esta la guia de estudio  guia_estudio

El procedimiento es muy arbitrario. El candidato para naturalización recibe cinco preguntas aleatorias de esta lista esoterica. Se permite equivocarse en una. Las preguntas obviamente tienen poca relevancia como forma de evaluar la integración del candidato en la sociedad Mexicana. Hay algunas ambiguas y mal redactadas. Por ejemplo “Que era la cuenca larga”. Supongo que queria decir cuenta.

Si tienes otras sugerencias dejalos como comentarios abajo.

Aqui hay algunos ejemplos mas

Written by Duncan Golicher

November 30, 2008 at 8:34 am

Kolmogorov-Smirnov tests of normality

with 3 comments

Click here for an explanation of the animation below

ks

Today I had an interesting exchange regarding tests of normality when teaching introductory statistics.

I have a dilemma. The point is that I do not place a great deal of stress on tests of normality when I teach Master’s students, although I mention that they are often used. However in introductory statistics texts tests of normality are given a lot more attention, presumably in order to ensure that students are aware that normality is an important assumption for many statistical procedures. I’m all in favour of testing assumptions. But do students really know what assumptions they are testing?
I have to teach introductory statistics without confusing students or sending mixed messages. It is therefore quite a delicate matter that needs clarity.

In fact none of these statements are accurate (and its Smirnoff you are thinking of!). My own  preference is to try teach students to understand why any underlying population or sampling frame might not be normal.They should also intuitively understand how the procedure used for sampling from the population may influence the properties of the sample drawn from the populations.

These properties are then treated as expected before beginning any field work. All data transformation or use of non-parametric tests are pre-planned as part of the formal protocol designed for data collection and analysis.

I really do not like any post-hoc alterations to a planned work scheme after the data are collected. At best they waste time, at worst they lead students to think that the data themselves are somehow “invalid” and thus unpublishable.

I therefore quite strongly dislike including post hoc tests of normality within the work flow of the analysis as a knee jerk procedure with a yes/no answer. This certainly does not suggest that I tell students to assume that all the preplanned analyses are necessarily valid, nor to accept that inference on the mean can be conducted without checking assumptions.

The alternative to automated tests of normality is to make sure that students always visualise the distribution of their data fully in order to understand why any assumption of normality may be wrong. I also try to encourage students understand how and why data transformations might work. Again this is usually most helpful before data is collected, but it is also a way to deal with major surprises.

Here again is the link to the pdf document I wrote that suggests a possible answer to the poll.

ksdemo2

Click on the link above as it is easier to include PDFs in wordpress this way.

An here is a quick test of any interpretation of the results of a KS test of normality.

Just to summarise the well known reason to avoid testing for normality. If you draw a very large sample from a slightly non-normal population the test tends to provide low p-values. You should presumably reject the null hypothesis that the data could have come from a normal population and according to a strict interpretation you then can’t use your planned analysis as it would be “invalid!

However if you draw small samples from very non normal populations (as shown in the pdf) you will not reject the null hypothesis as often, even though the methodology will provide misleading inference.ksdemo3

Draft beginners R course

with one comment

As a follow up to today’s post on Lyx I have produced two presentations for a beginners course using Lyx/Beamer/Sweave.

I developed a very simple working protocol for making Beamer presentations with sweave. Once everything is set up it is simple and productive. I use the standard beamer commands for any slide without R code. I don’t ever try to mix beamer features and R code on the same slide.

Here are two introductory presentations using this approach.

presentation11

presentation2

And the draft document (with gaps)

rcoursedraft1

The source is in the zip file below (remove the .doc and replace with.zip)

r-coursezip

srsclass2

srsclass3

srsclass41

srsclass51

srsclass6

Written by Duncan Golicher

November 21, 2008 at 12:15 am

Lyx and Sweave: Worth climbing the learning curve?

with 8 comments

One of the interesting elements of using Linux is the demystification of many *nix concepts that you might have come into contact previously in a lateral manner. One of these is Latex. Any R user in Windows quickly becomes aware that Latex exists. However Latex is fundamentally a *nix thing. It has even been claimed by Windows users that Latex is a legacy typesetting paradigm!

There are certainly some issues with the complexity of Latex. However there is no denying that Latex documents look impressively formal. Personally I think it is worth the effort to look seriously at latex, even if you are a complete newcomer to Linux.

I was curious about Latex. However I readily admit that I really couldn’t be bothered to learn latex as such. Life is far too short. Fortunately I found Lyx. Lyx is Latex for the lazy. No need to learn all the details, it is advertised as a WYSWYM latex typesetting program (What you see is what you mean).

So how do you use Lyx? The first important tip is to make sure that you install ABSOLUTELY, BUT ABSOLUTELY, EVERYTHING remotely relevant to TEX and Latex before starting anything with Lyx. Fonts, languages, the lot. This  avoids later frustrations with missing sty files etc. I can’t remember offhand all the packages I needed, but do go to Synaptics and look for texlive-full plus anything else remotely related with either tex or latex. This will mean quite a long initial download but you certainly won’t regret it. Without all the Tex/Latex stuff installed the Lyx experience can be frustrating. You are likely to get at least a few annoying and incomprehensible error messages when compiling documents. This could give you the feeling that the whole system is more trouble than it is worth. You will add about 500 MB to your install in total, but that shouldn’t be a problem.

The next step before using Lyx seriously for work is to analyse your own ability with Word or Open Office. The idea of using Lyx is to save time in the long run. Experienced Office users can probably replicate everything Lyx or Latex can do. However many of us do have problems writing long documents such as theses or technical reports in Office suites. It becomes very difficult to maintain a consistent style. I now find Lyx easier to use than any office program for long document. The logic of using Lyx is to improve productivity, not make simple tasks much more difficult. Define what was most difficult for you to use in an Office suite and find a consistent way to achieve it in Lyx. My main problems were with positioning figures on the page and producing correctly structured tables of content. Both are very easy in Lyx once you have found out how, but there is an initial learning curve.

Then follow the examples in the Lyx documentation and allow some time for experimentation.

Sweave in Lyx

Most R users will  be very interested in the use of Sweave with Lyx. This has been made possible by the work of Gregor Gojanc. I have only recently realised just how cool this can be. Once all the Latex extras are installed in Ubuntu following the instructions in “INSTALL” here worked well.

http://cran.r-project.org/contrib/extra/lyx/

A faster way of getting to this point would be to download the file below which contains my .lyx directory hidden as a zip file with fake doc extension. Replace the whole of the .lyx directory you have in your home directory by mine. Remember to use control H to see hidden files. You might want to back up your original directory first, but the replacement shouldn’t cause any issues.

lyxzip1

Then do tools/reconfigure in Lyx and you should be able to use Sweave. My directory also has a layout incorporated for making beamer files that I got from here.

http://ggorjan.blogspot.com/2008/09/using-beamer-with-lyx-sweave.html

Here is an example

beamersweave

The source in Lyx that made this is available here (again take of the doc extension and change it to .lyx), You’ll need beamer-latex installed to compile it and you will have to provide your own logo picture.

beamersweavelyx

And here is an early draft of a course for beginners to R that I am writing in Lyx/Sweave. If the documents compile then you have everything installed correctly.

rcoursedraft1lyx

Again there is a bit a sharp curve up to get Lyx and Sweave working. You will need to read the sweave manual first (A Google for Sweave will provide other material)

It is admitedly hard to get Sweave/beamer working as this is still quite an experimental concept. I found that not all the beamer features work, but you can still get some nice looking slides. The easiest way to use beamer for the time being is to adapt my template by cutting and pasting the box of “ERT”  for every new slide, changing the R code within them to what you need. Some styles such as Title and Author stop the document compiling, so don’t change the title page much. Remember to put your own logo in the graphic. The huge payoff in terms of time saved producing slides is made by \begin{frame}[shrink,containsverbatim]. The shrink option automatically adjusts R output to the page. Great as long as you don’t print a huge object. Also notice the use of xtable and <<results=tex>> in code chunks to produce formatted tables.

Inserting a histogram into a normal Lyx/Sweave document like the R course is very simple. In an ERT box you just need to write.

<<fig=T>>=

hist(x)

@

The great thing about this way of working is you know all the R code you are showing actually runs, as they document won’t compile if you’ve made a typo or syntax error.

I will be mentioning Lyx with bibtex in a later post as this is another potentially major productivity booster that may need some help to use at the start.

A presentation for the first class on the introductory R course I am writing is now available here

presentation1

Written by Duncan Golicher

November 20, 2008 at 8:28 am

Posted in Linux, R scripts

Tagged with , , , , , ,

Accidente 2: El valor de la verdad en Chiapas.

with one comment

Ayer perdí todo el día de trabajo en el Ministerio Público y la Delegación de Tránsito “arreglando” el asunto del accidente de Adriana.

Dado el hecho de que el ajustador del seguro  Homero Trujillo parecía tener mas simpatía con el grupo de taxistas que con sus clientes, contactamos a la agencia de seguros para retirarlo de caso y mandaron una abogada de Tuxtla para intervenir en su lugar.

Decidimos que haríamos todo lo posible por evitar procedamientos largos en el Ministerio Público, con la exepción claro, de nunca mentir sobre la causa del accidente. Los hechos ocurrieron exactamente como mostrado en la maqueta y había varios testigos del evento.

Al final la Licenciada Ferrara  resultó muy profesional y ademas agradable. Ella entedió nuestra posición, pero también entendía bien las reglas del juego en SCLC. Logramos un arreglo en el cual aceptamos que nuestra aseguradora  pagaría el daño al taxista pero sin aceptar reponsabilidad por causar el accidente.

Para liberar el coche de corralón yo tenía que ir a la Delegación de Tránsito. Fue allí que las cosas empezaban a complicarse de neuvo. El Sub-director de tránsito insitió que teníamos que pagar una infracción por haber causado el accidente. Esta insistencia fue en contra de nuestra decisión de respetar siempre los hechos. Discutí el asunto, explicando que Adriana no me ha contado una mentira en todo el tiempo que la conozco y no creo que ella ha cambiado de su costumbre. El señor insistió que su agente había hecho un periitaje correcto, y los “hechos” contados por el taxista eran verificables. De acuerdo de este versión no había un tercer coche en el accidente, Adriana había cometido una imprudencia “impidiendo el tránsito” y que si había testígos padres del Pequeño Sol, esos no valían, pues son todos iguales.

Hay ciertos limites a mi capacidad de denegrarme para tener la vida facil. Claramente no estaba dispuesto de aceptar ese dictamen, asi que parecía que estabamos condenados a regresar al Ministerio Público. Pero en el último momento salió a la luz otro detalle. La licencia de conducir de Adriana estaba vencida. Esto ocurrió porque nunca hay material para renovar licencias en SCLC.  La oficina de transito aparentemente acaba de recibirlo pero hay que hacer cita, La fecha mas cercana es a mediados de noviembre. Mi esposa estaba esperando todavia la cita. Pero el hecho de que su licencia estaba vencida era un hecho verdadero. Asi que yo dije al Subdirector que estaba dispuesto de pagar la infracción si dejaba de acusar a  mi esposa de haber causado el accidente. Todos felices. Aparte del detalle que el taxista, un hombre irresponsable y un peligro para otros conductores y pasajeros, salió completemente sin culpa del asunto.

Written by Duncan Golicher

September 27, 2008 at 5:37 pm

¿Software libre para el UNICH?

with one comment

La Universidad Intercultural de Chiapas es una institución nueva en San Cristóbal con la misión  de proporcionar mas oportunidades de educación superior de calidad a los jóvenes de la zona. El término “intercultural” viene del reconocimiento que la zona de los Altos de Chiapas esta compuesta de diversos grupos étnicos.

La UNICH recientemente me contactó para ayudar a analizar una propuesta para un laboratorio de información geográfica. El presupuesto inicial fue de 809,000 pesos. Una parte sustancial de este gasto (135,000 pesos) estaba en el rubro de Software. No estaba bien calculado, dado de que la intención fue comprar solamente una licencia para cada elemento. Este corría el riego de uno de dos resultados. 1) El uso muy restingido del laboratorio. 2) La tentación de usar software sin haber pagado correctamente por los derechos. Hoy en día herramientas geográficas deben estar puesto a la disposición de todos, particularmente los jóvenes procedentes del ámbito rural alrededor de San Cristóbal.

Hoy escribí una nueva propuesta para el laboratorio basado en el uso de Software libre. Esta disponible haciendo clic abajo.

unich

Written by Duncan Golicher

September 27, 2008 at 3:13 pm

Como hacer un gif animado en Linux

with 4 comments

Curiosamente algunos personas ya me preguntaron no solamente sobre el acontecimiento de ayer,

http://duncanjg.wordpress.com/2008/09/25/accidente-en-el-cruce-del-pequeno-sol/

Tambien querian saber como hice la maqueta! Es muy facil y rapido en Linux.

Instalas imagemagick

sudo apt-get install imagemagick

Pon las fotos que quieres convirtir un directorio. Cambia a este directorio. Las fotos  deben estar numerado  en el orden  que quieres. En el caso de mi camera son de *.JPG

Este linea luego produce el gif.

convert -geometry 400 -delay 100 -loop 0 *.JPG choque3.gif

Cambia el 400 si quieres un imagen mas grande o mas pequeño y el 100 (microsegundos) de pausa entre cada uno.

Muy facil. Ademas mucho mas rapido que entrando en un programa que te pide importar y exportar las fotos y arreglarlos en pantalla. En este caso nada mas tenia que abir dos de las fotos antes en Gimp para añadir las luces de la camioneta.

Aqui hay un imagen mas alegre usando este comando.

Accidente en el cruce del pequeño sol

with 2 comments

(El siguiente episodio esta disponible aqui http://duncanjg.wordpress.com/2008/09/27/accidente-2-el-valor-de-la-verdad-en-chiapas/)

Hoy nuestra famila sufrió un evento muy estresante. Adriana esta involucrada en un accidente de tránsito en el  cruce del Pequeño Sol.

Afortunadamente el choque por si fue relativamente leve. Ni Adriana ni Mickey estaban lesionados. Hasta el daño al coche no fue tan severo. Pero desgraciadamente la forma en la cual el policía de tránsito y el representante del agencia de seguros manejaron el asunto ha dejado mucho daño psicológico.

Adriana estaba parada en el crucero esperando que una camioneta diera vuelta en frente de ella a la izquierda. El conductor de la camioneta dio señales con sus faros para dar  prioridad a Adriana de dar vuelta a su propia izquierda. Adriana no estaba convencida y no se movió. El conductor dio el señal una segunda vez. Entonces ella empezaba a cruzar. En este momento un taxi  atrás de la camioneta se desesperó y aceleró  a toda velocidad rebasando ilegalmente por el lado derecho de la camioneta. El taxi (con matrícula 31-68- BHC y calcomanía del alien)  embistió el auto de Adriana. Había múltiples testigos del accidente. Todos conicidieron exactamente con la versión mostrada en la animación en esta página. No había  duda sobre lo que pasó y tampoco había ninguna otra explicación viable al accidente. Desafortunadamente el conductor de la camioneta se fue,  evidentemente no quería involucrarse. Estamos buscandolo para ayudar en aclarar los acontecimientos a los autoridades. El oficial de transito niega aceptar la existencia del tercer vehiculo dado de que no estaba presente cuando el llego.

El taxista claramente había actuado en una forma muy imprudente y peligrosa, dejando que su impaciencia dominaba todo sentido común. Adriana no podía ver el taxi por el tamaño de la camioneta. Además ella estaba señalando que daba vuelta a la izquierda. El taxista ni la vió.

Al principio el evento parecía un poco triste y frustrante pero fácil de resolver. El coche estaba asegurado y además Adriana no tenía la culpa. Sin embargo se complicó mucho al llegar el policía de tránsito, el agente de seguros y un montón de compañeros del taxista. Mientras Adriana todavía estaba en estado de shock ellos hablaron entre sí. Dijeron que Adriana había causado un accidente por “impedir el tránsito libre”. Al final no daban otra alternativa a Adriana mas que firmar un papel en el cual ella se responsabilizaba del evento y en el cual ella solo estipuló:  -acepto la “parte de responsabilidad” que me corresponde.

El último daño a nuestra tranquilidad pasó a las cinco de la tarde. Ya Adriana estaba tan estresada  que no podía enfrentar la situción mas. Yo fui a una reuníon en la delegación de tránsito con el represesntante de seguros Tepayac, un tal señor  Homero Trujillo.

Explique mi inquietud sobre la forma que mi esposa había sentido forzada de aceptar responsiblidad por un accidente que ella no causó, en contra del interés de su propia compañía de seguros de evitar pagos innecesarios. El me explicó claramente que no había alternativa aparte de pasar dos meses sin coche esperando un dictámen legal del Ministerio Público, el cual seguramente ibamos a perder por el poder político del dueño del taxi como líder del cooparativo.

Lo curioso es que a pesar de la rabia que sentia acepté los argumentos por pragmatismo. Dije al Sr. Trujillo que firmaré culaquier papel para que no se alarga el asunto. Necesitamos el coche. Al entrar en un proceso legal se puede retener en el corralon durante varios meses. Pero el Sr. Trujillo claramente ya tenía otra agenda distinta. Salió a llamar por celular al dueño del taxi. Cuando llegaron los taxistas ya no estaban dispuestos de hacer nungún trato. El señor Trujillo  dijo que nosotros no debemos  preocuparnos sobre el costo del proceso legal largo que se estaba empezando, la compañía Tepeyac pagará todos los gastos asociado con los procedimientos legales. Pero no tendríamos el coche durante meses y tendríamos que pasar varios dias de cada semana esperando turno en el ministerio.

Si yo fuera accionista en la compañía Tepeyac estaría cuestionando mucho esta forma de operación. Una compañía de seguros tiene la obligación de no aceptar un pago innecesario. El taxista no estaba asegurado, asi que para nosotros sería mas difícil recuparar el daño al coche si estabamos esperando el pago por parte del dueño del taxi. Pero estariamos dispuesto de aceptar este riesgo si la taxista nos da la razón. El causo el accidente, no fue Adriana. La retención del coche en el corralón (500 pesos por día) nos cuesta mucho mas en tiempo perdido y obviamante mucho mas a  la misma compañía de seguros. ¿Cuales intereses realmente representa Sr Trujillo?¿Los avogados, la compania Tepayac, o otros?





x

Written by Duncan Golicher

September 25, 2008 at 11:40 pm

Follow

Get every new post delivered to your Inbox.