Some challenges of scraping cricket data from online sources

Duncan Golicher

2017-8-13

https://rpubs.com/dgolicher/crickdata_package

The way that data is shared online is a constant source of unnecessary frustration. Last weekend, while watching the Test match between England and South Africa, I became curious about the characteristics of batsmen’s innings (see my posts on RPubs), and decided to satisfy my curiosity by getting the original data. Now, anyone working with such data must keep it neatly and tidily in a relational database, with, as a minimum, tables for the individual players and for batting, bowling, fielding and match statistics. Querying the database directly would require some knowledge of the relational structure, so it makes sense to help users by providing a web interface to run the queries. ESPN’s Cricinfo does just that. But why does it have to produce tables that must be downloaded page by page? And why does the formatted data mix numbers with text? Instead of providing separate columns for the total runs scored, the number of wickets and whether the innings was declared or abandoned, Cricinfo “helpfully” gives the score in the conventional format, such as 625/7 d for 625 for seven declared. The result was that it took me the best part of last Sunday afternoon to get some very routine data tables into shape using regular expressions, grep and gsub.
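To give an idea of the kind of clean-up involved, here is a minimal sketch in base R of splitting a conventional score string into separate columns. The function name parse_score is hypothetical and is not part of the package; it only assumes that a trailing "d" marks a declaration and that the digit after the slash is the number of wickets.

```r
# Hypothetical helper, not part of crickdata: split a conventional score
# string such as "625/7 d" into numeric and logical columns.
parse_score <- function(score) {
  score <- gsub("[[:space:]]+", "", score)        # "625/7 d" -> "625/7d"
  declared <- grepl("d$", score)                  # trailing "d" marks a declaration
  total <- as.integer(sub("^([0-9]+).*$", "\\1", score))
  # A wicket count is only present when the score contains a slash
  has_wkts <- grepl("/", score, fixed = TRUE)
  wickets <- rep(NA_integer_, length(score))
  wickets[has_wkts] <- as.integer(
    sub("^[0-9]+/([0-9]+).*$", "\\1", score[has_wkts])
  )
  data.frame(Total = total, Wickets = wickets, declared = declared)
}

parse_score(c("952/6d", "849", "625/7 d"))
```

Scores without a slash are left with a missing wicket count rather than a guessed value.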

As I now have this data to hand, this Sunday I took another hour out to build a small R package to share it easily. It has three data objects: batting, bowling and innings. All are up to date as of 5 August 2017 and contain all the relevant data from ESPN Cricinfo in a fairly self-explanatory format (if you refer to the original data tables).

Use devtools to install the package, with its very rudimentary documentation.

library(devtools)
install_github("dgolicher/crickdata")

Data tables

library(crickdata)
data(bowling)
head(bowling)
##                Player Overs BPO Mdns Runs Wkts Econ Inns     Opposition
## 1      JC Laker (ENG)  51.2   6   23   53   10 1.03    3    v Australia
## 2    A Kumble (INDIA)  26.3   6    9   74   10 2.79    4     v Pakistan
## 3    GA Lohmann (ENG)  14.2   5    6   28    9 2.33    2 v South Africa
## 4      JC Laker (ENG)  16.4   6    4   37    9 2.22    2    v Australia
## 5 M Muralitharan (SL)  40.0   6   19   51    9 1.27    1     v Zimbabwe
## 6  Sir RJ Hadlee (NZ)  23.4   6    4   52    9 2.19    1    v Australia
##         Ground  Start.Date Country       Date Year Day Month Yday type
## 1   Manchester 26 Jul 1956     ENG 1956-07-26 1956  26     7  208 Test
## 2        Delhi  4 Feb 1999   INDIA 1999-02-04 1999   4     2   35 Test
## 3 Johannesburg  2 Mar 1896     ENG 1896-03-02 1896   2     3   62 Test
## 4   Manchester 26 Jul 1956     ENG 1956-07-26 1956  26     7  208 Test
## 5        Kandy  4 Jan 2002      SL 2002-01-04 2002   4     1    4 Test
## 6     Brisbane  8 Nov 1985      NZ 1985-11-08 1985   8    11  312 Test

data(batting)
head(batting)
##                  Player Runs Mins  BF    SR Inns     Opposition
## 1          BC Lara (WI)  400  778 582 68.72    1      v England
## 2       ML Hayden (AUS)  380  622 437 86.95    1     v Zimbabwe
## 3          BC Lara (WI)  375  766 538 69.70    1      v England
## 4 DPMD Jayawardene (SL)  374  752 572 65.38    2 v South Africa
## 5        GS Sobers (WI)  365  614  NA    NA    2     v Pakistan
## 6        L Hutton (ENG)  364  797 847 42.97    1    v Australia
##          Ground  Start.Date Country Notout       Date Year Day Month Yday
## 1     St John's 10 Apr 2004      WI   TRUE 2004-04-10 2004  10     4  101
## 2         Perth  9 Oct 2003     AUS        2003-10-09 2003   9    10  282
## 3     St John's 16 Apr 1994      WI        1994-04-16 1994  16     4  106
## 4 Colombo (SSC) 27 Jul 2006      SL        2006-07-27 2006  27     7  208
## 5      Kingston 26 Feb 1958      WI   TRUE 1958-02-26 1958  26     2   57
## 6      The Oval 20 Aug 1938     ENG        1938-08-20 1938  20     8  232
##   Fours Sixs type
## 1    43    4 Test
## 2    38   11 Test
## 3    45    0 Test
## 4    43    1 Test
## 5    38    0 Test
## 6    35    0 Test

data(innings)
head(innings)
##          Team  Score Overs  RPO Lead Inns Result    Opposition
## 1   Sri Lanka 952/6d 271.0 3.51  415    2   draw       v India
## 2     England 903/7d 335.2 2.69  903    1    won   v Australia
## 3     England    849 258.2 3.28  849    1   draw v West Indies
## 4 West Indies 790/3d 208.1 3.79  462    2    won    v Pakistan
## 5    Pakistan 765/6d 248.5 3.07  121    2   draw   v Sri Lanka
## 6   Sri Lanka 760/7d 202.4 3.75  334    2   draw       v India
##          Ground  Start.Date Total       Date Year Day Month Yday type
## 1 Colombo (RPS)  2 Aug 1997   952 1997-08-02 1997   2     8  214 Test
## 2      The Oval 20 Aug 1938   903 1938-08-20 1938  20     8  232 Test
## 3      Kingston  3 Apr 1930   849 1930-04-03 1930   3     4   93 Test
## 4      Kingston 26 Feb 1958   790 1958-02-26 1958  26     2   57 Test
## 5       Karachi 21 Feb 2009   765 2009-02-21 2009  21     2   52 Test
## 6     Ahmedabad 16 Nov 2009   760 2009-11-16 2009  16    11  320 Test
##   declared
## 1     TRUE
## 2     TRUE
## 3    FALSE
## 4     TRUE
## 5     TRUE
## 6     TRUE
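One thing to watch in the bowling table is that Overs is not a true decimal: 51.2 means 51 completed overs and 2 balls. The sketch below converts it to balls, assuming that BPO is the number of balls per over (my reading of the column, which is not documented here). The helper names are hypothetical and not part of the package. For the rows shown, runs per six balls comes out close to the Econ column (about 1.03 and 2.33 for the first and third rows), which suggests that Econ is normalised to six-ball overs.

```r
# Hypothetical helpers, not part of crickdata.
# Overs are written as "completed overs.balls", so 51.2 is 51 overs and 2 balls.
# Assumption: BPO is the number of balls per over in force at the time.
overs_to_balls <- function(overs, bpo) {
  whole <- floor(overs)
  extra <- round((overs - whole) * 10)  # the digit after the point counts balls
  whole * bpo + extra
}

# Runs conceded per six balls, whatever the over length was
economy <- function(runs, overs, bpo) {
  runs / (overs_to_balls(overs, bpo) / 6)
}

# First and third rows of head(bowling)
overs_to_balls(c(51.2, 14.2), c(6, 5))   # 308 and 72 balls
economy(c(53, 28), c(51.2, 14.2), c(6, 5))
```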

Updating

I also added three functions for obtaining the latest data. They all take two arguments: the year, and the number of additional pages (each page holds fifty records) needed to obtain all the data added since 5 August 2017.

d <- latest_batting(2017, npages = 1)
head(d)
##              Player Runs Mins  BF 4s 6s     SR Inns  Opposition    Ground
## 1  S Dhawan (INDIA)  119    - 123 17  0  96.74    1 v Sri Lanka Pallekele
## 2 HH Pandya (INDIA)  108    -  96  8  7 112.50    1 v Sri Lanka Pallekele
## 3  KL Rahul (INDIA)   85    - 135  8  0  62.96    1 v Sri Lanka Pallekele
## 4 LD Chandimal (SL)   48    -  87  6  0  55.17    2     v India Pallekele
## 5   V Kohli (INDIA)   42    -  84  3  0  50.00    1 v Sri Lanka Pallekele
## 6  R Ashwin (INDIA)   31    -  75  1  0  41.33    1 v Sri Lanka Pallekele
##    Start Date Country Notout       Date Year Day Month Yday Fours Sixs
## 1 12 Aug 2017   INDIA        2017-08-12 2017  12     8  224    17    0
## 2 12 Aug 2017   INDIA        2017-08-12 2017  12     8  224     8    7
## 3 12 Aug 2017   INDIA        2017-08-12 2017  12     8  224     8    0
## 4 12 Aug 2017      SL        2017-08-12 2017  12     8  224     6    0
## 5 12 Aug 2017   INDIA        2017-08-12 2017  12     8  224     3    0
## 6 12 Aug 2017   INDIA        2017-08-12 2017  12     8  224     1    0
##   type
## 1 Test
## 2 Test
## 3 Test
## 4 Test
## 5 Test
## 6 Test
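The page-by-page pattern behind the npages argument can be sketched roughly as follows. This is not the code in the package: collect_pages and fake_page are hypothetical names, and the page source is a stand-in that serves fake records fifty at a time instead of downloading anything.

```r
# Hypothetical sketch: collect fixed-size pages into one data frame.
# `fetch_page` stands in for whatever downloads and parses page i;
# it is not a function from crickdata.
collect_pages <- function(fetch_page, npages) {
  pages <- lapply(seq_len(npages), fetch_page)
  do.call(rbind, pages)
}

# Stand-in page source: 120 fake records served fifty at a time
fake_records <- data.frame(id = 1:120, Runs = rep(c(10L, 20L, 30L), 40))
fake_page <- function(i, page_size = 50) {
  first <- (i - 1) * page_size + 1
  last <- min(i * page_size, nrow(fake_records))
  fake_records[first:last, ]
}

# Number of fifty-record pages needed to cover every record
npages <- ceiling(nrow(fake_records) / 50)
all_records <- collect_pages(fake_page, npages)
nrow(all_records)
```

Asking for too few pages silently drops the oldest of the new records, so it is safer to round the page count up.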

The same goes for latest_bowling() and latest_innings(). These are very messy functions, as can be seen from their source, but it’s not worth investing any more time tidying them up: they work at the moment, and the site itself is likely to change its interface.

The data can be merged with the main data set using dplyr.

library(dplyr)
data(batting)
## Note that there was no time in the latest data ("-"), which causes a problem again!
## Mark it as missing so that the column is numeric and the rows can be bound.
d$Mins <- NA_real_
batting <- bind_rows(d, batting)