ЯтомизоnoR

R, Statistics

How to get a data frame from html pages directly in R

Air pollution in China has been quite serious recently, and this spring it finally began reaching Japan.  Many municipal agencies in Japan have begun monitoring the pollutant, namely particulate matter 2.5 μm or less in diameter (PM 2.5), to protect residents' health.

I check the PM 2.5 concentration every day on the prefecture's web page (Fig. 1), because I do not want to jog outdoors when the concentration is high.  The problem is that the agency's web page carries too much information, which makes a quick check difficult.  So I want to convert the data into a nice-looking chart.

Fig. 1. Ibaraki Prefecture shows hourly air pollution data on the web


At first, I thought I would have to write an R program myself to do that.  But I was wrong.  Thanks to CRAN (http://cran.r-project.org) and its contributors, there is already a package that does it.

Here I show the procedure to get a web page, parse it, and finally draw a chart.

1. install package XML from CRAN

http://cran.r-project.org/web/packages/XML/

install.packages("XML")
library(XML)

2. parse the HTML tables at the specified URL

tables <- readHTMLTable("http://www.taiki.pref.ibaraki.jp/data.asp")

3. check the rough result and drill down

str(tables)
Fig. 2. parsing all tables by readHTMLTable
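
When a page has many tables, the str(tables) output gets long. One way to narrow it down, sketched below, is to list the dimensions of each element. The tables list here is a made-up stand-in for the result of readHTMLTable (the real page is not fetched), and the NULL entry is my assumption about what an empty table may look like.

```r
# Stand-in for the list returned by readHTMLTable(); all values are invented.
tables <- list(
  data.frame(V1 = "menu"),                       # a small layout table
  NULL,                                          # assumed: an empty table may come back as NULL
  data.frame(V1 = 1:3,
             V2 = c("a", "b", "c"),
             V3 = c("21", "-", "18"))            # the data table we are after
)

# One row per table: number of rows and columns (NA for NULL entries).
dims <- t(sapply(tables, function(tb) if (is.null(tb)) c(NA, NA) else dim(tb)))
colnames(dims) <- c("rows", "cols")
dims

# Index of the table with the most rows, a reasonable first candidate.
biggest <- which.max(dims[, "rows"])
biggest
```

The index found this way can then be passed to readHTMLTable as the which argument.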


I need the sixth table, and its first row must be ignored.
I must also specify the column types explicitly so that hyphens are converted to NA.

4. run the improved command

ibaraki <- readHTMLTable("http://www.taiki.pref.ibaraki.jp/data.asp", which = 6, skip.rows = 1, trim = TRUE,
                         colClasses = c("integer", "character", rep("numeric", 13)))
Fig. 3. parsing the sixth table by readHTMLTable
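
To see what which, skip.rows and colClasses do without depending on the live page, here is a small sketch on an inline HTML snippet. The HTML, the station names and the values are invented for illustration; only htmlParse and readHTMLTable come from the XML package.

```r
library(XML)

# Invented page: a layout table followed by a data table whose first row
# is a label row and whose missing reading is shown as a hyphen.
html <- '
<html><body>
<table>
  <tr><td>menu</td><td>home</td></tr>
  <tr><td>help</td><td>about</td></tr>
</table>
<table>
  <tr><td>No</td><td>Station</td><td>PM2.5</td></tr>
  <tr><td>1</td><td>Mito</td><td>21</td></tr>
  <tr><td>2</td><td>Tsukuba</td><td>-</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Without "which", every <table> is parsed and returned in a list.
tables <- readHTMLTable(doc)
length(tables)   # 2

# Pick the second table, skip its label row, and force the column types.
# Coercing "-" to numeric yields NA with a warning, which is silenced here.
pm <- suppressWarnings(
  readHTMLTable(doc, which = 2, skip.rows = 1,
                colClasses = c("integer", "character", "numeric"))
)
pm
```

The hyphen in the last column comes back as NA, which is what makes the is.na filtering in the next step possible.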


5. focus on the PM2.5 column and remove NAs

pm.2.5.ibaraki <- ibaraki[!is.na(ibaraki[, 12]), c(2, 12)]
pm.2.5.ibaraki[, 1] <- as.factor(as.character(pm.2.5.ibaraki[, 1]))
Fig. 4. observed value of PM 2.5
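
The same filtering can be tried on a toy data frame. This is only a sketch: the column names, station names and values below are made up, and the columns are picked by name instead of by position. Converting the station column through as.character and back to a factor presumably serves to drop levels of stations that were filtered out.

```r
# Invented stand-in for the parsed table.
toy <- data.frame(
  no      = 1:4,
  station = c("Mito", "Tsukuba", "Hitachi", "Koga"),
  pm25    = c(21, NA, 18, 30),
  stringsAsFactors = TRUE   # station names become a factor
)

# Keep only rows with an observed value, and only the two columns of interest.
pm <- toy[!is.na(toy$pm25), c("station", "pm25")]

# Rebuild the factor so that the removed station no longer appears as a level.
pm$station <- as.factor(as.character(pm$station))

nrow(pm)            # 3
levels(pm$station)
```

With a factor in the first column and numbers in the second, plot() on this two-column data frame draws one box per station.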


6. draw a chart

plot(pm.2.5.ibaraki)

Now, improve the look and enable Japanese fonts (see http://kohske.wordpress.com/2011/02/26/using-cjk-fonts-in-r-and-ggplot2/ for the font setting).  Note that quartzFonts() is available only on macOS.

quartzFonts(HiraMaru=quartzFont(rep("HiraMaruProN-W4", 4)))
par(family="HiraMaru")
plot(pm.2.5.ibaraki, main = "PM2.5 Air Pollution from China (μg/㎥)", cex.axis = 0.8)
Fig. 5. PM 2.5 plot
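
On systems without the quartz device, a similar chart can be written to a file instead. The following is a rough sketch with invented values, not the setup used for Fig. 5: it uses the pdf device, rotates the station labels with par(las = 2), and leaves the font family at its default.

```r
# Invented values standing in for pm.2.5.ibaraki.
pm <- data.frame(
  station = factor(c("Hitachi", "Koga", "Mito")),
  pm25    = c(18, 30, 21)
)

out <- tempfile(fileext = ".pdf")     # hypothetical output location
pdf(out, width = 6, height = 4)
par(las = 2, mar = c(6, 4, 3, 1))     # perpendicular axis labels, wider bottom margin
plot(pm, main = "PM2.5 (toy values)", xlab = "", ylab = "PM2.5", cex.axis = 0.8)
dev.off()

file.exists(out)
```

Rotated labels help when there are many stations on the horizontal axis.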





This entry was posted on March 26, 2013.