R, Statistics

Fix the encoding of a data frame

People in English-speaking countries may not realize it, but character encoding is a real nuisance. Fortunately, R has the iconv() function for converting text encoding, but what about data frames? For various reasons, a data frame can end up with an encoding different from the one the console uses. How do we fix that?
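For a single character vector, the conversion with iconv() looks roughly like the sketch below. The sample string and the latin1 target are assumptions chosen only for illustration; they are not the encodings discussed later in this post.

```r
# "cafe" with an acute accent, written as an escape so the source stays ASCII;
# enc2utf8() makes sure the string is stored as UTF-8.
x <- enc2utf8("caf\u00e9")

# Convert the bytes from UTF-8 to latin1 (illustrative target only).
y <- iconv(x, from="UTF-8", to="latin1")

charToRaw(x)   # UTF-8 stores the accented letter in two bytes
charToRaw(y)   # latin1 stores it in one byte
Encoding(y)    # how R has marked the converted string
```

The same call applied directly to a whole data frame does not give back a data frame, which is why something more is needed.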

According to this page, the best way is as follows: write the entire data frame to a temporary file, and then read it back with the correct encoding. I wrote a simple function to help convert the encoding of a data frame.

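toLocalEncoding <-
function(x, sep="\t", quote=FALSE, encoding="utf-8")
{
  rawtsv <- tempfile()
  on.exit(unlink(rawtsv))  # delete the temporary file when done
  write.table(x, file=rawtsv, sep=sep, quote=quote)
  # read.table() takes the quote characters as a string, not a logical
  result <- read.table(file(rawtsv, encoding=encoding), sep=sep, quote=if (quote) "\"" else "")
  result
}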

This function writes the data frame to a temporary file as tab-separated text, and then reads it back with the specified encoding to restore the data frame.

library(XML)  # provides readHTMLTable()
ibaraki <- toLocalEncoding(readHTMLTable("http://www.taiki.pref.ibaraki.jp/data.asp", which=6, skip.rows=1, stringsAsFactors=FALSE, encoding="shift-jis"), encoding="utf-8")

On Japanese Windows, the readHTMLTable function produces a UTF-8 encoded data frame, which is a problem because the OS uses the cp932 encoding. The toLocalEncoding function is a simple way out of that trouble.
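As a possible alternative that skips the temporary file, one could apply iconv() to each character column. This is only a sketch: the helper name convertColumns and the sample data are hypothetical and not part of the function above, and factor columns are left untouched.

```r
# Hypothetical helper (not from the original function): re-encode every
# character column of a data frame with iconv(), leaving other columns alone.
convertColumns <- function(x, from="UTF-8", to="") {
  isChar <- vapply(x, is.character, logical(1))
  if (any(isChar)) {
    x[isChar] <- lapply(x[isChar], iconv, from=from, to=to)
  }
  x
}

# Small made-up data frame with one UTF-8 text column.
df <- data.frame(id=1:2,
                 name=enc2utf8(c("caf\u00e9", "na\u00efve")),
                 stringsAsFactors=FALSE)

# to="latin1" is used here only so the result is the same on any machine;
# the default to="" targets the encoding of the current locale.
converted <- convertColumns(df, from="UTF-8", to="latin1")
charToRaw(converted$name[1])
```

On a Japanese Windows machine the call would presumably be convertColumns(df, from="UTF-8"), letting to="" pick up the local cp932 encoding.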


This entry was posted on April 15, 2013.