ЯтомизоnoR

R, Statistics

Write file as UTF-8 encoding in R for Windows

While R uses UTF-8 as its default encoding on Linux and Mac OS, R for Windows does not. So reading and writing UTF-8 files is somewhat troublesome on Windows. In this article, I will show a small script that helps with UTF-8 encoding.

What is the locale

On Mac OS X (Japanese),

> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"

On Windows (Japanese),

> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"

On Mac OS X (Russian),

> Sys.getlocale('LC_CTYPE')
[1] "ru_RU.UTF-8"

On Windows (Russian),

> Sys.getlocale('LC_CTYPE')
[1] "Russian_Russia.1251"

The “LC_CTYPE” locale controls the encoding R uses when it writes a text file. As shown above, there is a clear architectural difference between Windows and Mac OS.
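To check the same thing from inside a script, the following minimal sketch reports the current “LC_CTYPE” value and whether R considers the session a UTF-8 locale. It uses only base R; the variable names ctype and info are just for illustration.

```r
# Report the character-type locale and whether it is a UTF-8 locale.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")

# On Windows, l10n_info() also reports the code page in use.
if (!is.null(info$codepage)) cat("Code page:", info$codepage, "\n")
```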

Difference between Windows and other OSs

I will try to put this as simply as I can. Windows selects one of many language-specific character sets (code pages), whereas Linux and Mac OS select one language within a single UTF-8 character set. Because of this difference, Windows loses the characters of languages that were not selected, while the other OSs keep the characters of all languages.

Problem on Windows

When text is written to a file, characters from languages outside the selected locale cannot be handled. Some of them are converted into a similar (but incorrect!) character, and others are written in an escaped format such as <U+222D>.

Note that R is not responsible for this problem; it comes from the way the OS switches between languages.
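The loss can be illustrated on any OS with iconv(), which converts a string to a narrower encoding. This is only a sketch of the conversion step, not a trace of what R does internally when it writes a file; plain ASCII is used here as a stand-in for a limited Windows code page.

```r
# A string with a Latin, a Cyrillic and a Japanese character, as UTF-8.
x <- "R\u042F\u3042"

# Converting to plain ASCII fails for the whole string: the result is NA.
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "byte", each unconvertible byte is replaced by its hex code
# in angle brackets, so the output is no longer the original text.
escaped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

print(strict)
print(escaped)
```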

Locales

On Mac and Linux, the available locales are listed by the command “locale -a”. Information for Windows is harder to find, and therefore valuable. Because Microsoft organizes the information by product version, the lifetime of the URLs is rather short.

http://msdn.microsoft.com/en-us/library/hzz3tw78(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/39cwe7zf(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/cdax410z(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/2x8et5ee(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/aa288104(v=vs.71).aspx

This page gathers the information nicely.
http://docs.moodle.org/dev/Table_of_locales

Experiment

Let’s check what is going on.

Because I do not have a high-end Windows PC that covers multiple languages, I ran this check on a Mac. Mac OS has many built-in locales that cover not only many languages but also Windows-specific encodings, so I can use the Mac to simulate the multi-language environments of Windows. Note that the locale names that specify the same encoding differ between Mac and Windows.
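Which of these locales exist depends on the machine, so it can help to probe them first. The sketch below relies on Sys.setlocale() returning an empty string when a request cannot be honored; localeAvailable is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: TRUE if this system accepts `locale` for LC_CTYPE.
localeAvailable <- function(locale) {
  old <- Sys.getlocale("LC_CTYPE")
  # Always switch back to the locale that was active before the probe.
  on.exit(Sys.setlocale("LC_CTYPE", old))
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  nzchar(res)
}

candidates <- c("C", "ja_JP.SJIS", "ru_RU.CP1251", "cs_CZ.ISO8859-2")
print(sapply(candidates, localeAvailable))
```

The result differs between systems; only “C” can be expected to be available everywhere.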

The methods are as below.

  1. Set a locale to simulate a Windows encoding.
  2. Write a short UTF-8 string into a file.
  3. View the file with Safari.

A web browser (like Safari) is the best tool for viewing the result files, because it is designed to display text in encodings from all over the world.

I tried the following script to perform this test.

X <- c('0052', '0053', '0054', '00AE', '011B', 'FF32', '211D',
       '222D', '25C9', '03B1', '042F', '3042', '4E16', '8FDB')

LOCALES <- c('ja_JP.UTF-8', 'en_US.UTF-8', 'C', 
             'ja_JP.SJIS', 'en_US.US-ASCII', 
             'ru_RU.CP1251', 'cs_CZ.ISO8859-2')

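# Build one string from hex code points by evaluating "\uXXXX" literals.
toUnicodeString <- function(x) {
  paste(sapply(x, function(a) 
               eval(parse(text=paste('"\\u', a, '"', sep='')))), collapse='')
}

# Switch LC_CTYPE, write the locale name and the test string to a file,
# then reset LC_CTYPE to the system default.
fileByLocale <- function(x, locale, file) {
  con <- file(file, open='wt')
  writeLines(Sys.setlocale("LC_CTYPE", locale), con)
  writeLines(toUnicodeString(x), con)
  close(con)
  Sys.setlocale("LC_CTYPE")
}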

scanLocales <- function(x, locales, hint='LC') {
  for(locale in locales) {
    print(locale)
    print(file <- paste(hint, gsub('[^A-Za-z0-9]', '', locale), '.txt', sep=''))
    fileByLocale(x, locale, file)
  }
}

scanLocales(X, LOCALES)

Fig. 1. Unicode string

I changed only “LC_CTYPE”, not the entire “LC_ALL”, to minimize the impact. The Unicode string is regenerated from its code points in every locale. The first two locales are those of Mac OS. The third, “C”, handles only ASCII characters and is common to Linux, Mac OS, and Windows. The fourth and later locales emulate Windows.
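As an aside, the same string can be built without eval(parse()). This is an alternative sketch using base R's strtoi() and intToUtf8(), not the script used for the figures; the vector codes is a shortened stand-in for X.

```r
# Hex code points, as in the vector X above (shortened here).
codes <- c("0052", "00AE", "042F", "3042")

# strtoi() parses the hex strings; intToUtf8() builds one UTF-8 string.
s <- intToUtf8(strtoi(codes, base = 16L))

# Round trip: the string decodes back to the same code points.
print(utf8ToInt(s))
print(Encoding(s))
```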

Fig. 2. Results

As shown above, when R writes UTF-8 text to a file on Windows, characters of unsupported languages are altered. In contrast, all characters are written correctly on Mac OS.

Using binary

There is a solution to this problem: write a binary file instead of a text file. Applications that handle UTF-8 files on Windows use the same trick.
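The idea can be checked with a short round trip. The sketch below, which uses a temporary file and illustrative variable names, writes the UTF-8 bytes of a string in binary mode and confirms that the bytes on disk are unchanged, whatever the current locale is.

```r
# A UTF-8 string mixing Latin, Cyrillic and Japanese characters.
s <- enc2utf8("R \u042F \u3042")
path <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes directly; no locale conversion is involved.
con <- file(path, "wb")
writeBin(charToRaw(s), con)
close(con)

# Read the bytes back and compare them with the original.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(identical(bytes, charToRaw(s)))
unlink(path)
```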

BOM

The BOM should not be used in UTF-8 files, and Linux and Mac OS do not use it. But Windows Notepad and some other applications do. So handling the BOM is necessary, even though using it is formally discouraged.

The BOM is a 3-byte sequence placed at the beginning of a text file. Because R does not use the BOM, it should be removed on reading.

# The UTF-8 BOM as three raw bytes.
BOM <- as.raw(c(0xEF, 0xBB, 0xBF))
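Before deciding how to read a file, it may be useful to test whether it starts with these three bytes. The following is a minimal sketch; hasUtf8Bom is a hypothetical helper name, and the two temporary files exist only to demonstrate it.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes.
hasUtf8Bom <- function(path) {
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  head3 <- readBin(path, what = "raw", n = 3L)
  length(head3) == 3L && all(head3 == bom)
}

# Try it on two temporary files, one with and one without a BOM.
withBom <- tempfile()
noBom <- tempfile()
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("abc")), withBom)
writeBin(charToRaw("abc"), noBom)
print(c(hasUtf8Bom(withBom), hasUtf8Bom(noBom)))
```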

Write UTF-8 file

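writeUtf8 <- function(x, file, bom=FALSE) {
  con <- file(file, "wb")
  on.exit(close(con))
  # The BOM is optional; some Windows applications expect it.
  if(bom) writeBin(BOM, con)
  # Convert to UTF-8 first, then write the bytes untouched by the locale.
  writeBin(charToRaw(enc2utf8(x)), con)
}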

Pass the UTF-8 string as x= (a single string, because charToRaw() converts only one string) and the name of the file to write as file=. If the file will only be read with Windows Notepad, adding a BOM with the bom=TRUE option is a good choice. Note that this is a minimal script, not meant for writing very large files.

Read UTF-8 file

Reading a UTF-8 file is easy, because functions like readLines have an encoding= option that accepts UTF-8.

readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  # The encoding name is case-sensitive: 'UTF-8' marks the result as UTF-8.
  result <- readLines(con, encoding='UTF-8')
  close(con)
  result
}

If you want to read a UTF-8 file saved by a standard Windows application like Notepad, you may run into trouble. Because Windows Notepad prepends a BOM when writing a UTF-8 file, you must remove the BOM in R; otherwise it will appear as a corrupted character at the beginning of the string. R 3.0.0 and later support the UTF-8-BOM encoding, which removes the BOM. However, if you want to keep using R 2.15.3 for a while, you must remove the BOM manually. The following code reads a UTF-8 file as binary and removes the BOM. Note that this is a minimal script, not meant for reading very large files.

readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size)
  close(con)
  # Drop the leading BOM, if any (also safe for files shorter than 3 bytes).
  if(length(x) >= 3 && all(x[1:3] == BOM)) x <- x[-(1:3)]
  s <- rawToChar(x)
  Encoding(s) <- "UTF-8"  # mark the bytes as UTF-8
  s
}
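On R 3.0.0 or later, the UTF-8-BOM encoding can be passed to file() instead of removing the BOM by hand. The sketch below creates a small temporary file with a BOM and reads it back; ASCII content is used so that the result does not depend on the locale.

```r
# Create a small file that starts with a BOM, as Notepad would.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("hello\n")), path)

# encoding = "UTF-8-BOM" (R >= 3.0.0) drops the BOM while reading.
con <- file(path, open = "rt", encoding = "UTF-8-BOM")
lines <- readLines(con)
close(con)

print(lines)
print(nchar(lines, type = "bytes"))
unlink(path)
```

Note that a connection opened with an encoding re-encodes the text into the native encoding, so on a Windows code page the characters of unselected languages may still be lost; the binary approach avoids that step.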

3 comments on “Write file as UTF-8 encoding in R for Windows”

  1. Pingback: Pitfall of XML package: issues specific to cp932 locale, Japanese Shift-JIS, on Windows | ЯтомизоnoR

  2. Aik
    April 21, 2015

    Thanks for the post. Was having an issue with a script that I got off Github (kept saving automatically as ANSI which led to issues reading the file) and your post helped me to solve the problem.

  3. Pingback: R | Annotary

Information

This entry was posted on April 17, 2013.