While R uses UTF-8 as the default encoding on Linux and Mac OS, R for Windows does not. So reading and writing UTF-8 files is somewhat troublesome on Windows. In this article, I will show you a small script to help with UTF-8 encoding.
On Mac OS (Japanese),
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"
On Windows (Japanese),
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
On Mac OS (Russian),
> Sys.getlocale('LC_CTYPE')
[1] "ru_RU.UTF-8"
On Windows (Russian),
> Sys.getlocale('LC_CTYPE')
[1] "Russian_Russia.1251"
The “LC_CTYPE” locale controls the encoding R uses when writing a text file. As shown above, there is a clear architectural difference between Windows and Mac OS.
Let me put it as simply as I can. Windows chooses one of many separate character sets, whereas Linux and Mac OS choose one language subset of the single UTF-8 set. Because of this difference, Windows forgets the characters of the unselected languages, while the other OSs remember the characters of all languages.
When text is written to a file, characters outside the selected locale's language cannot be handled. Some are converted into a similar (but incorrect!) character, and others are written in an escaped format such as <U+222D>.
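As a rough illustration (a sketch, not part of the original test), iconv can show what happens when a character falls outside the target code page; here Windows-1251 (Russian) has no triple integral character, so the conversion fails:

```r
x <- '\u222D'  # U+222D TRIPLE INTEGRAL, not representable in Windows-1251

# With the default sub=NA, a non-convertible character makes the result NA
iconv(x, from = 'UTF-8', to = 'CP1251')

# With sub='byte', each non-convertible byte is written as a hex escape
iconv(x, from = 'UTF-8', to = 'CP1251', sub = 'byte')
```

This mirrors what happens on Windows: the character simply has no representation in the selected code page.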
Note that R is not responsible for this problem; it is caused by the OS's architecture for switching languages.
On Mac and Linux, the available locales are shown by the command “locale -a”. The corresponding information for Windows is harder to pin down: because Microsoft organizes it by product version, the lifetime of the URLs is rather short.
http://msdn.microsoft.com/en-us/library/hzz3tw78(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/39cwe7zf(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/cdax410z(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/2x8et5ee(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/aa288104(v=vs.71).aspx
This page gathers the information nicely.
http://docs.moodle.org/dev/Table_of_locales
Let’s check what is going on.
Because I do not have a high-grade Windows PC that covers multiple languages, I will check this on a Mac. Mac OS has many built-in locales covering not only many languages but also Windows-specific encodings, so I can use the Mac to simulate the multi-language environments of Windows. Note that the locale keywords used to specify the same encoding differ between Mac and Windows.
The method is as follows.
A web browser (like Safari) is the best tool to view the result files, because it is designed to display text in encodings from all over the world.
I tried the following script to perform this test.
X <- c('0052', '0053', '0054', '00AE', '011B', 'FF32', '211D', '222D', '25C9',
       '03B1', '042F', '3042', '4E16', '8FDB')
LOCALES <- c('ja_JP.UTF-8', 'en_US.UTF-8', 'C', 'ja_JP.SJIS', 'en_US.US-ASCII',
             'ru_RU.CP1251', 'cs_CZ.ISO8859-2')

# Build one Unicode string from a vector of hex code points
toUnicodeString <- function(x) {
  paste(sapply(x, function(a) eval(parse(text=paste('"\\u', a, '"', sep='')))),
        collapse='')
}

# Write the test string to a file under the given locale
fileByLocale <- function(x, locale, file) {
  con <- file(file, open='wt')
  writeLines(Sys.setlocale("LC_CTYPE", locale), con)
  writeLines(toUnicodeString(x), con)
  close(con)
  Sys.setlocale("LC_CTYPE")  # reset to the system default
}

# Repeat the test for each locale, naming the output file after the locale
scanLocales <- function(x, locales, hint='LC') {
  for (locale in locales) {
    print(locale)
    print(file <- paste(hint, gsub('[^A-Za-z0-9]', '', locale), '.txt', sep=''))
    fileByLocale(x, locale, file)
  }
}

scanLocales(X, LOCALES)
I changed only “LC_CTYPE”, not the entire “LC_ALL”, to minimize the impact. The Unicode string is regenerated from the raw code points in every locale. The first 2 locales are for Mac OS. The 3rd, “C”, covers ASCII characters only and is used commonly on Linux, Mac OS, and Windows. The 4th and later are for emulating Windows.
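For reference, a quick sanity check of the code-point helper (the same toUnicodeString as in the script, repeated here so the snippet runs standalone):

```r
# Build one Unicode string from a vector of hex code points
toUnicodeString <- function(x) {
  paste(sapply(x, function(a) eval(parse(text = paste('"\\u', a, '"', sep = '')))),
        collapse = '')
}

# The first three code points of X are plain ASCII letters
toUnicodeString(c('0052', '0053', '0054'))  # "RST"
```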
As shown above, when R writes UTF-8 text to a file on Windows, characters of unsupported languages are modified. In contrast, all characters are written correctly on Mac OS.
There is a solution to this problem: writing a binary file instead of a text file. All applications that handle UTF-8 files on Windows use the same trick.
A BOM should not be used in UTF-8 files; this is what Linux and Mac OS do. But Windows Notepad and some other applications use a BOM. So handling the BOM is necessary, even though it is formally wrong.
The BOM is a 3-byte marker placed at the beginning of a text file; because R does not use the BOM, it should be removed on reading.
BOM <- charToRaw('\xEF\xBB\xBF')
writeUtf8 <- function(x, file, bom=F) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian="little")   # optional BOM for Notepad
  writeBin(charToRaw(x), con, endian="little")   # raw UTF-8 bytes
  close(con)
}
Specify a UTF-8 string as x= and a file name to write as file=. If you want to read the file only with Windows Notepad, adding a BOM with the bom=T option is a good choice. Note that this is a minimal script and is not meant to write a very large file.
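For example, here is a quick round trip (the definitions are repeated so the snippet runs standalone; the temporary file name is just for illustration):

```r
BOM <- charToRaw('\xEF\xBB\xBF')

# Same minimal writer as above: write raw UTF-8 bytes to a binary connection
writeUtf8 <- function(x, file, bom = FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")
  close(con)
}

f <- tempfile(fileext = ".txt")      # throwaway file for the demo
writeUtf8('\u3042\u4E16', f)         # two characters, 3 UTF-8 bytes each
file.info(f)$size                    # 6 bytes, no BOM

writeUtf8('\u3042\u4E16', f, bom = TRUE)
file.info(f)$size                    # 9 bytes: 3-byte BOM + 6
```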
Reading a UTF-8 file is easy, because functions like readLines have an encoding= option that accepts "UTF-8".
readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  result <- readLines(con, encoding='UTF-8')  # mark the input as UTF-8
  close(con)
  result
}
If you want to read a UTF-8 file saved by a standard Windows application like Notepad, you may run into trouble. Because Windows Notepad prepends a BOM when writing a UTF-8 file, you must remove the BOM in R, or it will appear as a corrupted character at the beginning of the string. R 3.0.0 now supports the UTF-8-BOM encoding, which removes the BOM for you. However, if you want to keep using R 2.15.3 for a while, you must remove the BOM manually. The following code reads a UTF-8 file as binary and removes the BOM. Note that this is a minimal script and is not meant to read a very large file.
readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian="little")
  close(con)
  # Skip the 3-byte BOM if present (guard against files shorter than 3 bytes)
  pstart <- if (size >= 3 && all(x[1:3] == BOM)) 4 else 1
  pend <- length(x)
  rawToChar(x[pstart:pend])
}
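Putting the two halves together, a full round trip through a BOM-prefixed file looks like this (definitions repeated so the snippet runs standalone; assumes a UTF-8 locale when converting the bytes back to a string):

```r
BOM <- charToRaw('\xEF\xBB\xBF')

# Minimal writer: raw UTF-8 bytes, optionally prefixed with a BOM
writeUtf8 <- function(x, file, bom = FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")
  close(con)
}

# Minimal reader: read all bytes and strip a leading BOM if present
readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian = "little")
  close(con)
  pstart <- if (size >= 3 && all(x[1:3] == BOM)) 4 else 1
  rawToChar(x[pstart:length(x)])
}

f <- tempfile(fileext = ".txt")            # throwaway file for the demo
writeUtf8('\u0420\u3042', f, bom = TRUE)   # Cyrillic ER + Hiragana A, with BOM
readUtf8(f)                                # BOM stripped, original string back
```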