While R uses UTF-8 as the default encoding on Linux and Mac OS, R for Windows does not. So reading and writing UTF-8 files is somewhat troublesome on Windows. In this article, I will show you a small script to help with UTF-8 encoding.
On Mac OS (Japanese),
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"
On Windows (Japanese),
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
On Mac OS (Russian),
> Sys.getlocale('LC_CTYPE')
[1] "ru_RU.UTF-8"
On Windows (Russian),
> Sys.getlocale('LC_CTYPE')
[1] "Russian_Russia.1251"
The “LC_CTYPE” locale controls the encoding R uses when writing a text file. As shown above, there is a clear architectural difference between Windows and Mac OS.
Let me put it as simply as I can. Windows chooses one of many separate character sets, whereas Linux and Mac OS choose one language subset of the single UTF-8 set. Because of this difference, Windows forgets the characters of the unselected languages, while the other OSs remember the characters of all languages.
When text is written to a file, characters outside the selected locale's language cannot be handled. Some are converted into a similar (but incorrect!) character, and others are written in an escaped format such as <U+222D>.
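As a rough illustration (a sketch, not part of the original test), iconv can show what happens when a character falls outside the target code page; here Windows-1251 (Russian) has no triple integral character, so the conversion fails:

```r
x <- '\u222D'  # U+222D TRIPLE INTEGRAL, not representable in Windows-1251

# With the default sub=NA, a non-convertible character makes the result NA
iconv(x, from = 'UTF-8', to = 'CP1251')

# With sub='byte', each non-convertible byte is written as a hex escape
iconv(x, from = 'UTF-8', to = 'CP1251', sub = 'byte')
```

This mirrors what happens on Windows: the character simply has no representation in the selected code page.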
Note that R is not responsible for this problem; it is caused by the OS's architecture for switching languages.
On Mac and Linux, the available locales are shown by the command “locale -a”. The corresponding information for Windows is harder to pin down: because Microsoft organizes it by product version, the lifetime of the URLs is rather short.
http://msdn.microsoft.com/en-us/library/hzz3tw78(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/39cwe7zf(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/cdax410z(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/2x8et5ee(v=vs.80).aspx
http://msdn.microsoft.com/en-us/library/aa288104(v=vs.71).aspx
This page gathers the information nicely.
http://docs.moodle.org/dev/Table_of_locales
Let’s check what is going on.
Because I do not have a high-grade Windows PC that covers multiple languages, I will check this on a Mac. Mac OS has many built-in locales covering not only many languages but also Windows-specific encodings, so I can use the Mac to simulate the multi-language environments of Windows. Note that the locale keywords used to specify the same encoding differ between Mac and Windows.
The method is as follows.
A web browser (like Safari) is the best tool to view the result files, because it is designed to display text in encodings from all over the world.
I tried the following script to perform this test.
X <- c('0052', '0053', '0054', '00AE', '011B', 'FF32', '211D', '222D', '25C9',
       '03B1', '042F', '3042', '4E16', '8FDB')
LOCALES <- c('ja_JP.UTF-8', 'en_US.UTF-8', 'C', 'ja_JP.SJIS', 'en_US.US-ASCII',
             'ru_RU.CP1251', 'cs_CZ.ISO8859-2')

# Build one Unicode string from a vector of hex code points
toUnicodeString <- function(x) {
  paste(sapply(x, function(a) eval(parse(text=paste('"\\u', a, '"', sep='')))),
        collapse='')
}

# Write the test string to a file under the given locale
fileByLocale <- function(x, locale, file) {
  con <- file(file, open='wt')
  writeLines(Sys.setlocale("LC_CTYPE", locale), con)
  writeLines(toUnicodeString(x), con)
  close(con)
  Sys.setlocale("LC_CTYPE")  # reset to the system default
}

# Repeat the test for each locale, naming the output file after the locale
scanLocales <- function(x, locales, hint='LC') {
  for (locale in locales) {
    print(locale)
    print(file <- paste(hint, gsub('[^A-Za-z0-9]', '', locale), '.txt', sep=''))
    fileByLocale(x, locale, file)
  }
}

scanLocales(X, LOCALES)
I changed only “LC_CTYPE”, not the entire “LC_ALL”, to minimize the impact. The Unicode string is regenerated from the raw code points in every locale. The first 2 locales are for Mac OS. The 3rd, “C”, covers ASCII characters only and is used commonly on Linux, Mac OS, and Windows. The 4th and later are for emulating Windows.
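For reference, a quick sanity check of the code-point helper (the same toUnicodeString as in the script, repeated here so the snippet runs standalone):

```r
# Build one Unicode string from a vector of hex code points
toUnicodeString <- function(x) {
  paste(sapply(x, function(a) eval(parse(text = paste('"\\u', a, '"', sep = '')))),
        collapse = '')
}

# The first three code points of X are plain ASCII letters
toUnicodeString(c('0052', '0053', '0054'))  # "RST"
```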
As shown above, when R writes UTF-8 text to a file on Windows, characters of unsupported languages are modified. In contrast, all characters are written correctly on Mac OS.
There is a solution to this problem: writing a binary file instead of a text file. All applications that handle UTF-8 files on Windows use the same trick.
A BOM should not be used in UTF-8 files; this is what Linux and Mac OS do. But Windows Notepad and some other applications use a BOM. So handling the BOM is necessary, even though it is formally wrong.
The BOM is a 3-byte marker placed at the beginning of a text file; because R does not use the BOM, it should be removed on reading.
BOM <- charToRaw('\xEF\xBB\xBF')
writeUtf8 <- function(x, file, bom=F) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian="little")   # optional BOM for Notepad
  writeBin(charToRaw(x), con, endian="little")   # raw UTF-8 bytes
  close(con)
}
Specify a UTF-8 string as x= and a file name to write as file=. If you want to read the file only with Windows Notepad, adding a BOM with the bom=T option is a good choice. Note that this is a minimal script and is not meant to write a very large file.
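For example, here is a quick round trip (the definitions are repeated so the snippet runs standalone; the temporary file name is just for illustration):

```r
BOM <- charToRaw('\xEF\xBB\xBF')

# Same minimal writer as above: write raw UTF-8 bytes to a binary connection
writeUtf8 <- function(x, file, bom = FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")
  close(con)
}

f <- tempfile(fileext = ".txt")      # throwaway file for the demo
writeUtf8('\u3042\u4E16', f)         # two characters, 3 UTF-8 bytes each
file.info(f)$size                    # 6 bytes, no BOM

writeUtf8('\u3042\u4E16', f, bom = TRUE)
file.info(f)$size                    # 9 bytes: 3-byte BOM + 6
```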
Reading a UTF-8 file is easy, because functions like readLines have an encoding= option that accepts "UTF-8".
readUtf8Text <- function(file) {
  con <- file(file, 'rt')
  result <- readLines(con, encoding='UTF-8')  # mark the input as UTF-8
  close(con)
  result
}
If you want to read a UTF-8 file saved by a standard Windows application like Notepad, you may run into trouble. Because Windows Notepad prepends a BOM when writing a UTF-8 file, you must remove the BOM in R, or it will appear as a corrupted character at the beginning of the string. R 3.0.0 now supports the UTF-8-BOM encoding, which removes the BOM for you. However, if you want to keep using R 2.15.3 for a while, you must remove the BOM manually. The following code reads a UTF-8 file as binary and removes the BOM. Note that this is a minimal script and is not meant to read a very large file.
readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian="little")
  close(con)
  # Skip the 3-byte BOM if present (guard against files shorter than 3 bytes)
  pstart <- if (size >= 3 && all(x[1:3] == BOM)) 4 else 1
  pend <- length(x)
  rawToChar(x[pstart:pend])
}
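Putting the two halves together, a full round trip through a BOM-prefixed file looks like this (definitions repeated so the snippet runs standalone; assumes a UTF-8 locale when converting the bytes back to a string):

```r
BOM <- charToRaw('\xEF\xBB\xBF')

# Minimal writer: raw UTF-8 bytes, optionally prefixed with a BOM
writeUtf8 <- function(x, file, bom = FALSE) {
  con <- file(file, "wb")
  if (bom) writeBin(BOM, con, endian = "little")
  writeBin(charToRaw(x), con, endian = "little")
  close(con)
}

# Minimal reader: read all bytes and strip a leading BOM if present
readUtf8 <- function(file) {
  size <- file.info(file)$size
  con <- file(file, "rb")
  x <- readBin(con, raw(), size, endian = "little")
  close(con)
  pstart <- if (size >= 3 && all(x[1:3] == BOM)) 4 else 1
  rawToChar(x[pstart:length(x)])
}

f <- tempfile(fileext = ".txt")            # throwaway file for the demo
writeUtf8('\u0420\u3042', f, bom = TRUE)   # Cyrillic ER + Hiragana A, with BOM
readUtf8(f)                                # BOM stripped, original string back
```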