Emoji, accents, and international text (2024)

Character encoding

Before we can analyze a text in R, we first need to get its digital representation, a sequence of ones and zeros. In practice this works by first choosing an encoding for the text that assigns each character a numerical value, and then translating the sequence of characters in the text to the corresponding sequence of numbers specified by the encoding. Today, most new text is encoded according to the Unicode standard, specifically the 8-bit Unicode Transformation Format, UTF-8. Joel Spolsky gives a good overview of the situation in an essay from 2003.
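As a small illustration of this mapping, the following base R sketch takes a short string and shows, for each character, its numeric code and the eight bits that store it. The example string is arbitrary, and because it is pure ASCII each character occupies exactly one byte; that will not hold for all text.

```r
text <- "Jane"

# Split into characters and look up each character's numeric code.
chars <- strsplit(text, "")[[1]]
codes <- utf8ToInt(text)   # 74 97 110 101

# The bytes actually stored, shown in hexadecimal.
bytes <- charToRaw(text)   # 4a 61 6e 65

# Render each code as eight binary digits, most significant bit first.
to_bits <- function(code) {
  paste(rev(as.integer(intToBits(code))[1:8]), collapse = "")
}
bits <- vapply(codes, to_bits, character(1))

print(data.frame(char = chars, code = codes,
                 hex = as.character(bytes), bits = bits))
```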

The software community has mostly moved to UTF-8 as a standard for text storage and interchange, but there is still a large volume of text in other encodings. Whenever you read a text file into R, you should specify the encoding. If you don’t, R will assume a default (typically your operating system’s native encoding), and if that assumption is wrong, it will misinterpret the sequence of ones and zeros.
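One way to keep the encoding explicit, sketched below in base R, is to read the bytes unchanged and declare their encoding yourself. The scratch file and its contents are made up for illustration. Note that the encoding argument of readLines only marks the strings (as "latin1" or "UTF-8") and does not convert them, so enc2utf8 does the actual translation here.

```r
# Write three Latin-1 bytes (pound sign, "2", "0") plus a newline to a scratch file.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xa3, 0x32, 0x30, 0x0a)), path)

# Read the line and declare it to be Latin-1; the bytes are left untouched.
x <- readLines(path, encoding = "latin1")
Encoding(x)      # "latin1"

# Translate to UTF-8: the single byte 0xa3 becomes the two bytes c2 a3.
y <- enc2utf8(x)
charToRaw(y)     # c2 a3 32 30

unlink(path)
```

Declaring the encoding at the point where the file is read keeps the result independent of the platform's native encoding.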

We will demonstrate the difficulties of encodings with the text of Jane Austen’s novel Mansfield Park, provided by Project Gutenberg. We will download the text, then read in the lines of the novel.

# download the zipped text from a Project Gutenberg mirror
url <- "http://mirror.csclub.uwaterloo.ca/gutenberg/1/4/141/141.zip"
tmp <- tempfile()
download.file(url, tmp)
# read the text from the zip file
con <- unz(tmp, "141.txt", encoding = "UTF-8")
lines <- readLines(con)
close(con)

The unz function and other similar file connection functions have encoding arguments which, if left unspecified, default to assuming that text is encoded in your operating system’s native encoding. To ensure consistent behavior across all platforms (Mac, Windows, and Linux), you should set this option explicitly. Here, we set encoding = "UTF-8". This is a reasonable default, but it is not always appropriate. In general, you should determine the appropriate encoding value by looking at the file. Unfortunately, the file extension ".txt" is not informative, and could correspond to any encoding. However, if we look at lines 11 to 20 of the file's header, we see the following:

lines[11:20]

 [1] "Author: Jane Austen"
 [2] ""
 [3] "Release Date: June, 1994 [Etext #141]"
 [4] "Posting Date: February 11, 2015"
 [5] ""
 [6] "Language: English"
 [7] ""
 [8] "Character set encoding: ASCII"
 [9] ""
[10] "*** START OF THIS PROJECT GUTENBERG EBOOK MANSFIELD PARK ***"

The character set encoding is reported as ASCII, which is a subset of UTF-8. So, we should be in good shape.

Unfortunately, we run into trouble as soon as we try to process the text:

corpus::term_stats(lines) # produces an error

Error in corpus::term_stats(lines): argument entry 15252 is incorrectly marked as "UTF-8": invalid leading byte (0xA3) at position 36

The error message tells us that line 15252 contains an invalid byte.

lines[15252]

[1] "the command of her beauty, and her \xa320,000, any one who could satisfy the"

We might wonder if there are other lines with invalid data. We can find all such lines using the utf8_valid function:

lines[!utf8_valid(lines)]

[1] "the command of her beauty, and her \xa320,000, any one who could satisfy the"

So, there are no other invalid lines.

The offending byte in line 15252 is displayed as \xa3, an escape code for hexadecimal value 0xa3, decimal value 163. To understand why this is invalid, we need to learn more about UTF-8 encoding.
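The position and value reported in the error message can be checked by hand in base R. The sketch below rebuilds the offending line from its bytes (so that it runs without downloading the novel) and locates every byte outside the ASCII range; the variable names are chosen for illustration.

```r
# Rebuild the line byte by byte; 0xa3 sits where the currency symbol belongs.
line <- rawToChar(c(
  charToRaw("the command of her beauty, and her "),
  as.raw(0xa3),
  charToRaw("20,000, any one who could satisfy the")
))

# Positions of bytes outside the ASCII range 0x00 to 0x7f.
bytes <- charToRaw(line)
pos <- which(as.integer(bytes) > 0x7f)
pos                        # 36

# The same byte written in hexadecimal and in decimal.
as.character(bytes[pos])   # "a3"
as.integer(bytes[pos])     # 163
```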

UTF-8

ASCII

The smallest unit of data transfer on modern computers is the byte, a sequence of eight ones and zeros that can encode a number between 0 and 255 (hexadecimal 0x00 and 0xff). In the earliest character encodings, the numbers from 0 to 127 (hexadecimal 0x00 to 0x7f) were standardized in an encoding known as ASCII, the American Standard Code for Information Interchange. Here are the characters corresponding to these codes:

codes <- matrix(0:127, 8, 16, byrow = TRUE,
                dimnames = list(0:7, c(0:9, letters[1:6])))
ascii <- apply(codes, c(1, 2), intToUtf8)
# replace control codes with ""
ascii["0", c(0:6, "e", "f")] <- ""
ascii["1",] <- ""
ascii["7", "f"] <- ""
utf8_print(ascii, quote = FALSE)

  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0                      \a \b \t \n \v \f \r
1
2    !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3 0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4 @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5 P  Q  R  S  T  U  V  W  X  Y  Z  [  \\ ]  ^  _
6 `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7 p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~

The first 32 codes (the first two rows of the table) are special control codes, the most common of which, 0x0a, denotes a newline (\n). The special code 0x00 often denotes the end of the input, and R does not allow this value in character strings. Code 0x7f corresponds to a “delete” control.

When you call utf8_print, it uses the low-level utf8_encode subroutine to format control codes; those without a short escape such as \n or \t format as \uXXXX for four hexadecimal digits XXXX or as \UXXXXYYYY for eight hexadecimal digits XXXXYYYY:

utf8_print(intToUtf8(1:0x0f), quote = FALSE)

[1] \u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f

Compare the utf8_print output with the output of the base R print function:

print(intToUtf8(1:0x0f), quote = FALSE)

[1] \001\002\003\004\005\006\a\b\t\n\v\f\r\016\017

Base R formats control codes below 128 using octal escapes. There are some other differences between the two functions, which we will highlight below.

Latin-1

ASCII works fine for most text in English, but not for other languages. The Latin-1 encoding extends ASCII to many Western European languages by assigning the numbers 128 to 255 (hexadecimal 0x80 to 0xff) to other characters common in those languages. We can see these characters below.

codes <- matrix(128:255, 8, 16, byrow = TRUE,
                dimnames = list(c(8:9, letters[1:6]), c(0:9, letters[1:6])))
latin1 <- apply(codes, c(1, 2), intToUtf8)
# replace control codes with ""
latin1[c("8", "9"),] <- ""
utf8_print(latin1, quote = FALSE)

  0 1 2 3 4 5 6 7 8 9 a b c d e f
8
9
a   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
b ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
e à á â ã ä å æ ç è é ê ë ì í î ï
f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

As with ASCII, the first 32 codes in this range (0x80 to 0x9f, the first two rows of the table) are control codes. The others are characters common in Western European languages. Note that 0xa3, the invalid byte from Mansfield Park, corresponds to a pound sign in the Latin-1 encoding. Given the context of the byte:

lines[15252]

[1] "the command of her beauty, and her \xa320,000, any one who could satisfy the"

this is probably the right symbol. The text is probably encoded in Latin-1, not UTF-8 or ASCII as claimed in the file.

If you run into an error while reading text that claims to be ASCII, it is probably encoded as Latin-1. Note, however, that this is not the only possibility, and there are many other encodings. The iconvlist function will list the ones that R knows how to process:
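When the encoding is uncertain, one option is to decode the same bytes under a few candidate encodings and judge the results in context. The sketch below does this with iconv. The candidate list is an arbitrary choice for illustration, and names other than "latin1" and "UTF-8" may not be recognized on every platform, which is why failures are caught and reported as NA.

```r
# Bytes for the fragment "\xa320": 0xa3 followed by the ASCII digits "2" and "0".
x <- rawToChar(as.raw(c(0xa3, 0x32, 0x30)))

# Candidate source encodings (illustrative; availability varies by platform).
candidates <- c("latin1", "CP1252", "CP437")

decode_as <- function(enc) {
  # iconv() signals an error for an unsupported conversion and returns NA
  # for input it cannot convert; treat both cases as "no result".
  tryCatch(iconv(x, from = enc, to = "UTF-8"),
           error = function(e) NA_character_)
}

# A named character vector: one decoded string (or NA) per candidate.
decoded <- vapply(candidates, decode_as, character(1))
print(decoded)
```

Several encodings may decode the bytes without error, so a successful conversion is not proof that the guess is right; the surrounding text has to make sense.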

head(iconvlist(), n = 20)

 [1] "437"            "850"            "852"            "855"
 [5] "857"            "860"            "861"            "862"
 [9] "863"            "865"            "866"            "869"
[13] "ANSI_X3.4-1968" "ANSI_X3.4-1986" "ARABIC"         "ARMSCII-8"
[17] "ASCII"          "ASMO-708"       "ATARI"          "ATARIST"

UTF-8

With only 256 unique values, a single byte is not enough to encode every character. Multi-byte encodings allow for encoding more. UTF-8 encodes characters using between 1 and 4 bytes each and allows for up to 1,112,064 character codes. Most of these codes are currently unassigned, but every year the Unicode consortium meets and adds new characters. You can find a list of all of the characters in the Unicode Character Database. A listing of the Emoji characters is available separately.

Say you want to input the Unicode character with hexadecimal code 0x2603. You can do so in one of three ways:
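The base R sketch below makes these numbers concrete. It shows one character from each UTF-8 length class, shows how the pound sign is stored in UTF-8, and reproduces the count of 1,112,064 codes. It also suggests why the lone byte 0xa3 from Mansfield Park was rejected as an "invalid leading byte": in UTF-8, bytes whose bits begin with 10 are continuation bytes and cannot start a character. The particular characters are arbitrary examples.

```r
# One character per length class: "A", pound sign, snowman, grinning face.
codes <- c(0x41, 0xa3, 0x2603, 0x1f600)
chars <- intToUtf8(codes, multiple = TRUE)
n_bytes <- nchar(chars, type = "bytes")
n_bytes                               # 1 2 3 4

# In UTF-8 the pound sign is two bytes, not the single Latin-1 byte 0xa3.
pound <- charToRaw(intToUtf8(0xa3))
pound                                 # c2 a3

# The bits of 0xa3, most significant first: the leading "10" marks a
# continuation byte, which may only follow a multi-byte leading byte.
a3_bits <- paste(rev(as.integer(intToBits(0xa3))[1:8]), collapse = "")
a3_bits                               # "10100011"

# Code points run from 0 to 0x10ffff, minus the 2,048 surrogate codes
# (0xd800 to 0xdfff) that UTF-8 may not encode.
n_codes <- (0x10ffff + 1) - (0xdfff - 0xd800 + 1)
n_codes                               # 1112064
```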

"\u2603" # with \u + 4 hex digits

[1] "☃"

"\U00002603" # with \U + 8 hex digits

[1] "☃"

intToUtf8(0x2603) # from an integer

[1] "☃"

For characters above 0xffff, the first method won’t work. On Windows, a bug in the current version of R (fixed in R-devel) prevents using the second method.

When you try to print Unicode in R, the system will first try to determine whether the code is printable or not. Non-printable codes include control codes and unassigned codes. On Mac OS, R uses an outdated function to make this determination, so it is unable to print most emoji. The utf8_print function uses the most recent version (10.0.0) of the Unicode standard, and will print all Unicode characters supported by your system:

print(intToUtf8(0x1f600 + 0:79)) # base R

[1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"

utf8_print(intToUtf8(0x1f600 + 0:79)) # truncates to line width

[1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣…"

utf8_print(intToUtf8(0x1f600 + 0:79), chars = 500) # increase character limit

[1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

(Characters with codes above 0xffff, including most emoji, are not supported on Windows.)

The utf8 package provides the following utilities for validating, formatting, and printing UTF-8 characters:

as_utf8() attempts to convert character data to UTF-8, throwing an error if the data is invalid;
utf8_valid() tests whether character data is valid according to its declared encoding;
utf8_normalize() converts text to Unicode composed normal form (NFC), optionally applying case-folding and compatibility maps;
utf8_encode() encodes a character string, escaping all control characters, so that it can be safely printed to the screen;
utf8_format() formats a character vector by truncating to a specified character width limit or by left, right, or center justifying;
utf8_print() prints UTF-8 character data to the screen;
utf8_width() measures the display width of UTF-8 character strings (many emoji and East Asian characters are twice as wide as other characters).

The package does not provide a method to translate from another encoding to UTF-8 as the iconv() function from base R already serves this purpose.
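A few of these utilities can be tried on a tiny hand-made input, as in the sketch below. It assumes the utf8 package is installed and does nothing otherwise. The test strings are invented for illustration: one is deliberately mislabeled as UTF-8 so that validation fails, and the other is an "e" followed by a combining accent so that normalization has something to do.

```r
if (requireNamespace("utf8", quietly = TRUE)) {
  # A string declared to be UTF-8 that contains the lone byte 0xa3.
  bad <- rawToChar(as.raw(c(0xa3, 0x32, 0x30)))
  Encoding(bad) <- "UTF-8"
  x <- c("plain text", bad)

  # Which elements are valid under their declared encoding?
  ok <- utf8::utf8_valid(x)

  # as_utf8() refuses invalid input; catch the error to inspect its message.
  msg <- tryCatch(utf8::as_utf8(x), error = function(e) conditionMessage(e))

  # "e" plus a combining acute accent normalizes to the single code 0xe9.
  decomposed <- intToUtf8(c(0x65, 0x301))
  composed <- utf8::utf8_normalize(decomposed)

  print(ok)
  print(msg)
  print(utf8ToInt(composed))
}
```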

Translating to UTF-8

Back to our original problem: getting the text of Mansfield Park into R. Our first attempt failed:

corpus::term_stats(lines)

Error in corpus::term_stats(lines): argument entry 15252 is incorrectly marked as "UTF-8": invalid leading byte (0xA3) at position 36

We discovered a problem on line 15252:

lines[15252]

[1] "the command of her beauty, and her \xa320,000, any one who could satisfy the"

The text is likely encoded in Latin-1, not UTF-8 (or ASCII) as we had originally thought. We can test this by attempting to convert from Latin-1 to UTF-8 with the iconv() function and inspecting the output:

lines2 <- iconv(lines, "latin1", "UTF-8")
lines2[15252]

[1] "the command of her beauty, and her £20,000, any one who could satisfy the"

It worked! Now we can analyze our text.
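Beyond inspecting a single line, the conversion can be checked across the whole vector. The sketch below uses two stand-in strings instead of the full novel and relies only on base R: validUTF8 reports which elements are valid UTF-8, and iconv returns NA for any element it could not convert.

```r
# Two stand-in lines: one plain ASCII, one containing the Latin-1 byte 0xa3.
x <- c("Author: Jane Austen",
       rawToChar(as.raw(c(0x68, 0x65, 0x72, 0x20, 0xa3, 0x32, 0x30))))

# Before: the second element is not valid UTF-8.
before <- validUTF8(x)
before                    # TRUE FALSE

# Convert, treating the input as Latin-1.
y <- iconv(x, from = "latin1", to = "UTF-8")

# After: nothing was lost (no NA) and every element is valid UTF-8.
stopifnot(!anyNA(y), all(validUTF8(y)))

# ASCII-only text passes through the conversion unchanged.
y[1] == x[1]              # TRUE
```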

f <- corpus::text_filter(drop_punct = TRUE, drop = corpus::stopwords_en)
corpus::term_stats(lines2, f)

   term     count support
1  fanny      816     806
2  must       508     492
3  crawford   493     488
4  mr         482     466
5  much       459     450
6  miss       432     419
7  said       406     400
8  mrs        408     399
9  sir        372     366
10 edmund     364     364
11 one        370     358
12 think      349     346
13 now        333     331
14 might      324     320
15 time       310     307
16 little     309     300
17 nothing    301     291
18 well       299     286
19 thomas     288     285
20 good       280     275
⋮  (8450 rows total)

The readtext package

If you need more than reading in a single text file, the readtext package supports reading in text in a variety of file formats and encodings. Beyond just plain text, that package can read in PDFs, Word documents, RTF, and many other formats. (Unfortunately, that package currently fails when trying to read in Mansfield Park; the authors are aware of the issue and are working on a fix.)

Summary

Text comes in a variety of encodings, and you cannot analyze a text without first knowing its encoding. Many functions for reading in text assume that it is encoded in UTF-8, but this assumption sometimes fails to hold. If you get an error message reporting that your UTF-8 text is invalid, use utf8_valid to find the offending texts. Try printing the data to the console before and after using iconv to convert between character encodings. You can use utf8_print to print UTF-8 characters that R refuses to display, including emoji characters. For reading in exotic file formats like PDF or Word, try the readtext package.
