Unicode

4 Notes
+ Description 3 (Oct. 13, 2014, 3:19 a.m.)

http://nedbatchelder.com/text/unipain.html

The first Fact of Life: everything in a computer is bytes. Files on disk are a series of bytes, and network connections only transmit bytes. Almost without exception, all the data going into or out of any program you write is bytes.

The problem with bytes is that by themselves they are meaningless; we need conventions to give them meaning. To represent text, we've been using the ASCII code for nearly 50 years: it assigns 95 printable symbols to byte values. When I send you a byte 65, you know that I mean an upper-case A, but only because we've agreed beforehand on what each byte represents. ISO Latin 1 (8859-1) is ASCII extended with 96 more symbols, and Windows added 27 more to produce CP1252. This is pretty much the best you can do to represent text as single bytes, because there's not much room left to add more symbols. With character sets like these, we can represent at most 256 characters.

But Fact of Life #2 is that there are way more than 256 symbols in the world's text. A single byte simply can't represent text world-wide. During your darkest whack-a-mole moments, you may have wished that everyone spoke English, but it simply isn't so. People need lots of symbols to communicate. People tried creating double-byte character sets, but they were still fragmented, serving different subsets of people. There were multiple standards in place, and they still weren't large enough to deal with all the symbols needed.

Unicode was designed to deal decisively with the issues of older character codes. Unicode assigns integers, known as code points, to characters. It has room for 1.1 million code points, and only about 110,000 are assigned so far, so there's plenty of room for future growth. Unicode's goal is to have everything: it starts with ASCII, includes thousands of symbols (including the famous Snowman), covers all the writing systems of the world, and is constantly being expanded. For example, the latest update gave us the symbol PILE OF POO.

Here is a string of six exotic Unicode characters. Unicode code points are written as 4, 5, or 6 hex digits with a U+ prefix. Every character has an unambiguous full name, which is always in uppercase ASCII. This string is designed to look like the word "Python", but doesn't use any ASCII characters at all.

So Unicode makes room for all of the characters we could ever need, but we still have Fact of Life #1 to deal with: computers need bytes. We need a way to represent Unicode code points as bytes in order to store or transmit them. The Unicode standard defines a number of ways to represent code points as bytes. These are called encodings.

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point: the higher the code point value, the more bytes it needs in UTF-8. ASCII characters are one byte each, using the same values as ASCII, so ASCII is a subset of UTF-8. Here we show our exotic string as UTF-8. The ASCII characters H and i are single bytes, and the other characters use two or three bytes depending on their code point value. Some Unicode code points require four bytes, but we aren't using any of those here.

In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points. In a unicode string literal, you can use backslash-u escapes to insert any Unicode code point.
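As a quick aside (a Python 3 sketch of my own, not part of the original talk), the same backslash-u escapes work in Python 3, and the standard unicodedata module can report each code point's official name:

# Python 3 sketch (mine): the exotic string's code points and official names.
import unicodedata

exotic = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
for ch in exotic[3:]:                        # skip the ASCII "Hi "
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")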
>>> my_string = "Hello World" >>> type(my_string) <type 'str'> >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24" >>> type(my_unicode) <type 'unicode'> Str vs Unicode str: a sequence of bytes unicode: a sequence of code points (unicode) Byte strings and unicode strings each have a method to convert it to the other type of string. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. Each takes an argument, which is the name of the encoding to use for the operation. .encode() and .decode() unicode .encode() → bytes bytes .decode() → unicode >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24" >>> len(my_unicode) 9 >>> my_utf8 = my_unicode.encode('utf-8') >>> len(my_utf8) 19 >>> my_utf8 'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4' >>> my_utf8.decode('utf-8') u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24' Notice that the word "string" is problematic. Both "str" and "unicode" are kinds of strings, and it's tempting to call either or both of them "string," but it's better to use more specific terms to keep things straight. Unfortunately, encoding and decoding can produce errors if the data isn't appropriate for the specified encoding. Here we try to encode our exotic Unicode string to ASCII. It fails because ASCII can only represent charaters in the range 0 to 127, and our Unicode string has code points outside that range. The UnicodeEncodeError that's raised indicates the encoding being used, in the form of the "codec" (short for coder/decoder), and the actual position of the character that caused the problem. Decoding errors Not all byte sequences are valid 2 >>> my_utf8.decode("ascii") Traceback (most recent call last): UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128) >>> "\x78\x9a\xbc\xde\xf0".decode("utf-8") Traceback (most recent call last): return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 1: invalid start byte Decoding can also produce errors. Here we try to decode our UTF-8 string as ASCII and get a UnicodeDecodeError because again, ASCII can only accept values up to 127, and our UTF-8 string has bytes outside that range. Even UTF-8 can't decode any sequence of bytes. Next we try to decode some random junk, and it also produces a UnicodeDecodeError. Actually, one of UTF-8's advantages is that there are invalid sequences of bytes, which helps to build robust systems: mistakes in data won't be accepted as if they were valid. >>> my_unicode.encode("ascii", "replace") 'Hi ??????' >>> my_unicode.encode("ascii", "xmlcharrefreplace") 'Hi &#8473;&#436;&#9730;&#8460;&#248;&#7972;' >>> my_unicode.encode("ascii", "ignore") 'Hi ' When encoding or decoding, you can specify what should happen when the codec can't handle the data. An optional second argument to encode or decode specifies the policy. The default value is "strict", which means raise an error, as we've seen. A value of "replace" means, give me a standard replacement character. When encoding, the replacement character is a question mark, so any code point that can't be encoded using the specified encoding will simply produce a "?". Other error handlers are more useful. "xmlcharrefreplace" produces an HTML/XML character entity reference, so that \u01B4 becomes "&#436;" (hex 01B4 is decimal 436.) This is very useful if you need to output unicode for an HTML file. 
Notice that different error policies are used for different reasons. "replace" is a defensive mechanism against data that cannot be interpreted, and it loses information. "xmlcharrefreplace" preserves all the original information, and is used when outputting data where XML escapes are acceptable.

You can also specify error handling when decoding. "ignore" will drop bytes that can't decode properly, and "replace" will insert a Unicode U+FFFD, "REPLACEMENT CHARACTER", for problem bytes. Notice that since the decoder can't decode the data, it doesn't know how many Unicode characters were intended: decoding our UTF-8 bytes as ASCII produces 16 replacement characters, one for each byte that couldn't be decoded, while those bytes were meant to produce only 6 Unicode characters. (A Python 3 sketch of these decode handlers follows below.)

Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string to produce a second unicode string, then complete the operation with the two unicode strings. For example, we try to concatenate a unicode "Hello " with a byte string "world". The result is a unicode "Hello world": on our behalf, Python 2 decodes the byte string "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().

Implicit conversion (Python 2): mixing bytes and unicode implicitly decodes.

>>> u"Hello " + "world"
u'Hello world'
>>> u"Hello " + ("world".decode("ascii"))
u'Hello world'
>>> sys.getdefaultencoding()
'ascii'

The implicit decoding uses ASCII because it's the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it's unlikely to produce false positives.

Implicit decoding errors (Python 2):

>>> u"Hello " + my_utf8
Traceback (most recent call last):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> u"Hello " + (my_utf8.decode("ascii"))
Traceback (most recent call last):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can't be decoded as ASCII, the operation will raise a UnicodeDecodeError. This is the source of those painful UnicodeErrors: your code inadvertently mixes unicode strings and byte strings, and as long as the data is all ASCII, the implicit conversions silently succeed. Once a non-ASCII character finds its way into your program, an implicit decode will fail, causing a UnicodeDecodeError.

Python 2's philosophy was that unicode strings and byte strings are confusing, and it tried to ease your burden by automatically converting between them, just as it does for ints and floats. But the conversion from int to float can't fail, while byte string to unicode string can. Python 2 silently glosses over byte-to-unicode conversions, making it much easier to write code that deals with ASCII. The price you pay is that it will fail with non-ASCII data.
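As promised above, here is a small sketch of the decode error handlers. It is my own example in Python 3 syntax (where .decode() and its error handlers behave the same way), not code from the original talk:

# Python 3 sketch (mine): decode error handlers on the exotic UTF-8 bytes.
my_utf8 = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")   # 19 bytes

print(my_utf8.decode("ascii", "ignore"))     # 'Hi ' -- non-ASCII bytes dropped
replaced = my_utf8.decode("ascii", "replace")
print(replaced)                              # 'Hi ' plus 16 U+FFFD characters
print(replaced.count("\ufffd"))              # 16, one per undecodable byte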
Other implicit conversions (Python 2):

>>> "Title: %s" % my_unicode
u'Title: Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
>>> u"Title: %s" % my_string
u'Title: Hello World'
>>> print my_unicode
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-8: ordinal not in range(128)
>>> my_utf8.encode('utf-8')  # silly
Traceback (most recent call last):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> my_string.encode('utf-8')  # silly
'Hello World'

There are lots of ways to combine two strings, and all of them will decode bytes to unicode, so you have to watch out for them. Here we use an ASCII format string with unicode data. The format string will be decoded to unicode, then the formatting performed, resulting in a unicode string. Next we switch the two: a unicode format string and a byte string again combine to produce a unicode string, because the byte string data is decoded as ASCII.

Even just attempting to print a unicode string will cause an implicit encoding: output is always bytes, so the unicode string has to be encoded into bytes before it can be printed.

The next one is truly confusing: we ask to encode a byte string to UTF-8, and get an error about not being able to decode as ASCII! The problem here is that byte strings can't be encoded: remember, encode is how you turn unicode into bytes. So to perform the encoding you want, Python 2 needs a unicode string, which it tries to get by implicitly decoding your bytes as ASCII. You asked to encode to UTF-8, and you get an error about decoding ASCII. It pays to look carefully at the error: it has clues about what operation is being attempted, and how it failed.

Lastly, we encode an ASCII string to UTF-8, which is silly: encode should be used on unicode strings. To make it work, Python performs the same implicit decode to get a unicode string it can encode, but since the string is ASCII, it succeeds, and then goes on to encode it as UTF-8, producing the original byte string, since ASCII is a subset of UTF-8.

We've seen the source of Unicode pain in Python 2; now let's take a look at Python 3. The biggest change from Python 2 to Python 3 is their treatment of Unicode.

Str vs bytes in Python 3: str is a sequence of code points (unicode); bytes is a sequence of bytes.

>>> my_string = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
>>> type(my_string)
<class 'str'>
>>> my_bytes = b"Hello World"
>>> type(my_bytes)
<class 'bytes'>

Just as in Python 2, Python 3 has two string types, one for unicode and one for bytes, but they are named differently. Now the "str" type that you get from a plain string literal stores unicode, and the "bytes" type stores bytes. You can create a bytes literal with a b prefix. So "str" in Python 2 is now called "bytes", and "unicode" in Python 2 is now called "str". This makes more sense than the Python 2 names, since Unicode is how you want all text stored, and byte strings are only for when you are dealing with bytes.

No coercion! Python 3 won't implicitly convert between bytes and unicode.

>>> "Hello " + b"world"
Traceback (most recent call last):
TypeError: Can't convert 'bytes' object to str implicitly
>>> "Hello" == b"Hello"
False
>>> d = {"Hello": "world"}
>>> d[b"Hello"]
Traceback (most recent call last):
KeyError: b'Hello'

The biggest change in the Unicode support in Python 3 is that there is no automatic decoding of byte strings.
If you try to combine a byte string with a unicode string, you will get an error every time, regardless of the data involved! All of those operations where Python 2 silently converted byte strings to unicode strings to complete an operation are errors in Python 3. In addition, Python 2 considers a unicode string and a byte string equal if they contain the same ASCII bytes, and Python 3 won't. A consequence of this is that unicode dictionary keys can't be found with byte strings, and vice-versa, as they can be in Python 2.

This drastically changes the nature of Unicode pain in Python 3. In Python 2, mixing unicode and bytes succeeds so long as you only use ASCII data. In Python 3, it fails immediately regardless of the data. So Python 2's pain is deferred: you think your program is correct, and find out later that it fails with exotic characters. With Python 3, your code fails immediately, so even if you are only handling ASCII, you have to explicitly deal with the difference between bytes and unicode. Python 3 is strict about the difference between bytes and unicode. You are forced to be clear in your code about which you are dealing with. This has been controversial, and can cause you pain.

Reading files (Python 3):

>>> open("hello.txt", "r").read()
'Hello, world!\n'
>>> open("hello.txt", "rb").read()
b'Hello, world!\n'
>>> open("hi_utf8.txt", "r").read()
'Hi \xe2\u201e\u2122\xc6\xb4\xe2\u02dc\u201a\xe2\u201e\u0152\xc3\xb8\xe1\xbc\xa4'
>>> open("hi_utf8.txt", "r",
...      encoding=locale.getpreferredencoding()).read()
'Hi \xe2\u201e\u2122\xc6\xb4\xe2\u02dc\u201a\xe2\u201e\u0152\xc3\xb8\xe1\xbc\xa4'
>>> open("hi_utf8.txt", "r", encoding="utf-8").read()
'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
>>> open("hi_utf8.txt", "rb").read()
b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'

Because of this new strictness, Python 3 has changed how you read files. Python has always had two modes for reading files: binary and text. In Python 2, the mode only affected line endings, and on Unix platforms even that was a no-op. In Python 3, the two modes produce different results. When you open a file in text mode, either with "r" or by letting the mode default, the data read from the file is implicitly decoded into Unicode, and you get str objects. If you open a file in binary mode, by supplying "rb" as the mode, then the data read from the file is bytes, with no processing done on them.

The implicit conversion from bytes to unicode uses the encoding returned from locale.getpreferredencoding(), and it may not give you the results you expect. For example, when we read hi_utf8.txt, it's being decoded using the locale's preferred encoding, which, since I created these samples on Windows, is "cp1252". Like ISO 8859-1, CP-1252 is a one-byte character code that will accept any byte value, so it will never raise a UnicodeDecodeError. That also means it will happily decode data that isn't actually CP-1252, and produce garbage. To get the file read properly, you should specify an encoding to use: the open() function now has an optional encoding parameter.

OK, so how do we deal with all this pain? The good news is that the rules to remember are simple, and they're the same for Python 2 and Python 3. As we saw with Fact of Life #1, the data coming into and going out of your program must be bytes. But you don't need to deal with bytes on the inside of your program. The best strategy is to decode incoming bytes as soon as possible, producing unicode.
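Relating to the file-reading examples above, here is a minimal Python 3 sketch of decoding at the edge. It is my own example; the file hi_utf8.txt is the sample file from the examples and is assumed to exist:

# Python 3 sketch (mine): decode incoming bytes as early as possible.
with open("hi_utf8.txt", "rb") as f:
    raw = f.read()               # bytes, exactly as stored on disk
text = raw.decode("utf-8")       # unicode (str) from here on

# Or let open() do the decode -- but always name the encoding explicitly
# rather than relying on locale.getpreferredencoding():
with open("hi_utf8.txt", "r", encoding="utf-8") as f:
    text = f.read()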
You use unicode throughout your program, and then when outputting data, encode it to bytes as late as possible. This creates a Unicode sandwich: bytes on the outside, Unicode on the inside. Keep in mind that sometimes a library you're using may do some of these conversions for you. The library may present you with Unicode input, or accept Unicode for output, and take care of the edge conversion to and from bytes. For example, Django provides Unicode, as does the json module.

The second rule is: you have to know what kind of data you are dealing with. At any point in your program, you need to know whether you have a byte string or a unicode string. This shouldn't be a matter of guessing; it should be by design. In addition, if you have a byte string, you should know what encoding it is in if you ever intend to deal with it as text. When debugging your code, you can't simply print a value to see what it is. You need to look at the type, and you may need to look at the repr of the value to get to the bottom of what data you have.

I said you have to know what encoding your byte strings are in. Here's Fact of Life #4: you can't determine the encoding of a byte string by examining it. You need to know through other means. For example, many protocols include ways to specify the encoding: HTTP, HTML, XML, and Python source files all have them. You may also know the encoding by prior arrangement; for example, the spec for a data source may specify the encoding. There are ways to guess at the encoding of bytes, but they are just guesses. The only way to be sure of the encoding is to find it out some other way.

Here's an example of our exotic Unicode string, encoded as UTF-8 and then mistakenly decoded in a variety of encodings. As you can see, decoding with an incorrect encoding might succeed, but produce the wrong characters. Your program can't tell it's decoding wrong; only when people try to read the text will you know something has gone wrong. This is a good demonstration of Fact of Life #4: the same stream of bytes is decodable using a number of different encodings. The bytes themselves don't indicate what encoding they use. By the way, there's a term for this garbage display, from the Japanese, who have been dealing with it for years and years: mojibake.

Unfortunately, because the encoding for bytes has to be communicated separately from the bytes themselves, sometimes the specified encoding is wrong. For example, you may pull an HTML page from a web server, and the HTTP header claims the page is 8859-1, but in fact it is encoded with UTF-8. In some cases the encoding mismatch will succeed and cause mojibake; other times the encoding is invalid for the bytes and will cause a UnicodeError of some sort.

It should go without saying: you should explicitly test your Unicode support. To do this, you need challenging Unicode data to pump through your code. If you are an English-only speaker, you may have a problem doing this, because lots of non-ASCII data is hard to read. Luckily, the variety of Unicode code points means you can construct complex Unicode strings that are still readable by English speakers: overly-accented text, readable pseudo-ASCII text, upside-down text, and so on. One good source of these sorts of strings is the various web sites that offer text like this for teenagers to paste into social networking sites.

Depending on your application, you may need to dig deeper into the other complexities of the Unicode world.
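The mojibake demonstration mentioned above isn't reproduced in this note, but a rough Python 3 sketch of the same idea (my own, not the talk's slide) looks like this:

# Python 3 sketch (mine): the same UTF-8 bytes decoded with several codecs.
# Wrong codecs may still "succeed" -- and produce mojibake.
data = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")
for codec in ["utf-8", "latin-1", "cp1252", "mac_roman"]:
    try:
        print(codec, "->", data.decode(codec))
    except UnicodeDecodeError as err:
        print(codec, "-> failed:", err)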
There are many details that I haven't covered here, and they can be very involved. I call this Fact of Life #5½ because you may not have to deal with any of this.

To review, these are the five unavoidable Facts of Life:
1. All input and output of your program is bytes.
2. The world needs more than 256 symbols to communicate text.
3. Your program has to deal with both bytes and Unicode.
4. A stream of bytes can't tell you its encoding.
5. Encoding specifications can be wrong.

These are the three Pro Tips to keep in mind as you build your software to keep your code Unicode-clean:
1. Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.
2. Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and, for your byte strings, what encoding they use.
3. Test your Unicode support: use exotic strings throughout your test suites to be sure you're covering all the cases (see the sketch below).

If you follow these tips, you'll write good solid code that deals well with Unicode, and won't fall over no matter how wild the Unicode it encounters.
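As a closing illustration of Pro Tip #3, here is a small hedged sketch of what exotic test data might look like; round_trip is a hypothetical stand-in for whatever your code actually does with text:

# Python 3 sketch (mine): pump exotic strings through the code under test.
def round_trip(text):
    # Hypothetical placeholder for real application code.
    return text.encode("utf-8").decode("utf-8")

SAMPLES = [
    "Hello, world",                            # plain ASCII
    "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24",   # the exotic string from this note
    "éßüäöå",                                  # accented Latin letters
]

for sample in SAMPLES:
    assert round_trip(sample) == sample, repr(sample)
print("ok")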

+ Encoding vs. Encryption vs. Hashing (Aug. 22, 2014, 8:04 a.m.)

Encoding is often confused with encryption and hashing. They are not the same. But before I go into the differences, I'll first mention how they relate: all three transform data into another format. Encoding and encryption are reversible; hashing is not. Let's take a look at each one.

Encoding

The purpose of encoding is to transform data so that it can be properly (and safely) consumed by a different type of system, e.g. binary data being sent over email, or viewing special characters on a web page. The goal is not to keep information secret, but rather to ensure that it's able to be properly consumed. Encoding transforms data into another format using a scheme that is publicly available, so that it can easily be reversed. It does not require a key: the only thing required to decode it is the algorithm that was used to encode it. Examples: ASCII, Unicode, URL Encoding, Base64.

Encryption

The purpose of encryption is to transform data in order to keep it secret from others, e.g. sending someone a secret letter that only they should be able to read, or securely sending a password over the Internet. Rather than focusing on usability, the goal is to ensure the data cannot be consumed by anyone other than the intended recipient(s). Encryption transforms data into another format in such a way that only specific individual(s) can reverse the transformation. It uses a key, which is kept secret, in conjunction with the plaintext and the algorithm, in order to perform the encryption operation. As such, the ciphertext, algorithm, and key are all required to return to the plaintext. Examples: AES, Blowfish, RSA.

Hashing

Hashing serves the purpose of ensuring integrity, i.e. making it so that if something is changed you can know that it's changed. Technically, hashing takes arbitrary input and produces a fixed-length string that has the following attributes:
- The same input will always produce the same output.
- Multiple disparate inputs should not produce the same output.
- It should not be possible to go from the output to the input.
- Any modification of a given input should result in a drastic change to the hash.

Hashing is used in conjunction with authentication to produce strong evidence that a given message has not been modified. This is accomplished by taking a given input, encrypting it with a given key, hashing it, and then encrypting the key with the recipient's public key and signing the hash with the sender's private key. When the recipient opens the message, they can then decrypt the key with their private key, which allows them to decrypt the message. They then hash the message themselves and compare it to the hash that was signed by the sender. If they match, it is an unmodified message sent by the correct person. Examples: SHA-3, MD5 (now obsolete), etc.

Summary

Encoding is for maintaining data usability and can be reversed by employing the same algorithm that encoded the content, i.e. no key is used. Encryption is for maintaining data confidentiality and requires the use of a key (kept secret) in order to return to plaintext. Hashing is for validating the integrity of content by detecting all modification thereof via obvious changes to the hash output.
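To make the reversible/irreversible distinction concrete, here is a small Python sketch of my own using only the standard library (encryption is left out because it needs a key and a proper crypto library):

# Python sketch (mine): encoding is reversible without a key; hashing is not.
import base64
import hashlib

data = "attack at dawn".encode("utf-8")

encoded = base64.b64encode(data)            # encoding: reversible, no key
assert base64.b64decode(encoded) == data

digest = hashlib.sha256(data).hexdigest()   # hashing: fixed length, one-way
print(encoded.decode("ascii"), digest)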

+ Description 2 (Aug. 22, 2014, 8:03 a.m.)

A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits.
-----------------------------------------------------------------
A bit can only have two values: yes or no, true or false, 1 or 0 or whatever else you want to call these two values.
-----------------------------------------------------------------
To use bits to represent anything at all besides bits, we need rules. We need to convert a sequence of bits into something like letters, numbers and pictures using an encoding scheme, or encoding for short. Like this:

01100010 01101001 01110100 01110011
b        i        t        s
-----------------------------------------------------------------
A certain sequence of bits stands for a letter and a letter stands for a certain sequence of bits. If you can keep this in your head for 26 letters or are really fast with looking stuff up in a table, you could read bits like a book.
-----------------------------------------------------------------
The above encoding scheme happens to be ASCII. A string of 1s and 0s is broken down into parts of eight bits each (a byte for short). The ASCII encoding specifies a table translating bytes into human readable letters. Here's a short excerpt of that table:

bits      character
01000001  A
01000010  B
01000011  C
01000100  D
01000101  E
01000110  F
-----------------------------------------------------------------
There are 95 human readable characters specified in the ASCII table, including the letters A through Z both in upper and lower case, the numbers 0 through 9, a handful of punctuation marks and characters like the dollar symbol, the ampersand and a few others. It also includes 33 values for things like space, line feed, tab, backspace and so on. These are not printable per se, but still visible in some form and useful to humans directly. A number of values are only useful to a computer, like codes to signify the start or end of a text.
-----------------------------------------------------------------
In total there are 128 characters defined in the ASCII encoding, which is a nice round number (for people dealing with computers), since it uses all possible combinations of 7 bits (0000000, 0000001, 0000010 through 1111111). And there you have it, the way to represent human-readable text using only 1s and 0s.

01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100
"Hello World"
-----------------------------------------------------------------
character set, charset: The set of characters that can be encoded. "The ASCII encoding encompasses a character set of 128 characters." Essentially synonymous to "encoding".
-----------------------------------------------------------------
code page: A "page" of codes that map a character to a number or bit sequence. A.k.a. "the table". Essentially synonymous to "encoding".
-----------------------------------------------------------------
There are many ways to write numbers. 10011111 in binary is 237 in octal is 159 in decimal is 9F in hexadecimal. They all represent the same value, but hexadecimal is shorter and easier to read than binary. I will stick with binary throughout this article to get the point across better and spare the reader one layer of abstraction. Do not be alarmed to see character codes referred to in other notations elsewhere, it's all the same thing.
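A tiny Python sketch of my own that applies the table above in the other direction, turning the bit sequence back into characters:

# Python sketch (mine): decode the ASCII bit sequence from above.
bits = "01100010 01101001 01110100 01110011"
text = "".join(chr(int(chunk, 2)) for chunk in bits.split())
print(text)   # -> bits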
-----------------------------------------------------------------
Now that we know what we're talking about, let's just say it: 95 characters really isn't a lot when it comes to languages. It covers the basics of English, but what about writing a risqué letter in French? A Straßenübergangsänderungsgesetz in German? An invitation to a smörgåsbord in Swedish? Well, you couldn't. Not in ASCII. There's no specification on how to represent any of the letters é, ß, ü, ä, ö or å in ASCII, so you can't use them.
-----------------------------------------------------------------
"But look at it," the Europeans said, "in a common computer with 8 bits to the byte, ASCII is wasting an entire bit which is always set to 0! We can use that bit to squeeze a whole 'nother 128 values into that table!" And so they did. But even so, there are more than 128 ways to stroke, slice, slash and dot a vowel. Not all variations of letters and squiggles used in all European languages can be represented in the same table with a maximum of 256 values. So what the world ended up with is a wealth of encoding schemes, standards, de-facto standards and half-standards that all cover a different subset of characters. Somebody needed to write a document about Swedish in Czech, found that no encoding covered both languages and invented one. Or so I imagine it went countless times over.
-----------------------------------------------------------------
And not to forget about Russian, Hindi, Arabic, Hebrew, Korean and all the other languages currently in active use on this planet. Not to mention the ones not in use anymore. Once you have solved the problem of how to write mixed-language documents in all of these languages, try yourself on Chinese. Or Japanese. Both contain tens of thousands of characters. You have 256 possible values to a byte consisting of 8 bits. Go!
-----------------------------------------------------------------
Finally somebody had enough of the mess and set out to forge a ring to bind them all (that is, to create one encoding standard to unify all encoding standards). This standard is Unicode. It basically defines a ginormous table of 1,114,112 code points that can be used for all sorts of letters and symbols. That's plenty to encode all European, Middle-Eastern, Far-Eastern, Southern, Northern, Western, pre-historian and future characters mankind knows about. Using Unicode, you can write a document containing virtually any language using any character you can type into a computer. This was either impossible or very very hard to get right before Unicode came along.
-----------------------------------------------------------------
Unicode is big enough to allow for unofficial, private-use areas.
-----------------------------------------------------------------
So, how many bits does Unicode use to encode all these characters? None. Because Unicode is not an encoding. Confused? Many people seem to be.
-----------------------------------------------------------------
Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic.
-----------------------------------------------------------------
To represent 1,114,112 different values, two bytes aren't enough. Three bytes are, but three bytes are often awkward to work with, so four bytes would be the comfortable minimum.
But, unless you're actually using Chinese or some of the other characters with big numbers that take a lot of bits to encode, you're never going to use a huge chunk of those four bytes. If the letter "A" was always encoded to 00000000 00000000 00000000 01000001, "B" always to 00000000 00000000 00000000 01000010 and so on, any document would bloat to four times the necessary size.
-----------------------------------------------------------------
To optimize this, there are several ways to encode Unicode code points into bits. UTF-32 is such an encoding that encodes all Unicode code points using 32 bits. That is, four bytes per character. It's very simple, but often wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be represented using a single byte (because its code point is a very small number), UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes and so on. It has elaborate ways to use the highest bits in a byte to signal how many bytes a character consists of. This can save space, but may also waste space if these signal bits need to be used often. UTF-16 is in the middle, using at least two bytes, growing to up to four bytes as necessary.

character  encoding  bits
A          UTF-8     01000001
A          UTF-16    00000000 01000001
A          UTF-32    00000000 00000000 00000000 01000001
あ         UTF-8     11100011 10000001 10000010
あ         UTF-16    00110000 01000010
あ         UTF-32    00000000 00000000 00110000 01000010
-----------------------------------------------------------------
And that's all there is to it. Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme. There's nothing special about it, it's just trying to cover everything while still being efficient. And that's A Good Thing.
-----------------------------------------------------------------
Code Points
Characters are referred to by their "Unicode code point". Unicode code points are written in hexadecimal (to keep the numbers shorter), preceded by a "U+" (that's just what they do, it has no other meaning than "this is a Unicode code point"). The character Ḁ has the Unicode code point U+1E00. In other (decimal) words, it is the 7680th character of the Unicode table. It is officially called "LATIN CAPITAL LETTER A WITH RING BELOW".
-----------------------------------------------------------------
Why In God's Name Are My Characters Garbled?!
ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
If you open a document and it looks like this, there's one and only one reason for it: Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken (well, unless it is, see below), there's no magic you need to perform, you simply need to select the right encoding to display the document.
-----------------------------------------------------------------
Now, quick, what encoding is that? If you just shrugged, you'd be correct. Who knows, right‽ Well, let's try to interpret this as ASCII. Hmm, most of these bytes start with a 1 bit. If you remember correctly, ASCII doesn't use that bit. So it's not ASCII. What about UTF-8? Hmm, no, most of these sequences are not valid UTF-8. So UTF-8 is out, too. Let's try "Mac Roman" (yet another encoding scheme for them Europeans). Hey, all those bytes are valid in Mac Roman. 10000011 maps to "É", 01000111 to "G" and so on.
If you read this bit sequence using the Mac Roman encoding, the result is "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That looks like a valid string, no? Yes? Maybe? Well, how's the computer to know? Maybe somebody meant to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". For all I know that could be a DNA sequence. Unless you have a better suggestion, let's declare this to be a DNA sequence, say this document was encoded in Mac Roman and call it a day. Of course, that unfortunately is complete nonsense. The correct answer is that this text is encoded in the Japanese Shift-JIS encoding and was supposed to read "エンコーディングは難しくない". Well, who'd've thunk?
-----------------------------------------------------------------
UTF-8 And ASCII
The ingenious thing about UTF-8 is that it's binary compatible with ASCII, which is the de-facto baseline for all encodings. All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII. In other words, ASCII maps 1:1 unto UTF-8. Any character not in ASCII takes up two or more bytes in UTF-8. For most programming languages that expect to parse ASCII, this means you can include UTF-8 text directly in your programs:

$string = "漢字";

Saving this as UTF-8 results in this bit sequence:

00100100 01110011 01110100 01110010 01101001 01101110 01100111 00100000
00111101 00100000 00100010 11100110 10111100 10100010 11100101 10101101
10010111 00100010 00111011

Only bytes 12 through 17 (the ones starting with 1) are UTF-8 multi-byte sequences (two characters with three bytes each). All the surrounding characters are perfectly good ASCII. A parser that knows nothing about UTF-8 would simply read this as:

$string = "11100110 10111100 10100010 11100101 10101101 10010111";
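To tie the UTF table and the ASCII-compatibility point together, here is a small Python sketch of my own (the big-endian, BOM-less codec variants are used so the bytes match the table above):

# Python sketch (mine): byte lengths of "A" and "あ" in the three UTF encodings,
# plus UTF-8's byte-for-byte compatibility with ASCII.
for ch in ["A", "あ"]:
    for codec in ["utf-8", "utf-16-be", "utf-32-be"]:
        encoded = ch.encode(codec)
        print(ch, codec, encoded.hex(), "-", len(encoded), "bytes")

assert "Hello World".encode("ascii") == "Hello World".encode("utf-8")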

+ Description 1 (Aug. 22, 2014, 8 a.m.)

1-The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. ------------------------------------------------------------- 2-Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. ------------------------------------------------------------- 3-Codes below 32 were called unprintable. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in. ------------------------------------------------------------- 4-All was good, assuming you were an English speaker. ------------------------------------------------------------- 5-Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. ------------------------------------------------------------- 6-In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. ------------------------------------------------------------- 7-For example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! ------------------------------------------------------------- 8-Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad. ------------------------------------------------------------- 9-In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense. ------------------------------------------------------------- 10-Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory: A -> 0100 0001 ------------------------------------------------------------- 11-In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story. In Unicode, the letter A is a platonic ideal. It's just floating in heaven. ------------------------------------------------------------- 12-Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. ------------------------------------------------------------- 13-The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. 
The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site: http://www.unicode.org/ ------------------------------------------------------------- 14-There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway. ------------------------------------------------------------- 15-OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points: U+0048 U+0065 U+006C U+006C U+006F. Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message. ------------------------------------------------------------- 16-That's where encodings come in. The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes 00 48 00 65 00 6C 00 6C 00 6F Right? Not so fast! Couldn't it also be: 48 00 65 00 6C 00 6C 00 6F 00 ? Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. ------------------------------------------------------------- 17-For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. ------------------------------------------------------------- 18-For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse. Thus was invented the brilliant concept of UTF-8. ------------------------------------------------------------- 19-UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. ------------------------------------------------------------- 20-In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops.
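A brief Python sketch of my own to make the code-point ideas above concrete: ord() and chr() work directly on code points, and the two UTF-16 byte orders show the endianness issue described in item 16.

# Python sketch (mine): code points and the two UTF-16 byte orders.
print(hex(ord("A")))                      # 0x41 -> U+0041
print(chr(0x0639))                        # U+0639, the Arabic letter Ain
print([f"U+{ord(c):04X}" for c in "Hello"])
# -> ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

print("Hello".encode("utf-16-be").hex())  # 00480065006c006c006f
print("Hello".encode("utf-16-le").hex())  # 480065006c006c006f00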