Something that leaps out at me when reading Wikipedia's article on binary-to-text encodings is that it apparently treats "plain text" as synonymous with "ASCII", or more specifically ASCII minus the control characters. This is admittedly a fairly safe definition of plain text, although there's a strong argument to be made that there is no such thing.
But there's another argument that plain text is Unicode strings. An increasing number of text-based systems accept Unicode. This gives us a vastly expanded character set to play with.
This also changes our definition of "efficiency". In fact it makes it a little more complicated, since there are several ways of encoding Unicode strings in turn. UTF-8 is a superset of ASCII: every ASCII character occupies a single byte, while every other character occupies two to four. Here we find that encodings staying inside ASCII are generally the most efficient, bit-for-bit.
But UTF-16-encoded Unicode strings are surprisingly prevalent (Windows, Java) and here we find that constraining ourselves only to ASCII is a bad move. Instead, it's better to use as much of the Basic Multilingual Plane as possible.
In UTF-32 the situation becomes even more extreme. UTF-32 is relatively rarely used because of its overall wastefulness: 11 out of every 32 bits are guaranteed to be zeroes. But when encoding binary data as UTF-32, efficiency is essentially a matter of counting code points. Which is how Twitter measures Tweet length...
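To make the difference between the three encodings concrete, here is a minimal Python sketch that measures how many bytes a single character costs in each. The sample characters and the helper name `encoded_size` are my own illustrative choices, not part of any of the encodings discussed here.

```python
# Sample characters: one ASCII, one from the Basic Multilingual Plane,
# one from a supplementary plane. These are illustrative picks only.
SAMPLES = {
    "ASCII letter": "A",
    "BMP character (Braille)": "\u28ff",
    "supplementary-plane character": "\U0001F600",
}

# The "-le" variants are used so that no byte order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


for label, ch in SAMPLES.items():
    sizes = ", ".join(f"{enc}: {encoded_size(ch, enc)}" for enc in ENCODINGS)
    print(f"{label} (U+{ord(ch):04X}) -> {sizes}")
```

An ASCII character is cheapest in UTF-8, a BMP character is cheapest in UTF-16, and in UTF-32 every character costs the same four bytes, which is why only the character count matters there.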
Of course there are pitfalls to using arbitrary Unicode characters in your encoding, just as there are when using arbitrary ASCII characters in your encoding. Many characters are unsafe for this purpose (see below). But once we recognise those dangerous areas of the Unicode space and know how to avoid them, we're still left with many code points and many, many possibilities.
So anyway, here are some new encodings I've come up with recently and how they shape up against existing ones. Some of these are jokes; one or two I believe have some legitimate value.
Efficiency ratings are averaged over long inputs. Higher is better.
If the output text is UTF-8-encoded, existing ASCII-based encodings remain the best choice.
If the output text is UTF-16-encoded, a stalwart encoding such as Base64 is so poor that using I Ching hexagrams has equivalent efficiency and using Braille — which has the added bonus of letting you see the bits — is strictly better. However, the state of the art is Base32768, which demolishes Base64 and the others, offering 94% (to be exact, 15/16ths) efficiency.
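As a rough illustration of how a Braille encoding lets you see the bits, here is a minimal Python sketch that maps each byte onto one of the 256 Braille pattern characters (U+2800 to U+28FF). The function names are hypothetical, and the straight `0x2800 + byte` mapping is an assumption made for simplicity; a real Braille encoding might arrange the dots differently.

```python
BRAILLE_BASE = 0x2800  # first code point of the Braille Patterns block


def encode_braille(data: bytes) -> str:
    """Map each byte to one Braille pattern character (assumed direct mapping)."""
    return "".join(chr(BRAILLE_BASE + byte) for byte in data)


def decode_braille(text: str) -> bytes:
    """Invert encode_braille; raises ValueError on characters outside the block."""
    return bytes(ord(ch) - BRAILLE_BASE for ch in text)


encoded = encode_braille(b"hello")
print(encoded)
print(decode_braille(encoded))
```

Each output character carries 8 bits and occupies 16 bits in UTF-16, against 6 bits per 16 for Base64.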
If the output is UTF-32-encoded, or if we simply want to count the number of characters in the output, then we can do a little better still with Base65536. Unfortunately UTF-32 is very inefficient: 11 of those 32 bits are never used, most of the million or so remaining code points have not been assigned and many of the assigned code points are unsafe. There's not really much scope for improvement here.
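The comparisons above come down to simple arithmetic: payload bits carried per output character, divided by the bits that character occupies once encoded. The sketch below works through it in Python, ignoring padding, in keeping with averaging over long inputs. The `efficiency` helper and the table of cases are my own illustration, and the per-character sizes assume every output character of the UTF-16 cases lies in the Basic Multilingual Plane.

```python
from fractions import Fraction


def efficiency(payload_bits: int, encoded_bits: int) -> Fraction:
    """Payload bits per output character over encoded bits per output character."""
    return Fraction(payload_bits, encoded_bits)


# (name, target encoding, payload bits per character, encoded bits per character)
CASES = [
    ("Base64", "UTF-8", 6, 8),
    ("Base64", "UTF-16", 6, 16),
    ("I Ching hexagrams (64 symbols)", "UTF-16", 6, 16),
    ("Braille (256 symbols)", "UTF-16", 8, 16),
    ("Base32768", "UTF-16", 15, 16),
    ("Base65536", "UTF-32", 16, 32),
]

for name, target, payload, encoded in CASES:
    ratio = efficiency(payload, encoded)
    print(f"{name} in {target}: {ratio} = {float(ratio):.2%}")
```

Exact fractions are used so that 15/16 is reported as such rather than as a rounded 94%.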
What makes a code point safe?
Safe Unicode code points are considered to be those which:
- Are assigned. Unassigned code points have unpredictable properties.
- Fall into the Letter, Number or Symbol General Categories. No Separators (whitespace), no Punctuation, no Marks (including combining diacritics), no Other (including control characters, private use characters and surrogates). (Note that Base85 uses several characters falling into these dangerous categories.)
- Are stable when subjected to all forms of Unicode normalization, which is to say NFC, NFD, NFKC and NFKD.
Strings made up from these code points can be passed through most text-handling systems without being altered.
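The criteria above can be approximated with Python's standard `unicodedata` module. This is only a sketch: the names `is_safe` and `count_safe` are hypothetical, it tests each code point in isolation, and the results depend on the Unicode version bundled with your Python interpreter, so counts outside ASCII may not match any particular version of the standard.

```python
import unicodedata

SAFE_MAJOR_CATEGORIES = {"L", "N", "S"}  # Letter, Number, Symbol
NORMALIZATION_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_safe(code_point: int) -> bool:
    """Approximate the safety criteria for a single code point."""
    ch = chr(code_point)
    # Unassigned code points report category "Cn", so this check rejects
    # them along with separators, punctuation, marks and other "C" categories.
    if unicodedata.category(ch)[0] not in SAFE_MAJOR_CATEGORIES:
        return False
    # The character must survive every normalization form unchanged.
    return all(unicodedata.normalize(form, ch) == ch for form in NORMALIZATION_FORMS)


def count_safe(start: int, end: int) -> int:
    """Count safe code points in the inclusive range [start, end]."""
    return sum(is_safe(cp) for cp in range(start, end + 1))


print(count_safe(0x0000, 0x007F))  # ASCII: 52 letters + 10 digits + 9 symbols
print(is_safe(ord("A")), is_safe(ord(" ")), is_safe(0x00C5))
```

For ASCII this gives 71: the 52 letters, the 10 digits and the 9 symbols `$ + < = > ^ | ~` plus the backtick. A precomposed letter such as U+00C5 is rejected because NFD decomposes it.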
| Code point range | Code points | Assigned (Unicode 8.0) | Safe (Unicode 8.0) | Assigned (Unicode 9.0) | Safe (Unicode 9.0) |
|---|---|---|---|---|---|
| U+0000 to U+007F | 128 | 128 | 71 | 128 | 71 |
| U+0080 to U+07FF | 1,920 | 1,856 | 1,086 | 1,856 | 1,086 |
| U+0800 to U+FFFF | 63,488 | 61,645 | 37,116 | 61,701 | 37,151 |
| U+10000 to U+10FFFF | 1,048,576 | 196,624 | 62,791 | 204,068 | 70,089 |