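tl;dr: if you wish to efficiently encode binary data as Unicode text, use an existing ASCII-based encoding such as Base64 when the output is UTF-8, Base32768 when it is UTF-16, and Base65536 when it is UTF-32.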
Something which leaps out at me when reading Wikipedia's article on binary-to-text encodings is that apparently "plain text" is synonymous with "ASCII", or more specifically ASCII minus the control characters. This is admittedly a fairly safe definition of plain text, although there's a strong argument to be made that there is no such thing.
But there's another argument that plain text is Unicode strings. An increasing number of text-based systems accept Unicode. This gives us a vastly expanded character set to play with.
This also changes our definition of "efficiency" — in fact it makes it a little more complicated, since there are several ways of encoding Unicode strings in turn. UTF-8-encoded Unicode strings are a superset of ASCII, and here we find that encodings staying inside ASCII are generally the most efficient, bit-for-bit.
But UTF-16-encoded Unicode strings are surprisingly prevalent (Windows, Java) and here we find that constraining ourselves only to ASCII is a bad move. Instead, it's better to use as much of the Basic Multilingual Plane as possible.
In UTF-32 the situation becomes even more extreme. UTF-32 is relatively rarely used because of its overall wastefulness: 11 out of every 32 bits are guaranteed to be zeroes. But when encoding binary data as UTF-32, efficiency is essentially a matter of counting code points. Which is how Twitter measures Tweet length...
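To make the three cases concrete, here is a small sketch that measures how many bytes a single character costs in each encoding form. The helper name `encoded_sizes` is made up for illustration, and the three sample characters (an ASCII letter, an I Ching hexagram from the BMP, and an emoji from outside it) are just convenient representatives.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes `text` occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is prepended.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for ch in ("A", "\u4dc0", "\U0001f600"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

An ASCII character is cheap only in UTF-8; in UTF-16 it costs as much as any other BMP character, and in UTF-32 every code point costs the same four bytes.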
Of course there are pitfalls to using arbitrary Unicode characters in your encoding, just as there are when using arbitrary ASCII characters in your encoding. Many characters are unsafe for this purpose (see below). But once we recognise those dangerous areas of the Unicode space and know how to avoid them, we're still left with many code points and many, many possibilities.
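As a rough illustration of what avoiding the dangerous areas might look like, the sketch below filters code points using Python's `unicodedata` module. The criteria here are my own assumptions for the sake of the example, not a definition taken from this article: it rejects control, format, surrogate, private-use and unassigned code points, separators (whitespace), combining marks, and anything that changes under Unicode normalization. The name `looks_safe` is hypothetical.

```python
import unicodedata

# Assumed "unsafe" general categories: C* (control, format, surrogate,
# private use, unassigned), Z* (separators) and M* (combining marks).
UNSAFE_CATEGORY_PREFIXES = ("C", "Z", "M")
NORMALIZATION_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def looks_safe(ch: str) -> bool:
    """Heuristic check that a single character survives being passed around as text."""
    if unicodedata.category(ch).startswith(UNSAFE_CATEGORY_PREFIXES):
        return False
    # A character that normalization rewrites could be silently altered in transit.
    return all(unicodedata.normalize(form, ch) == ch for form in NORMALIZATION_FORMS)


if __name__ == "__main__":
    for ch in ("A", "\n", "\u0301", "\u212b", "\u2800", "\u4dc0"):
        print(f"U+{ord(ch):04X}", looks_safe(ch))
```

The result depends on the Unicode version bundled with the Python interpreter, since code points unassigned in one version may be assigned in a later one. A real encoding would also need to consider context-specific hazards, such as quote marks or markup characters, which this check ignores.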
So anyway, here are some new encodings I've come up with recently and how they shape up to existing ones. Some of these are jokes; one or two I believe to have some legitimate value.
Efficiency ratings are averaged over long inputs. Higher is better.
| Character set | Encoding | Efficiency (UTF-8) | Efficiency (UTF-16) | Efficiency (UTF-32) |
|---|---|---|---|---|
| ASCII-constrained | Unary / Base1 | 0% | 0% | 0% |
| ASCII-constrained | Binary | 13% | 6% | 3% |
| ASCII-constrained | Hexadecimal | 50% | 25% | 13% |
| ASCII-constrained | Base64 | 75% | 38% | 19% |
| ASCII-constrained | Base85 | 80% | 40% | 20% |
| BMP-constrained | HexagramEncode | 25% | 38% | 19% |
| BMP-constrained | BrailleEncode | 33% | 50% | 25% |
| BMP-constrained | Base32768 | 63% | 94% | 47% |
| Full Unicode | Ecoji | 31% | 31% | 31% |
| Full Unicode | Base65536 | 56% | 64% | 50% |
| Full Unicode | Base131072 (prototype) | 53%+ | 53%+ | 53% |
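The efficiency figures are simply payload bits divided by encoded bits. The sketch below reproduces a few rows under the simplifying assumption that every character in an encoding's alphabet costs the same number of bytes; the function name `efficiency` is made up, and the sample characters are stand-ins chosen for their encoded width (U+4E00, for instance, is just a representative three-byte BMP character, not necessarily a member of the Base32768 alphabet).

```python
def efficiency(payload_bits: float, sample_char: str, codec: str) -> float:
    """Payload bits carried per bit of encoded output, for one representative character."""
    encoded_bits = 8 * len(sample_char.encode(codec))
    return payload_bits / encoded_bits


# (name, payload bits per character, representative character)
ROWS = [
    ("Hexadecimal", 4, "f"),
    ("Base64", 6, "A"),
    ("HexagramEncode", 6, "\u4dc0"),
    ("BrailleEncode", 8, "\u2800"),
    ("Base32768", 15, "\u4e00"),  # stand-in BMP character, see lead-in
]

if __name__ == "__main__":
    for name, bits, ch in ROWS:
        cells = [
            f"{efficiency(bits, ch, codec):.1%}"
            for codec in ("utf-8", "utf-16-le", "utf-32-le")
        ]
        print(name, *cells)
```

Encodings whose alphabets mix characters of different encoded widths, such as Base65536 under UTF-8 and UTF-16, need an average over the whole alphabet rather than a single sample, so they are left out here.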
If the output text is UTF-8-encoded, existing ASCII-based encodings remain the best choice.
If the output text is UTF-16-encoded, a stalwart encoding such as Base64 is so poor that using I Ching hexagrams has equivalent efficiency and using Braille — which has the added bonus of letting you see the bits — is strictly better. However, the state of the art is Base32768, which demolishes Base64 and the others, offering 94% (to be exact, 15/16ths) efficiency.
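To illustrate the Braille idea, here is a minimal sketch that maps each byte to one of the 256 Braille patterns in the block U+2800 to U+28FF. It uses a direct offset (byte value added to U+2800), which is an assumption made for simplicity; the actual BrailleEncode may assign bits to dots in a different order. The function names are hypothetical.

```python
BRAILLE_BASE = 0x2800  # U+2800..U+28FF holds the 256 Braille patterns


def braille_encode(data: bytes) -> str:
    """Map each byte to one Braille pattern character (8 payload bits per character)."""
    return "".join(chr(BRAILLE_BASE + byte) for byte in data)


def braille_decode(text: str) -> bytes:
    """Invert braille_encode, rejecting anything outside the Braille block."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - BRAILLE_BASE
        if not 0 <= offset <= 0xFF:
            raise ValueError(f"not a Braille pattern: {ch!r}")
        out.append(offset)
    return bytes(out)


if __name__ == "__main__":
    message = b"hello"
    encoded = braille_encode(message)
    print(encoded, len(encoded.encode("utf-16-le")), "bytes as UTF-16")
    assert braille_decode(encoded) == message
```

Each output character carries 8 bits and occupies 16 bits in UTF-16, which is where the 50% figure comes from.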
If the output is UTF-32-encoded, or if we simply want to count the number of characters in the output, then we can do a little better still with Base65536. Unfortunately UTF-32 is very inefficient: 11 of those 32 bits are never used, most of the million or so remaining code points have not been assigned and many of the assigned code points are unsafe. There's not really much scope for improvement here.
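A quick back-of-the-envelope calculation shows how little headroom there is. The sketch below computes the UTF-32 efficiency of an alphabet of a given size, and the ceiling that would apply if every Unicode scalar value (all code points except the surrogates) were assigned and safe, which is a deliberately unrealistic best case. The function name `utf32_efficiency` is made up for this example.

```python
import math

CODE_POINTS = 0x110000           # U+0000..U+10FFFF
SURROGATES = 0xE000 - 0xD800     # U+D800..U+DFFF cannot appear in valid UTF-32
SCALAR_VALUES = CODE_POINTS - SURROGATES


def utf32_efficiency(alphabet_size: int) -> float:
    """Payload bits per 32-bit code unit for an alphabet of the given size."""
    return math.log2(alphabet_size) / 32


if __name__ == "__main__":
    print("Base65536:", utf32_efficiency(2 ** 16))
    print("Base131072:", utf32_efficiency(2 ** 17))
    print("every scalar value usable:", utf32_efficiency(SCALAR_VALUES))
```

Even the unattainable best case lands below 63%, so the gap between Base65536's 50% and any possible successor is narrow.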
The question of which characters are unsafe merits a longer, dedicated discussion. However, it should be noted that Base85 uses several characters which potentially cannot be considered safe.