Efficiently encoding binary data in Unicode

Something which leaps out at me when reading Wikipedia's article on binary-to-text encodings is that apparently "plain text" is synonymous with "ASCII", or more specifically ASCII minus the control characters. This is admittedly a fairly safe definition of plain text, although there's a strong argument to be made that there is no such thing.

But there's another argument that plain text is Unicode strings. An increasing number of text-based systems accept Unicode. This gives us a vastly expanded character set to play with.

This also changes our definition of "efficiency" — in fact it makes it a little more complicated, since there are several ways of encoding Unicode strings in turn. UTF-8-encoded Unicode strings are a superset of ASCII, and here we find that encodings staying inside ASCII are generally the most efficient, bit-for-bit.

But UTF-16-encoded Unicode strings are surprisingly prevalent (Windows, Java) and here we find that constraining ourselves only to ASCII is a bad move. Instead, it's better to use as much of the Basic Multilingual Plane as possible.

In UTF-32 the situation becomes even more extreme. UTF-32 is relatively rarely used because of its overall wastefulness: 11 out of every 32 bits are guaranteed to be zeroes. But when encoding binary data as UTF-32, efficiency is essentially a matter of counting code points. Which is how Twitter measures Tweet length...

Of course there are pitfalls to using arbitrary Unicode characters in your encoding, just as there are when using arbitrary ASCII characters in your encoding. Many characters are unsafe for this purpose (see below). But once we recognise those dangerous areas of the Unicode space and know how to avoid them, we're still left with many code points and many, many possibilities.

So anyway, here are some new encodings I've come up with recently and how they stack up against existing ones. Some of these are jokes; one or two, I believe, have some legitimate value.

Comparison

Efficiency ratings are averaged over long inputs. Higher is better.

Encoding class      Encoding         Implementation            Efficiency
                                                              UTF-8   UTF-16   UTF-32
ASCII-constrained   Unary            base1                       0%       0%       0%
                    Binary           everywhere                 13%       6%       3%
                    Hexadecimal      everywhere                 50%      25%      13%
                    Base64           everywhere                 75%      38%      19%
                    Base85           everywhere                 80%      40%      20%
BMP-constrained     HexagramEncode   hexagram-encode            25%      38%      19%
                    BrailleEncode    braille-encode             33%      50%      25%
                    Base32768        base32768                  63%      94%      47%
Full Unicode        Base65536        base65536                  56%      64%      50%
                    Base131072       base131072 (prototype)    53%+     53%+      53%
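
Most of these figures fall out of simple per-character arithmetic: the payload bits carried by each output character, divided by that character's encoded size under each encoding form. Here's a minimal sketch in Python; the sample characters are representative picks, not the encodings' actual repertoires, and note that Base65536's UTF-8 and UTF-16 figures are averages over a mix of character sizes, so they don't reduce to a single sample character.

    def efficiency(payload_bits, sample_char):
        """Payload bits per output character, divided by that character's
        encoded size in bits under each Unicode encoding form."""
        return {
            "UTF-8": payload_bits / (8 * len(sample_char.encode("utf-8"))),
            "UTF-16": payload_bits / (8 * len(sample_char.encode("utf-16-le"))),
            "UTF-32": payload_bits / 32,
        }

    print(efficiency(6, "A"))        # Base64:         75%   / 37.5%  / 18.75%
    print(efficiency(6, "\u4DC0"))   # HexagramEncode: 25%   / 37.5%  / 18.75%
    print(efficiency(8, "\u2800"))   # BrailleEncode:  33.3% / 50%    / 25%
    print(efficiency(15, "\u4E00"))  # Base32768:      62.5% / 93.75% / 46.875%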

Observations

If the output text is UTF-8-encoded, existing ASCII-based encodings remain the best choice.

If the output text is UTF-16-encoded, a stalwart encoding such as Base64 is so poor that using I Ching hexagrams has equivalent efficiency and using Braille — which has the added bonus of letting you see the bits — is strictly better. However, the state of the art is Base32768, which demolishes Base64 and the others, offering 94% (to be exact, 15/16ths) efficiency.

If the output is UTF-32-encoded, or if we simply want to count the number of characters in the output, then we can do a little better still with Base65536. Unfortunately UTF-32 is very inefficient: 11 of those 32 bits are never used, most of the million or so remaining code points have not been assigned and many of the assigned code points are unsafe. There's not really much scope for improvement here.

What makes a code point safe?

Safe Unicode code points are considered to be those which:

  • Are assigned. Unassigned code points have unpredictable properties.
  • Fall into the Letter, Number or Symbol General Categories. No Separators (whitespace), no Punctuation, no Marks (including combining diacritics), no Other (including control characters, private use characters and surrogates). (Note that Base85 uses several characters falling into these dangerous categories.)
  • Are stable under all four forms of Unicode normalization (NFC, NFD, NFKC and NFKD).

Which is to say: strings made up of these code points can be passed through most text-handling systems without being altered.
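
As an illustration, here is a rough sketch of that test in Python, using the standard library's unicodedata module. Its character data tracks whichever Unicode version your Python build ships with, so exact tallies will drift from the table below.

    import unicodedata

    def is_safe(code_point):
        """Rough version of the safety test above: assigned, in a Letter,
        Number or Symbol General Category, and unchanged by all four
        normalization forms."""
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category == "Cn":           # unassigned
            return False
        if category[0] not in "LNS":   # excludes Separators, Punctuation, Marks, Other
            return False
        return all(unicodedata.normalize(form, ch) == ch
                   for form in ("NFC", "NFD", "NFKC", "NFKD"))

    # e.g. tally the safe code points across the whole Unicode space:
    # print(sum(is_safe(cp) for cp in range(0x110000)))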

Code point range       Code points   Unicode 8.0            Unicode 9.0
                                     Assigned     Safe      Assigned     Safe
U+0000 to U+007F               128        128       71          128        71
U+0080 to U+07FF             1,920      1,856    1,086        1,856     1,086
U+0800 to U+FFFF            63,488     61,645   37,116       61,701    37,151
U+10000 to U+10FFFF      1,048,576    196,624   62,791      204,068    70,089
Total                    1,114,112    260,253  101,064      267,753   108,397


Discussion (12)

2016-04-04 22:31:25 by qntm:

Unrelatedly, I really need to alter my CSS, it's plainly not suited to this kind of article. The tables!

2016-04-05 01:11:58 by skztr:

When I saw Tom Scott reference this in one of his recent videos, my first thought was "are things slipping by me that aren't getting published to qntm.org?"

My wife, appropriately, made fun of my fanboyism.

2016-04-05 19:44:03 by Bago:

base1 is actually a good choice from a low-level perspective. Because a file's content is encoded in its length, you can read a file with 'stat' rather than having to resort to slow, antiquated functions such as 'fopen'. Efficiency!

2016-04-08 03:43:53 by davidgro:

Bago: Good point, on a system with sparse file support (and perhaps changing it to use 0x00 bytes instead of "A"s) that might actually be implementable as a bizarre form of steganography

What I don't understand is how that system encodes multiple bytes - how does the value of 0x01 0x02 differ from 0x02 0x01? If it's just place-value base 256, then what about leading 0 bytes?

2016-04-08 07:15:52 by Daniel H:

I understand why you avoid diacritics, control characters, whitespace, and surrogate pairs, but why do you avoid punctuation? I’m not even sure you need to avoid diacritics if you also follow the normalization rule, although that alone might take out most of them. It seems like you could get a more efficient (though non–power-of-two) UTF-32 encoding which was still Twitter-safe if you relaxed some of those rules.

I had the same experience as skztr. Although mentioning Tom Scott in relation to this makes me wonder: what would be the efficiency of an emoji-only encoding?

2016-04-08 11:09:52 by qntm:

> What I don't understand is how that system encodes multiple bytes - how does the value of 0x01 0x02 differ from 0x02 0x01? If it's just place-value base 256, then what about leading 0 bytes?

It's not just place-value base 256; imagine if all possible buffers were placed in order of length and then in order of content, then indexed.

* An empty buffer becomes "".
* 0x00 becomes "A".
* 0x01 becomes "AA".
* 0xFF becomes "A" repeated 256 times.
* 0x00 0x00 becomes "A" repeated 257 times.
* 0x01 0x02 becomes "A" repeated 515 times.
* 0x02 0x01 becomes "A" repeated 770 times.

And so on.
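
To make the arithmetic explicit: an n-byte buffer with big-endian value v comes after all (256^n - 1)/255 strictly shorter buffers, so its output length is that count plus v. A quick illustrative sketch in Python (not the actual base1 library):

    def base1_length(buf: bytes) -> int:
        """Index of `buf` when all buffers are sorted by length, then by
        content: the number of strictly shorter buffers, plus the
        buffer's value as a big-endian integer."""
        shorter = (256 ** len(buf) - 1) // 255   # 1 + 256 + 256**2 + ...
        return shorter + int.from_bytes(buf, "big")

    assert base1_length(b"") == 0
    assert base1_length(b"\x00") == 1
    assert base1_length(b"\xff") == 256
    assert base1_length(b"\x01\x02") == 515
    assert base1_length(b"\x02\x01") == 770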

2016-04-08 11:28:38 by qntm:

> I understand why you avoid diacritics, control characters, whitespace, and surrogate pairs, but why do you avoid punctuation?

Mainly because I think it's desirable to be able to place a Base65536 (or whatever) string in quotes, or in general use a Base65536 string in normal text, without making it ambiguous whether the quotes are part of the string. For comparison, valid Base85 strings can contain double and single quotes.

> I’m not even sure you need to avoid diacritics if you also follow the normalization rule, although that alone might take out most of them.

Probably correct, but I also want to avoid diacritics because of the problems that arise if the output text begins with a combining diacritic. I'd rather leave this in as an explicit constraint, even if it means checking twice, than rely on every diacritic being altered by normalization, which, even if it's true for now, could just be a coincidence.

> It seems like you could get a more efficient (though non–power-of-two) UTF-32 encoding which was still Twitter-safe if you relaxed some of those rules.

At this point we run into rapidly diminishing returns. For a 17-bit encoding, "Base131072" (an increase in efficiency from 50% to 53%), we need to unlock another 65,536 code points somehow. It might become possible in future versions of Unicode...

2016-04-09 08:59:41 by Andrew:

UTF-16 is the problem. Going past 21 bits would require adding more surrogate-like codepoints, somewhere in the astral planes since the BMP doesn't have room for it, and it would be inefficient and violate a bunch of assumptions. UTF-8 and UTF-32 were originally capable of much more, and re-extending them would break very little, but UTF-16 is one big "oops, my bad" on top of UCS-2.

It's kind of funny, but the reason you see it adopted in Windows and Java is because both of them were so eager to get Unicode that they implemented it before the standard was really fully baked. If they'd waited just a little longer, UTF-16 probably would never have caught on; we would use UTF-8, UTF-32, and maybe SCSU, and JavaScript wouldn't be so confused about what the length of strings actually is.

2016-04-10 16:16:45 by Daniel H:

> At this point we hit a point of rapidly diminishing returns. For a 17-bit encoding, "Base131072", (an increase in efficiency from 50% to 53%), we need to unlock another 65,536 code points somehow. It might be possible in future versions of Unicode...

Yeah, I saw that on the GitHub page. That’s what I meant by non–power-of-two: it could have a number of characters which is not an integral power of two, and thus it would not have an integer number of bits per code point. Just like base85 has 6.4 bits per code point, you could use base92240 (or base101064, at the cost of XML safety) and get almost 16.5 (or over 16.6) bits per code point.
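
For what it's worth, those figures check out with Python's math.log2:

    from math import log2
    print(log2(85))      # 6.409...  bits per character
    print(log2(92240))   # 16.493...
    print(log2(101064))  # 16.625...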

Also, Python3 (at least) distinguishes between base85 and ascii85; the latter is what you discuss here while the former uses a different character set. It’s still not safe for XML, but it avoids backslash and quotes. I think Git uses (what Python calls) base85 in some places, but I’m not sure.

2016-05-23 16:25:15 by KimikoMuffin:

Now I'm kind of interested in implementing Base32768 in, say, C#. I'm only 99.99% sure I fully grok the way it works, though; does it just arrange the thirty-two 1024-character blocks in code-point order, and map the 15-bit value to them?

2016-06-01 14:36:22 by qntm:

It's 1024 blocks of 32 characters. For each 15-bit number, the first 10 bits identify the block and the final 5 bits give the offset into the block.
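
For illustration, a minimal sketch of that mapping in Python. The block table below is a placeholder chosen for simplicity, not the real base32768 repertoire, which is a hand-picked list of 1024 safe blocks:

    # Placeholder table: 1024 contiguous 32-character blocks starting at
    # U+5000. The real base32768 library uses a curated list of safe
    # BMP blocks instead.
    BLOCK_STARTS = [0x5000 + 32 * i for i in range(1024)]

    def encode15(value):
        """Map one 15-bit value to a single BMP character."""
        block = value >> 5          # first 10 bits: which block
        offset = value & 0b11111    # final 5 bits: offset into the block
        return chr(BLOCK_STARTS[block] + offset)

    DECODE = {encode15(v): v for v in range(2 ** 15)}
    assert all(DECODE[encode15(v)] == v for v in range(2 ** 15))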

2016-07-10 03:04:45 by saxbophone:

Interesting ideas - I love the detail you have gone into! Base-131072! :D

This is an area of great interest to me; at the moment I'm dabbling away at creating a library to encode arbitrary base-to-base, à la binary-to-text, but y'know, not limited to binary and text ;)