Efficiently encoding binary data in Unicode

tl;dr: if you wish to efficiently encode binary data as Unicode text, the best choice depends on how that text will itself be serialised: stick with Base64 or Base85 if the output will be UTF-8-encoded, use Base32768 if it will be UTF-16-encoded, and use Base65536 if what matters is the raw code point count (UTF-32, or Twitter).

So anyway

Something which leaps out at me when reading Wikipedia's article on binary-to-text encodings is that apparently "plain text" is synonymous with "ASCII", or more specifically ASCII minus the control characters. This is admittedly a fairly safe definition of plain text, although there's a strong argument to be made that there is no such thing.

But there's another argument that plain text is Unicode strings. An increasing number of text-based systems accept Unicode. This gives us a vastly expanded character set to play with.

This also changes our definition of "efficiency" — in fact it makes it a little more complicated, since there are several ways of encoding Unicode strings in turn. UTF-8-encoded Unicode strings are a superset of ASCII, and here we find that encodings staying inside ASCII are generally the most efficient, bit-for-bit.

But UTF-16-encoded Unicode strings are surprisingly prevalent (Windows, Java) and here we find that constraining ourselves only to ASCII is a bad move. Instead, it's better to use as much of the Basic Multilingual Plane as possible.

In UTF-32 the situation becomes even more extreme. UTF-32 is relatively rarely used because of its overall wastefulness: 11 out of every 32 bits are guaranteed to be zeroes. But when encoding binary data as UTF-32, efficiency is essentially a matter of counting code points. Which is how Twitter measures Tweet length...
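
To make these efficiency figures concrete, here is a short Python sketch (illustrative only, not taken from any of the libraries discussed) which measures Base64's efficiency under each of the three encodings: payload bits divided by the bits the encoded text occupies.

    import base64
    import os

    def efficiency(payload: bytes, text: str, text_encoding: str) -> float:
        """Bits of payload carried per bit of encoded output text."""
        return (8 * len(payload)) / (8 * len(text.encode(text_encoding)))

    payload = os.urandom(3000)                    # long input, so padding is negligible
    text = base64.b64encode(payload).decode("ascii")

    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, round(efficiency(payload, text, enc), 3))

    # Prints roughly 0.75, 0.375 and 0.188: each Base64 character carries
    # 6 bits of payload but costs 8, 16 or 32 bits on the wire.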

Of course there are pitfalls to using arbitrary Unicode characters in your encoding, just as there are when using arbitrary ASCII characters in your encoding. Many characters are unsafe for this purpose (see below). But once we recognise those dangerous areas of the Unicode space and know how to avoid them, we're still left with many code points and many, many possibilities.

So anyway, here are some new encodings I've come up with recently and how they stack up against existing ones. Some of these are jokes; one or two I believe to have some legitimate value.

Comparison

Efficiency ratings are averaged over long inputs. Higher is better.

                    Encoding                   Efficiency
                                               UTF-8    UTF-16   UTF-32
ASCII-constrained   Unary / Base1                 0%       0%       0%
                    Binary                       13%       6%       3%
                    Hexadecimal                  50%      25%      13%
                    Base64                       75%      38%      19%
                    Base85                       80%      40%      20%
BMP-constrained     HexagramEncode               25%      38%      19%
                    BrailleEncode                33%      50%      25%
                    Base32768                    63%      94%      47%
Full Unicode        Ecoji                        31%      31%      31%
                    Base65536                    56%      64%      50%
                    Base131072 (prototype)       53%+     53%+     53%

Observations

If the output text is UTF-8-encoded, existing ASCII-based encodings remain the best choice.

If the output text is UTF-16-encoded, a stalwart encoding such as Base64 is so poor that using I Ching hexagrams has equivalent efficiency and using Braille — which has the added bonus of letting you see the bits — is strictly better. However, the state of the art is Base32768, which demolishes Base64 and the others, offering 94% (to be exact, 15/16ths) efficiency.

If the output is UTF-32-encoded, or if we simply want to count the number of characters in the output, then we can do a little better still with Base65536. Unfortunately UTF-32 is very inefficient: 11 of those 32 bits are never used, most of the million or so remaining code points have not been assigned and many of the assigned code points are unsafe. There's not really much scope for improvement here.

What makes a code point safe?

This merits a longer, dedicated discussion! However, it should be noted that Base85 uses several characters which potentially cannot be considered safe.

Discussion (24)

2016-04-04 23:31:25 by qntm:

Unrelatedly, I really need to alter my CSS, it's plainly not suited to this kind of article. The tables!

2016-04-05 02:11:58 by skztr:

When I saw Tom Scott reference this in one of his recent videos, my first thought was "are things slipping by me that aren't getting published to qntm.org?" My wife, appropriately, made fun of my fanboyism.

2016-04-05 20:44:03 by Bago:

base1 is actually a good choice from a low-level perspective. Because a file's content is encoded in its length, you can read a file with 'stat' rather than having to resort to slow, antiquated functions such as 'fopen'. Efficiency!

2016-04-08 04:43:53 by davidgro:

Bago: Good point, on a system with sparse file support (and perhaps changing it to use 0x00 bytes instead of "A"s) that might actually be implementable as a bizarre form of steganography. What I don't understand is how that system encodes multiple bytes - how does the value of 0x01 0x02 differ from 0x02 0x01? If it's just place-value base 256, then what about leading 0 bytes?

2016-04-08 08:15:52 by Daniel H:

I understand why you avoid diacritics, control characters, whitespace, and surrogate pairs, but why do you avoid punctuation? I’m not even sure you need to avoid diacritics if you also follow the normalization rule, although that alone might take out most of them. It seems like you could get a more efficient (though non–power-of-two) UTF-32 encoding which was still Twitter-safe if you relaxed some of those rules. I had the same experience as skztr. Although mentioning Tom Scott in relation to this makes me wonder: what would be the efficiency of an emoji-only encoding?

2016-04-08 12:09:52 by qntm:

> What I don't understand is how that system encodes multiple bytes - how does the value of 0x01 0x02 differ from 0x02 0x01? If it's just place-value base 256, then what about leading 0 bytes?

It's not just place-value base 256; imagine if all possible buffers were placed in order of length and then in order of content, then indexed.

* An empty buffer becomes "".
* 0x00 becomes "A".
* 0x01 becomes "AA".
* 0xFF becomes "A" repeated 256 times.
* 0x00 0x00 becomes "A" repeated 257 times.
* 0x01 0x02 becomes "A" repeated 515 times.
* 0x02 0x01 becomes "A" repeated 770 times.

And so on.
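
Concretely, the number of "A"s is the count of all strictly shorter buffers plus the buffer's own big-endian value. A minimal Python sketch of that indexing, purely for illustration (the real Base1 implementation may differ in details):

    def base1_length(buf: bytes) -> int:
        """Count of all strictly shorter buffers, plus the big-endian
        place value of this buffer among buffers of its own length."""
        shorter = sum(256 ** k for k in range(len(buf)))   # 0, 1, 257, 65793, ...
        value = int.from_bytes(buf, "big") if buf else 0
        return shorter + value

    def base1_encode(buf: bytes) -> str:
        return "A" * base1_length(buf)

    assert base1_length(b"") == 0
    assert base1_length(b"\x00") == 1
    assert base1_length(b"\xff") == 256
    assert base1_length(b"\x01\x02") == 515
    assert base1_length(b"\x02\x01") == 770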

2016-04-08 12:28:38 by qntm:

> I understand why you avoid diacritics, control characters, whitespace, and surrogate pairs, but why do you avoid punctuation?

Mainly because I think it's desirable to be able to place a Base65536 (or whatever) string in quotes, or in general use a Base65536 string in normal text, without making it ambiguous whether the quotes are part of the string. For comparison, valid Base85 strings can contain double and single quotes.

> I'm not even sure you need to avoid diacritics if you also follow the normalization rule, although that alone might take out most of them.

Probably correct, but I also want to avoid diacritics because of the problems which arise if the output text begins with a diacritic. I'd rather leave this in as an explicit constraint, even if it means checking twice, than rely on all diacritics being subject to normalization alterations, which, even if it is true for now, could just be a coincidence.

> It seems like you could get a more efficient (though non–power-of-two) UTF-32 encoding which was still Twitter-safe if you relaxed some of those rules.

At this point we hit a point of rapidly diminishing returns. For a 17-bit encoding, "Base131072" (an increase in efficiency from 50% to 53%), we need to unlock another 65,536 code points somehow. It might be possible in future versions of Unicode...

2016-04-09 09:59:41 by Andrew:

UTF-16 is the problem. Going past 21 bits would require adding more surrogate-like codepoints, somewhere in the astral planes since the BMP doesn't have room for it, and it would be inefficient and violate a bunch of assumptions. UTF-8 and UTF-32 were originally capable of much more, and re-extending them would break very little, but UTF-16 is one big "oops, my bad" on top of UCS-2. It's kind of funny, but the reason you see it adopted in Windows and Java is because both of them were so eager to get Unicode that they implemented it before the standard was really fully baked. If they'd waited just a little longer, UTF-16 probably would never have caught on; we would use UTF-8, UTF-32, and maybe SCSU, and JavaScript wouldn't be so confused about what the length of strings actually is.

2016-04-10 17:16:45 by Daniel H:

> At this point we hit a point of rapidly diminishing returns. For a 17-bit encoding, "Base131072" (an increase in efficiency from 50% to 53%), we need to unlock another 65,536 code points somehow. It might be possible in future versions of Unicode...

Yeah, I saw that on the GitHub page. That's what I meant by non–power-of-two: it could have a number of characters which is not an integral power of two, and thus it would not have an integer number of bits per code point. Just like base85 has 6.4 bits per code point, you could use base92240 (or base101064, at the cost of XML safety) and get almost 16.5 (or over 16.6) bits per code point.

Also, Python3 (at least) distinguishes between base85 and ascii85; the latter is what you discuss here while the former uses a different character set. It's still not safe for XML, but it avoids backslash and quotes. I think Git uses (what Python calls) base85 in some places, but I'm not sure.

2016-05-23 17:25:15 by KimikoMuffin:

Now I'm kind of interested in implementing Base32768 in, say, C#. I'm only 99.99% sure I fully grok the way it works, though; does it just arrange the thirty-two 1024-character blocks in code-point order, and map the 15-bit value to them?

2016-06-01 15:36:22 by qntm:

It's 1024 blocks of 32 characters. For each 15-bit number, the first 10 bits identify the block and the final 5 bits give the offset into the block.
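
A sketch of that mapping in Python, with a placeholder block table for illustration (the real Base32768 repertoire is a specific hand-picked set of 1024 safe 32-character runs, and edge cases such as trailing partial groups are not shown here):

    # Placeholder sketch only: BLOCK_STARTS below is NOT the real Base32768
    # repertoire, just an arbitrary run of BMP code points showing the shape
    # of the mapping (1024 blocks of 32 characters each).
    BLOCK_STARTS = [0x4E00 + 32 * i for i in range(1024)]

    def encode_15_bits(n: int) -> str:
        """Map one 15-bit number to one character: the first 10 bits pick
        the block, the final 5 bits pick the offset within it."""
        assert 0 <= n < 2 ** 15
        block, offset = n >> 5, n & 0x1F
        return chr(BLOCK_STARTS[block] + offset)

    def decode_char(c: str) -> int:
        code = ord(c)
        for block, start in enumerate(BLOCK_STARTS):
            if start <= code < start + 32:
                return (block << 5) | (code - start)
        raise ValueError("not a character from this sketch's repertoire")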

2016-07-10 04:04:45 by saxbophone:

Interesting ideas - I love the detail you have gone into! Base-131072! :D This is an area of great interest to me, at the moment I'm currently dabbling away creating a library to encode arbitrary base-to-base a la binary to text style, but y'know, not limited to binary and text ;)

2018-05-08 22:45:23 by Erics:

The efficiency you talk about is space efficiency. But what about time (computation) efficiency? From other discussion I gather that increasing the N in baseN increases the computation cost significantly. However I don't know how utf-8 vs 16 vs 32 might interact with this. Also in contexts where such encoding is needed there is also often the ability to use compression, such as gzip, which may devalue space efficiency concerns in favor of time efficiency concerns. Base64 compresses quite well, I understand. So I think your analysis is incomplete without data for time efficiency (encoding and decoding) and compression.

2018-05-08 23:12:49 by qntm:

Barring lousy implementations, computation time is linear in the size of the input. Generally for every few bytes of input we are performing one constant-time lookup to generate a few bytes of output. Encoding data in any of the main three Unicode Transformation Formats involves only extremely fast bit shifting. But I'm not really sure how gzip figures into this. If you can send binary, why not just send the binary?

2019-02-26 10:21:49 by ???:

base131072 http://unicode.org/versions/Unicode10.0.0/ does it work now?

2019-02-26 12:02:14 by qntm:

We're actually up to Unicode 11 now but the answer is still "No". The set of available safe characters is only growing very slowly. It might be 20+ years before Base131072 becomes possible, and 20 more before it can be done in an efficient way using large blocks of ~256 characters, instead of a huge lookup table.

2019-04-06 11:16:45 by muvlon:

> If the output text is UTF-8-encoded, existing ASCII-based encodings remain the best choice.

I've thought about this for a while, and have realized that it is not strictly true. It is true if you restrict yourself to encodings that emit a fixed number of code points for every unit of input, as your encodings do. That restriction also means that one can only lose by using non-BMP code points in an encoding targeting UTF-16, as those need surrogate pairs to encode.

From an information-theoretic point of view, UTF-8 has strictly more entropy than ASCII, and there must be ways to exploit this for encoding purposes. However, it means we must venture into the domain of "unequal letter cost coding". I believe a practical encoding optimized for UTF-8 can be constructed which uses only the ASCII characters found in base64 or base85 plus sufficiently "safe" non-ASCII characters, and which beats base64 or base85 for efficiency.

2019-04-06 12:43:24 by qntm:

I think what you're talking about is compressing the binary before encoding it, which is outside of the scope of this essay. If not, I would have to see a worked example of what you're talking about. I'm not sure what you mean by UTF-8 having strictly more entropy than ASCII.

2019-04-07 10:50:59 by muvlon:

No, I'm not talking about compression. I think my terminology is just a bit off. I'm still talking about encoding arbitrary bytes without making any assumptions about their statistical properties.

By "entropy", I mean that there are more n-byte sequences that are valid UTF-8 than those that are valid ASCII, as UTF-8 is a strict superset of ASCII. Therefore, UTF-8 text is strictly "less predictable" than ASCII text and can carry more information. The same is still true if you restrict yourself to a certain subset of "safe" characters, as long as it includes any non-ASCII characters.

There are 64^4 = 2^24 = 16777216 4-byte sequences that consist only of base64 ASCII, which means a 4-byte chunk of base64-encoded data has an entropy of H = 24 Sh. However, there are 42897201 4-byte sequences that consist only of base64 ASCII characters and UTF-8 characters from the "letter", "number" or "symbol, other" categories, more than twice as many. The theoretical information content is H = 25.354 Sh, or an efficiency of 25.354/32 = 79.2%.

A (silly, impractical) encoding could look like this: enumerate all those 4-byte sequences and store them in a table. Read input 25 bits at a time and use that 25-bit number as an index into that table. Output that 32-bit sequence. This yields 78.125% efficiency while being no more hazardous to UTF-8-based software than base64. I think I can come up with a more practical one though. If I do, I'll post a link to the implementation here.
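
For what it's worth, the arithmetic here checks out if the 42897201 count is taken as given; a quick Python check:

    import math

    # 4-byte chunks drawn only from the 64 Base64 characters: 24 bits each.
    print(math.log2(64 ** 4))           # 24.0
    # The quoted count of usable 4-byte UTF-8 sequences, taken as given:
    print(math.log2(42897201))          # ~25.354 Sh per 4 bytes of output
    print(math.log2(42897201) / 32)     # ~0.792, the theoretical ceiling
    print(25 / 32)                      # 0.78125 for the whole-bit table scheme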

2019-04-07 15:35:14 by qntm:

Ah, I see what you're saying about entropy. Yes, my original statement is incorrect, you can express slightly more data that way. According to my notes about how many code points are "safe" in the ranges we're talking about, the figures to consider are as follows:

* 1-byte: 7 bits free, 2^6 combinations safe to use
* 2-byte: 5 + 6 = 11 bits apparently free, 2^10 combinations safe to use
* 3-byte: 4 + 6 + 6 = 16 bits apparently free, 2^15 combinations safe to use
* 4-byte: 3 + 6 + 6 + 6 = 21 bits apparently free, 2^16 combinations safe to use

It looks like the gains in efficiency might be marginal, but you're right, they're definitely there.

2019-04-07 15:47:10 by qntm:

Yep, you're 100% right. 25 bits per 4 bytes of UTF-8 output is definitely possible. 78.125% efficient. Versus 24 bits per 4 bytes with Base64, which is only 75% efficient. Marginal but potentially worth doing!

2019-04-07 20:17:36 by qntm:

From some numerical investigation with a spreadsheet it looks to me like 25 bits per 4 bytes of UTF-8 output is the limit. It's definitely not going to be possible to get 26 bits. To be more specific, with my current assumptions about the number of safe UTF-8 characters of each size, efficiency tops out at around 79.3%.
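
One way to cross-check that ceiling numerically (a Python sketch whose only input is the per-size safe-character counts listed in the comment above; those counts are the assumption here): count how many "safe" strings fit into exactly n bytes of UTF-8 and take the asymptotic bits per byte.

    import math

    # Safe characters per UTF-8 encoded size, from the counts above:
    # 2^6 one-byte, 2^10 two-byte, 2^15 three-byte, 2^16 four-byte.
    COUNTS = {1: 2 ** 6, 2: 2 ** 10, 3: 2 ** 15, 4: 2 ** 16}

    def num_safe_strings(n_bytes: int) -> int:
        """How many 'safe' strings occupy exactly n_bytes bytes of UTF-8."""
        f = [1] + [0] * n_bytes             # f[0] = 1: the empty string
        for i in range(1, n_bytes + 1):
            f[i] = sum(COUNTS[k] * f[i - k] for k in COUNTS if k <= i)
        return f[n_bytes]

    n = 400
    print(math.log2(num_safe_strings(n)) / (8 * n))   # converges to roughly 0.79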

2021-09-18 18:21:34 by Unicode:

Unicode 14, yay ^^ Maybe base131072

2022-07-27 03:36:50 by Dannii:

Base32768 looks ideal for storing data in localStorage, so thanks!
