Base64 is used to encode arbitrary binary data as "plain" text using a small, extremely safe repertoire of 64 (well, 65) characters. Base64 remains highly suitable for text systems where the range of characters available is very small — i.e., anything still constrained to plain ASCII. When the resulting text is output as UTF-8, Base64 encodes 3 bytes of data per 4 bytes of output, for an efficiency rating of 75%, which is about as good as it gets.
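For example, the 3-bytes-in, 4-characters-out ratio is easy to see directly (a quick sketch using Node.js's built-in Buffer, not part of any of the encodings discussed here):

```javascript
// Node.js: every 3 bytes of input become exactly 4 Base64 characters,
// each of which occupies 1 byte when the text is output as UTF-8.
const bytes = Buffer.from([0x03, 0xc0, 0xff]); // 3 arbitrary bytes
const encoded = bytes.toString('base64');

console.log(encoded);        // "A8D/"
console.log(encoded.length); // 4 — so 3 bytes of data per 4 bytes of UTF-8
```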
However, now that Unicode rules the world, the range of characters available to us is often significantly larger. This increases our expressive power and in many situations it increases the amount of data which can be encoded in a Unicode string. This led to the creation of more efficient encodings such as Base65536.
Choosing characters to use for these encodings might at first sound like a simple task. Naïvely, Base65536 could even be a one-liner! But in fact there are many things which can make a Unicode character unsuitable for this purpose, and we need to be very careful.
So: which Unicode characters are safe to use for this purpose?

The more we know about Unicode, the more complicated this question becomes. Perhaps the best way to start to answer it is to list the characters which would be considered unsafe, and the reasons why.
No unassigned (A.K.A. "reserved") code points. Unassigned code points have unpredictable, potentially undesirable characteristics (see below) if they ever do get assigned. As of Unicode 10.0, this constrains us to 276,271 code points from the full 1,114,112-code point range.
No non-characters, as these are reserved for internal use by Unicode.
No private-use characters, as these may have undesirable characteristics assigned to them by their private users.
No surrogates. These are intended to be used in pairs to encode non-BMP Unicode characters in UTF-16; using them as part of our encoding would probably involve using them individually, potentially raising issues if our encoded string is sent as UTF-16 to a recipient which is expecting something well-formed, or if our encoded string makes use of the actual non-BMP Unicode characters themselves.
No format characters. This includes zero-width spaces, soft hyphens and bidirectionality controls. These are frequently unprintable.
No control characters. This includes nulls, bell characters and other weird unprintable ASCII characters like U+007F DELETE. In general, anything unprintable is to be avoided.
No whitespace characters. This includes control characters like tabs, carriage returns and line feeds, as well as separator characters such as spaces. Whitespace may be eliminated or corrupted when the text is passed through, for example, an XML document. Also, a person trying to select that text may accidentally miss the whitespace, particularly if the whitespace is leading or trailing. Plus, it's desirable to be able to break the text up, e.g. wrapping with a line break every 80 characters, without corrupting the text's meaning. So, ideally, we should be able to ignore whitespace in the text when decoding it.
No punctuation characters, including hyphens, opening and closing bracket characters and initial and final quotes. This will mean that our encoded Unicode string can be safely put inside brackets or quotes if need be, without needing to be escaped, without causing ambiguity or inadvertently terminating the quoted or bracketed string.
No combining characters, including diacritics. These are hazardous if our encoding allows a combining character to appear first in the text. It's simpler to discard them altogether.
Note that the above constraints rule out several entire General Categories of Unicode characters: "Mark", "Punctuation", "Separator" and "Other". This leaves the General Categories "Symbol", "Number" and "Letter".
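As a first-pass sketch of this General Category filter (an illustration of mine, not code from any of the encodings mentioned above), JavaScript's Unicode property escapes can test a code point's category directly. Everything in "Mark", "Punctuation", "Separator" and "Other" — which covers controls, format characters, whitespace, surrogates, private-use characters, non-characters and unassigned code points — gets rejected:

```javascript
// Accept only General Categories Letter (L), Number (N) and Symbol (S).
// Requires Node.js 10+ or any engine with Unicode property escapes.
const allowedCategory = /^[\p{L}\p{N}\p{S}]$/u;

function passesCategoryFilter(codePoint) {
  return allowedCategory.test(String.fromCodePoint(codePoint));
}

console.log(passesCategoryFilter(0x0041)); // true  — "A" is a Letter
console.log(passesCategoryFilter(0x3042)); // true  — "あ" is a Letter
console.log(passesCategoryFilter(0x0020)); // false — space is a Separator
console.log(passesCategoryFilter(0x200B)); // false — zero-width space is Format
console.log(passesCategoryFilter(0x0301)); // false — combining acute is a Mark
```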
There's one other constraint, which is that characters must survive all forms of normalization.
This final point is the most difficult to satisfy. Unicode has four "normalization forms": NFD, NFC, NFKD and NFKC. Applying any of these four normalization processes to a Unicode string can cause the sequence of code points to alter, which for our purposes constitutes data corruption. We would like our encoded data to survive all forms of normalization.
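To make the "corruption" concrete, here is a small demonstration using JavaScript's String.prototype.normalize, which implements all four forms:

```javascript
// U+FB01 LATIN SMALL LIGATURE FI survives the canonical forms but is
// rewritten by the compatibility forms: one code point silently becomes two.
const original = '\uFB01'; // "ﬁ"

console.log(original.normalize('NFC') === original); // true  — unchanged
console.log(original.normalize('NFKC'));             // "fi"
console.log(original.normalize('NFKC').length);      // 2     — corrupted

// Some characters don't even survive the canonical forms:
// U+212B ANGSTROM SIGN becomes U+00C5 under NFC.
console.log('\u212B'.normalize('NFC') === '\u00C5'); // true
```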
Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS gives more information about this, including the following incredibly valuable facts:
A string normalized under one version of Unicode remains normalized under future versions, provided it uses no unassigned code points. So if we get this right once, we don't need to worry about future changes to Unicode making it wrong again.
Normalization Forms are not closed under string concatenation. If more text is put at the beginning or the end of our text, normalizing the combined string can alter code points at the boundary, corrupting the data (see the demonstration below). However, Base64 has this same issue. As long as the text is protected by delimiters/brackets/whitespace, it should be fine.
Substrings of normalized strings are still normalized, which means a "safe" text can be broken into several smaller texts without risk.
Many code points are stable with respect to a particular Normalization Form.
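Here's a quick illustration of the concatenation point. Each piece is individually NFC-normalized, but the concatenation is not (this uses a combining mark, which is exactly the kind of thing that might end up adjacent to our text in the wild):

```javascript
const a = 'e';      // U+0065 — normalized under NFC on its own
const b = '\u0301'; // U+0301 COMBINING ACUTE ACCENT — also normalized on its own

const joined = a + b;
console.log(joined.length);                  // 2
console.log(joined.normalize('NFC'));        // "é" (U+00E9)
console.log(joined.normalize('NFC').length); // 1 — the boundary code points changed

// Substrings, by contrast, are safe: slicing a normalized string never
// un-normalizes it, so a "safe" text can be split into smaller pieces freely.
```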
Within the Unicode standard, every single code point has a large number of properties associated with it. Information about these properties is found in the Unicode Character Database (documentation). The machine-readable data itself is here.
One of these properties is Canonical_Combining_Class (documentation), which explains how, if at all, the character combines with other characters. The majority of characters have a default canonical combining class of Not_Reordered (0).
Four other properties, NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check (data), are the "Quick Check" properties for each of the Normalization Forms. A value of "Yes" indicates that the character is unchanged by that Normalization Form.
As we see here, a code point is considered stable under a Normalization Form if it has a canonical combining class of 0 and a Quick Check value of "Yes". So all we need to do is parse this data and analyse it to get a full list of the safe code points.
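Here is a rough, self-contained sketch of that analysis. One big caveat: instead of parsing the UCD's Canonical_Combining_Class and Quick_Check data, this just checks that each code point is unchanged in isolation by all four forms, which is a weaker test (a Quick_Check value of "Maybe", or a non-zero combining class, can leave a character unchanged in isolation yet still altered in context). The safe-code-point project linked below does the full job against the real data files.

```javascript
// Approximate "safe code point" filter:
//   1. General Category must be Letter, Number or Symbol.
//   2. The code point, in isolation, must be unchanged by NFD/NFC/NFKD/NFKC.
// This approximates — but is not identical to — the CCC = 0 plus
// Quick_Check = "Yes" criterion described above.
const allowedCategory = /^[\p{L}\p{N}\p{S}]$/u;
const forms = ['NFD', 'NFC', 'NFKD', 'NFKC'];

function looksSafe(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  if (!allowedCategory.test(ch)) return false;
  return forms.every(form => ch.normalize(form) === ch);
}

// Count the apparently-safe code points for whatever Unicode version your
// JavaScript engine implements (compare with the table below).
let count = 0;
for (let codePoint = 0; codePoint < 0x110000; codePoint++) {
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue; // skip surrogates
  if (looksSafe(codePoint)) count++;
}
console.log(count);
```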
There are some other considerations which this notion of "safety" doesn't cover. One is the visible space taken up by the data on the screen. Judicious use of Zalgo-esque diacritics could serve to decrease the physical space the text takes up on the screen, to the extent that an arbitrary amount of data could be crammed into a single character. However, this comes at the expense of code point length, due to the relative scarcity of combining diacritics. It would also make the encoding more complex, and very difficult to harden against normalization.
One approach could be to have a single "X" with a Base1-encoded number of combining acute accents above it. E.g. X with 1,217 accents expresses the 2-byte sequence [0x03 0xc0] (see the short calculation below).
Humans trying to write the data out by hand on paper, then input the data again. Restricting ourselves only to the characters which would survive a round trip through someone's handwriting, even Base64 would need to be cut down severely due to the visual similarities between, for example, "l", "L" and "1", "n", "u" and "r" and "o", "O" and "0". For an example of an encoding designed with this constraint in mind, see Base32.
Byte length in any particular encoding. This doesn't affect the "safeness" of any particular code point, although it does constrain which code points we examine.
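Returning to that Base1 aside for a moment, here's where the figure of 1,217 comes from — a short sketch assuming Base1's convention of numbering byte sequences in order of length (the empty sequence first, then all one-byte sequences, then all two-byte sequences, and so on):

```javascript
// Map a byte sequence to its Base1 repeat count: count every strictly
// shorter sequence, then add the sequence's own value as a big-endian number.
// (A real implementation would use BigInt for long inputs; this is a sketch.)
function base1Length(bytes) {
  let offset = 0;
  for (let len = 0; len < bytes.length; len++) {
    offset += 256 ** len; // 1 empty sequence, 256 one-byte sequences, ...
  }
  let value = 0;
  for (const byte of bytes) {
    value = value * 256 + byte;
  }
  return offset + value;
}

console.log(base1Length([0x03, 0xc0])); // 257 + 960 = 1217
```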
Here is a small JavaScript project, safe-code-point, which you can use to determine whether a code point is safe.
Unicode version | Assigned code points | Safe code points | Safe (Letter) | Safe (Letter, other)
---|---|---|---|---
4.1.0 | 237,236 | 79,607 | 76,107 | 75,000 |
5.0.0 | 238,605 | 80,895 | 77,200 | 76,038 |
5.1.0 | 240,229 | 82,246 | 78,155 | 76,762 |
5.2.0 | 246,877 | 88,537 | 84,231 | 82,816 |
6.0.0 | 248,965 | 90,522 | 85,206 | 83,772 |
6.1.0 | 249,697 | 90,927 | 85,554 | 84,096 |
6.2.0 | 249,698 | 90,928 | 85,554 | 84,096 |
6.3.0 | 249,703 | 90,924 | 85,554 | 84,096 |
7.0.0 | 252,537 | 93,510 | 87,260 | 85,658 |
8.0.0 | 260,253 | 101,064 | 94,126 | 92,240 |
9.0.0 | 267,753 | 108,397 | 101,301 | 99,264 |
10.0.0 | 276,271 | 116,813 | 109,628 | 107,590 |
11.0.0 | 276,955 | 117,422 | 109,954 | 107,755 |
12.0.0 | 277,509 | 117,927 | 110,178 | 107,957 |
12.1.0 | 277,510 | 117,927 | 110,178 | 107,957 |
13.0.0 | 283,440 | 123,813 | 115,775 | 113,547 |
14.0.0 | 284,278 | 124,456 | 116,231 | 113,876 |
15.0.0 | 288,767 | 128,811 | 120,517 | 118,155 |
The set of safe code points gradually expands with each new version of the Unicode standard. As we saw above, and as we'd expect from a well-specified standard, this has no detrimental effect on our existing encodings, and may even eventually enable new, more efficient ones.

Other than that, determining whether a Unicode code point is safe for use in a data encoding is, at this point, essentially a solved problem.