What makes a Unicode code point safe?

Base64 is used to encode arbitrary binary data as "plain" text using a small, extremely safe repertoire of 64 (well, 65) characters. It remains highly suitable for text systems where the range of available characters is very small (i.e., anything still constrained to plain ASCII), and it makes just about the best possible use of that small range.

However, now that Unicode rules the world, the range of characters available to us is often significantly larger. This increases our expressive power and in many situations it increases the amount of data which can be encoded in a Unicode string. This led to the creation of:

  • Base32768, optimised for UTF-16 (93.75% efficiency vs. 37.5% for Base64),
  • Base65536, optimised for UTF-32 (50% efficiency vs. 18.75% for Base64) and old-school 140-character Twitter (280 bytes per Tweet vs. 105 bytes per Tweet for Base64), and
  • Base2048, optimised for new-style 280-character Twitter (385 bytes per Tweet vs. 210 bytes per Tweet for Base64).
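The figures in this list follow from simple arithmetic on the number of payload bits each character carries. Here is a minimal sketch of that arithmetic; the function names are ours, and it assumes that every encoded character occupies exactly one UTF-16 code unit, one UTF-32 code unit or one counted Tweet character as appropriate, and it ignores Base64 padding.

```python
# Payload bits carried by one character of each encoding (log2 of the alphabet size).
PAYLOAD_BITS = {"Base64": 6, "Base2048": 11, "Base32768": 15, "Base65536": 16}


def efficiency(encoding, storage_bits):
    """Percentage of stored bits that are payload.

    storage_bits is the size of one encoded character as stored:
    16 for one UTF-16 code unit, 32 for one UTF-32 code unit.
    """
    return 100 * PAYLOAD_BITS[encoding] / storage_bits


def bytes_per_tweet(encoding, chars_per_tweet):
    """Whole payload bytes that fit into a Tweet of the given character length."""
    return PAYLOAD_BITS[encoding] * chars_per_tweet // 8


print(efficiency("Base32768", 16), efficiency("Base64", 16))            # 93.75 37.5
print(efficiency("Base65536", 32), efficiency("Base64", 32))            # 50.0 18.75
print(bytes_per_tweet("Base65536", 140), bytes_per_tweet("Base64", 140))  # 280 105
print(bytes_per_tweet("Base2048", 280), bytes_per_tweet("Base64", 280))   # 385 210
```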

Choosing characters to use for these encodings might at first sound like a simple task. Naïvely, Base65536 could even be a one-liner! But in fact there are many things which can make a Unicode character unsuitable for this purpose, and we need to be very careful.
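To make the danger concrete, here is one hypothetical naive mapping (not the real Base65536 algorithm): treat every pair of bytes as a 16-bit number and emit the code point with that number. The function name is ours. The sketch shows that such a mapping immediately produces control characters, whitespace, combining marks, surrogates and non-characters.

```python
import unicodedata


def naive_base65536(data: bytes) -> str:
    """Map each 2-byte pair straight to the code point with that number."""
    if len(data) % 2:
        raise ValueError("this naive sketch only handles an even number of bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big")) for i in range(0, len(data), 2)
    )


# Perfectly ordinary input bytes land on thoroughly unsuitable code points.
for pair in (b"\x00\x00", b"\x00\x20", b"\x03\x01", b"\xd8\x00", b"\xff\xff"):
    ch = naive_base65536(pair)
    # Prints the General Category: Cc, Zs, Mn, Cs and Cn respectively.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```

The lone surrogate produced from the bytes D8 00 cannot even be written out as UTF-8: calling `.encode("utf-8")` on that string raises `UnicodeEncodeError`.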

So:

What makes a Unicode character safe to use when encoding data?

The more we know about Unicode, the more complicated this question becomes.

Perhaps the best way to start to answer this question is to list the characters which would be considered unsafe, and the reasons why.

  • No unassigned (A.K.A. "reserved") code points. If an unassigned code point is ever assigned, it could turn out to have unpredictable, potentially undesirable characteristics (see below). As of Unicode 10.0, this constrains us to 276,271 code points from the full 1,114,112-code point range.

  • No non-characters, as these are reserved for internal use by Unicode.

  • No private-use characters, as these may have undesirable characteristics assigned to them by their private users.

  • No surrogates. These are intended to be used in pairs to encode non-BMP Unicode characters in UTF-16; using them as part of our encoding would probably involve using them individually, potentially raising issues if our encoded string is sent as UTF-16 to a recipient which is expecting something well-formed, or if our encoded string makes use of the actual non-BMP Unicode characters themselves.

  • No format characters. This includes zero-width spaces, soft hyphens and bidirectionality controls. These are frequently unprintable.

  • No control characters. This includes nulls, bell characters and other weird unprintable ASCII characters like U+007F DELETE. In general, anything unprintable is to be avoided.

  • We will also be avoiding control characters like tabs, carriage returns and line feeds, as well as separator characters such as spaces; in general we will avoid all whitespace characters. Whitespace may be eliminated or corrupted when the text is passed through, for example, an XML document. Also, a person trying to select that text may accidentally miss the whitespace, particularly if the whitespace is leading or trailing. Plus, it's desirable to be able to break the text up, e.g. wrapping with a line break every 80 characters, without corrupting the text's meaning. So, ideally, we should be able to ignore whitespace in the text when decoding it.

  • No punctuation characters, including hyphens, opening and closing bracket characters and initial and final quotes. This will mean that our encoded Unicode string can be safely put inside brackets or quotes if need be, without needing to be escaped, without causing ambiguity or inadvertently terminating the quoted or bracketed string.

  • No combining characters, including diacritics. These are hazardous if our encoding allows a combining character to appear first in the text. It's simpler to discard them altogether.
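The whitespace rule in the list above has a practical payoff: if no encoded character is whitespace, a decoder can discard all whitespace before doing anything else. A minimal sketch of that idea, with hypothetical helper names and an arbitrary stand-in for the encoded text:

```python
def wrap(encoded: str, width: int = 80) -> str:
    """Insert a line feed after every `width` characters."""
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


def strip_whitespace(text: str) -> str:
    """Remove every run of whitespace, wherever it appears.

    str.split() with no arguments splits on any Unicode whitespace,
    including leading and trailing runs.
    """
    return "".join(text.split())


encoded = "abc" * 70                      # stand-in for 210 encoded characters
wrapped = wrap(encoded)                   # three lines: 80, 80 and 50 characters
assert strip_whitespace(wrapped) == encoded
assert strip_whitespace("  " + wrapped + "\r\n\t") == encoded
```

This only works because the encoding's own repertoire contains no whitespace; otherwise stripping would destroy data.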

Note that the above constraints rule out several entire General Categories of Unicode characters: "Mark", "Punctuation", "Separator" and "Other". This leaves the General Categories "Symbol", "Number" and "Letter".
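As a rough sketch, the General Category filter can be expressed with Python's standard `unicodedata` module, whose `category()` function returns a two-letter code beginning with the major category letter. The function name below is ours, and the results reflect whichever Unicode version the running Python ships with, so newly assigned characters may be classified differently on different installations.

```python
import unicodedata

# Major General Categories that survive the constraints: Letter, Number, Symbol.
SAFE_MAJOR_CATEGORIES = {"L", "N", "S"}


def in_safe_category(ch: str) -> bool:
    """True if the character's General Category is Letter, Number or Symbol.

    Unassigned code points, non-characters, private-use characters, surrogates,
    format and control characters all report a category starting with "C"
    ("Other"), so they are rejected along with Mark, Punctuation and Separator.
    """
    return unicodedata.category(ch)[0] in SAFE_MAJOR_CATEGORIES


for ch in ("A", "5", "+", " ", "-", "\u0301", "\u00ad", "\x00", "\ue000", "\ud800"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), in_safe_category(ch))
```

Passing this filter is necessary but not sufficient: it says nothing yet about normalization.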

There's one other constraint, which is that characters must survive all forms of normalization.

Normalization

This final point is the most difficult to satisfy. Unicode has four "normalization forms": NFD, NFC, NFKD and NFKC. Applying any of these four normalization processes to a Unicode string can cause the sequence of code points to alter, which for our purposes constitutes data corruption. We would like our encoded data to survive all forms of normalization.
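A short demonstration of the problem, using Python's standard `unicodedata.normalize()`. The helper names are ours. Each of the first three sample characters is a single code point which at least one normalization form rewrites into different code points.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalization_report(ch: str) -> dict:
    """Show the code points each normalization form produces for `ch`."""
    return {
        form: [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, ch)]
        for form in FORMS
    }


def survives_all_forms(ch: str) -> bool:
    """True if `ch`, on its own, is left unchanged by all four forms."""
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)


samples = {
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",  # NFD splits it into e + U+0301
    "ANGSTROM SIGN": "\u212b",                    # NFC replaces it with U+00C5
    "LATIN SMALL LIGATURE FI": "\ufb01",          # NFKC/NFKD expand it to "fi"
    "LATIN CAPITAL LETTER A": "A",                # unchanged by every form
}
for name, ch in samples.items():
    print(name, survives_all_forms(ch), normalization_report(ch))
```

Note that checking a character in isolation, as `survives_all_forms` does, is only a first approximation: some characters are unchanged on their own but can still combine with a neighbour.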

Unicode Standard Annex #15, UNICODE NORMALIZATION FORMS, gives more information about this, including the following incredibly valuable facts:

What makes a code point stable?

Within the Unicode standard, every single code point has a large number of properties associated with it. Information about these properties is found in the Unicode Character Database (documentation). The machine-readable data itself is here.

One of these properties is Canonical_Combining_Class (documentation), which describes how, if at all, the character combines with other characters. The majority of characters have a default canonical combining class of Not_Reordered (0).

Four other properties, NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check (data), are the "Quick Check" properties for each of the Normalization Forms. A value of "Yes" indicates that the character is unchanged by that Normalization Form.

As we see here, a code point is considered stable under a Normalization Form if it has a canonical combining class of 0 and a Quick Check value of "Yes" for that form. So all we need to do is parse this data and analyse it to get a full list of the safe code points.
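Here is a minimal sketch of that parse-and-analyse step. It rests on several assumptions which should be checked against the real data: that the Quick Check file uses semicolon-separated `range ; property ; value` lines with `#` comments, that it uses the short property names NFD_QC, NFC_QC, NFKD_QC and NFKC_QC, and that it lists only the non-"Yes" values ("N" and "M"), with "Yes" as the default. The sample lines are a tiny illustrative excerpt rather than the full file, the function names are ours, and the canonical combining class is taken from Python's `unicodedata.combining()` instead of being parsed.

```python
import unicodedata

QC_PROPERTIES = ("NFD_QC", "NFC_QC", "NFKD_QC", "NFKC_QC")

# Illustrative excerpt only (assumed layout); a real run needs the complete file.
SAMPLE = """\
00C0..00C5    ; NFD_QC; N   # A WITH GRAVE .. A WITH RING ABOVE
0340..0341    ; NFC_QC; N   # combining grave/acute tone marks
1161..1175    ; NFC_QC; M   # Hangul vowel jamo
FB00..FB06    ; NFKC_QC; N  # Latin ligatures
"""


def parse_non_yes(lines):
    """Return {property: set of code points whose Quick Check value is not Yes}."""
    non_yes = {prop: set() for prop in QC_PROPERTIES}
    for line in lines:
        # Drop the trailing comment, then split into fields.
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) != 3 or fields[1] not in non_yes or fields[2] == "Y":
            continue  # blank line, comment, or a property we are not interested in
        first, _, last = fields[0].partition("..")
        non_yes[fields[1]].update(range(int(first, 16), int(last or first, 16) + 1))
    return non_yes


def is_stable(code_point, non_yes):
    """ccc of 0 and Quick Check "Yes" under all four normalization forms."""
    if unicodedata.combining(chr(code_point)) != 0:
        return False
    return not any(code_point in excluded for excluded in non_yes.values())


non_yes = parse_non_yes(SAMPLE.splitlines())
for cp in (0x0041, 0x00C5, 0x0301, 0x1161, 0xFB01):
    print(f"U+{cp:04X}", is_stable(cp, non_yes))
```

U+1161 is the interesting case: it has a canonical combining class of 0 and is unchanged when normalized alone, but its NFC Quick Check value is "Maybe" because it can compose with a preceding character, so it is rejected. With only the excerpt loaded, any code point not listed will look stable, so the output is meaningful only for the sample characters.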

What don't we care about?

  • Visible space taken up by the data on the screen. Judicious use of Zalgo-esque diacritics could serve to decrease the physical space the text takes up on the screen, to the extent that an arbitrary amount of data could be crammed into a single character. However, this comes at the expense of code point length, due to the relative scarcity of combining diacritics. It would also make the encoding more complex, and very difficult to harden against normalization.

    • One approach could be to have a single "X" with a Base1-encoded number of combining acute accents above it. E.g. X with 1,217 accents expresses the 2-byte sequence [0x03 0xc0].

  • Humans trying to write the data out by hand on paper, then input the data again. If we restricted ourselves to only the characters which would survive a round trip through someone's handwriting, even Base64 would need to be cut down severely due to the visual similarities between, for example, "l", "L" and "1"; "n", "u" and "r"; and "o", "O" and "0". For an example of an encoding designed with this constraint in mind, see Base32.

  • Byte length in any particular encoding. This doesn't affect the "safeness" of any particular code point, although it does constrain which code points we examine.
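The 1,217-accent figure in the Base1 example above can be reproduced with a little arithmetic. The sketch below assumes a numbering in which byte strings are ordered first by length and then by value, so that the empty string maps to 0, the 256 one-byte strings map to 1 through 256, and so on. That numbering is our assumption, chosen because it matches the quoted figure; the function name is ours too.

```python
def base1_length(data: bytes) -> int:
    """Number of repeated characters needed to express `data` in unary."""
    # Count every byte string strictly shorter than `data`:
    # 1 empty string + 256 one-byte strings + 256**2 two-byte strings + ...
    shorter = sum(256 ** n for n in range(len(data)))
    # Then add the position of `data` among strings of its own length.
    return shorter + int.from_bytes(data, "big")


print(base1_length(bytes([0x03, 0xC0])))  # 1 + 256 + 960 = 1217
```

The output length grows exponentially with the input length, which is why this approach trades screen space for an enormous number of code points.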

Code

Here is a small JavaScript project, safe-code-point, which you can use to determine whether a code point is safe.

Numbers

Number of code points    Unicode version
                         8.0        9.0        10.0
Total                    1,114,112 (2^16 + 2^20) in every version
Assigned                 260,253    267,753    276,271
Safe                     101,064    108,397    116,813
Safe (Letter)            94,126     101,301    109,628
Safe (Letter, other)     92,240     99,264     107,590

Conclusion

The set of safe code points is gradually expanding with each fresh version of the Unicode standard. As we've shown, and as we'd expect from a well-specified standard, this doesn't have detrimental effects on our existing encodings, and may even eventually enable new, more efficient ones.

Other than this, determining whether a Unicode code point is safe for use in a data encoding is essentially a solved problem at this point.


Discussion (7)

2017-11-10 15:35:38 by Sam:

(*First post!*)


Thank you for doing this work. I don't need it at the moment, but knowing that you have done it, means that if I do need it, I can just refer to what you've already done.
Please post more. We miss you.

2017-11-10 15:37:25 by qntrn:

PS What happened to not letting people post as Sam?

2017-11-10 16:01:02 by qntm:

Only I get the blue highlighting on my name.

2017-11-13 13:55:18 by qոtm:

I can't help but think that we're overcomplicating this. Why can't humanity find a way to transfer binary data safely?

2017-11-14 19:12:57 by Daniel:

We can and do safely transfer binary data.
Usually the encoding is just to put it in a file and let the existing protocols wrap and transmit it.
BaseX encoding is a workaround for channels that don't transmit arbitrary bit sequences, usually for security or technical reasons.
A BaseX virus can't eat your system until someone decodes and executes it.
BaseX datastreams won't confuse the channel. ex: ASCII 0000 0100 is End of Transmission. If a file contains that sequence, it could tell the receiver it's done, when it isn't.

2017-11-15 03:10:10 by hus:

Can you provide a txt file with all safe unicode characters?

2017-11-19 09:43:28 by drubber:

hard to make enough titanium punch cards, fake-qntm