What makes a Unicode code point safe?

Base64 is used to encode arbitrary binary data as "plain" text using a small, extremely safe repertoire of 64 (well, 65) characters. Base64 remains highly suitable for text systems where the range of characters available is very small — i.e. anything still constrained to plain ASCII. When the resulting text is output as UTF-8, Base64 encodes 3 bytes of data per 4 bytes of output, for an efficiency of 75%, which is about as good as it gets.
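To make that ratio concrete, here is a minimal Node.js sketch (the input bytes are arbitrary):

```javascript
// Base64 turns every 3 input bytes into 4 output ASCII characters.
const input = Buffer.from([0x4d, 0x61, 0x6e]); // the bytes of "Man"
const encoded = input.toString('base64');

console.log(encoded);                        // "TWFu": 4 characters
console.log(input.length / encoded.length);  // 0.75, i.e. 75% efficiency
```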

However, now that Unicode rules the world, the range of characters available to us is often significantly larger. This increases our expressive power and in many situations it increases the amount of data which can be encoded in a Unicode string. This led to the creation of:

  • Base32768, optimised for UTF-16 (93.75% efficiency vs. 37.5% for Base64),
  • Base65536, optimised for UTF-32 (50% efficiency vs. 18.75% for Base64), and
  • Base2048, optimised for Twitter (385 bytes per Tweet vs. 210 bytes per Tweet for Base64).
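The efficiency figures above are straightforward arithmetic: payload bits per output character, divided by the bits that character occupies in the target text encoding. A sketch of the calculation (Base2048's per-Tweet figures depend on Twitter's weighted character limit, so they are omitted here):

```javascript
// Efficiency = payload bits carried by one output character, divided by
// the bits that character occupies in the target text encoding.
const efficiency = (payloadBits, unitBits) => payloadBits / unitBits;

// Base32768 packs 15 bits into each 16-bit UTF-16 code unit,
// versus Base64's 6 bits per code unit:
console.log(efficiency(15, 16)); // 0.9375
console.log(efficiency(6, 16));  // 0.375

// Base65536 packs 16 bits into one code point, i.e. one 32-bit
// UTF-32 code unit, versus Base64:
console.log(efficiency(16, 32)); // 0.5
console.log(efficiency(6, 32));  // 0.1875
```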

Choosing characters to use for these encodings might at first sound like a simple task. Naïvely, Base65536 could even be a one-liner! But in fact there are many things which can make a Unicode character unsuitable for this purpose, and we need to be very careful.

So:

What makes a Unicode character safe to use when encoding data?

The more we know about Unicode, the more complicated this question becomes.

Perhaps the best way to start to answer this question is to list the characters which would be considered unsafe, and the reasons why.

  • No unassigned (A.K.A. "reserved") code points. Unassigned code points have unpredictable, potentially undesirable characteristics (see below) if they ever do get assigned. As of Unicode 10.0, this constrains us to 276,271 code points from the full 1,114,112-code point range.

  • No non-characters, as these are reserved for internal use by Unicode.

  • No private-use characters, as these may have undesirable characteristics assigned to them by their private users.

  • No surrogates. These are intended to be used in pairs to encode non-BMP Unicode characters in UTF-16; using them as part of our encoding would probably involve using them individually, potentially raising issues if our encoded string is sent as UTF-16 to a recipient which is expecting something well-formed, or if our encoded string makes use of the actual non-BMP Unicode characters themselves.

  • No format characters. This includes zero-width spaces, soft hyphens and bidirectionality controls. These are frequently unprintable.

  • No control characters. This includes nulls, bell characters and other weird unprintable ASCII characters like U+007F DELETE. In general, anything unprintable is to be avoided.

  • We will also be avoiding control characters like tabs, carriage returns and line feeds, as well as separator characters such as spaces; in general we will avoid all whitespace characters. Whitespace may be eliminated or corrupted when the text is passed through, for example, an XML document. Also, a person trying to select that text may accidentally miss the whitespace, particularly if the whitespace is leading or trailing. Plus, it's desirable to be able to break the text up, e.g. wrapping with a line break every 80 characters, without corrupting the text's meaning. So, ideally, we should be able to ignore whitespace in the text when decoding it.

  • No punctuation characters, including hyphens, opening and closing bracket characters and initial and final quotes. This will mean that our encoded Unicode string can be safely put inside brackets or quotes if need be, without needing to be escaped, without causing ambiguity or inadvertently terminating the quoted or bracketed string.

  • No combining characters, including diacritics. These are hazardous if our encoding allows a combining character to appear first in the text. It's simpler to discard them altogether.
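The whitespace point above boils down to one decoder behaviour: discard all whitespace before decoding, so that wrapped or re-indented text still round-trips. Sketched here with ordinary Base64 in Node.js, since the principle is the same for any such encoding:

```javascript
// A forgiving decoder: strip every whitespace character, then decode
// whatever remains.
function decodeForgiving(text) {
  return Buffer.from(text.replace(/\s+/g, ''), 'base64');
}

// The same payload, line-wrapped and indented, still decodes identically:
const wrapped = 'SGVsbG8s\n  IHdvcmxk\nIQ==';
console.log(decodeForgiving(wrapped).toString('utf8')); // "Hello, world!"
```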

Note that the above constraints rule out several entire General Categories of Unicode characters: "Mark", "Punctuation", "Separator" and "Other". This leaves the General Categories "Symbol", "Number" and "Letter".
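In modern JavaScript, Unicode property escapes make the General Category test a one-liner, which serves as a first-pass filter (the normalization constraint is a separate, stricter check):

```javascript
// First-pass filter: keep only the Letter, Number and Symbol
// General Categories, per the constraints above.
const allowedCategory = /^[\p{L}\p{N}\p{S}]$/u;

console.log(allowedCategory.test('A'));      // true  — Letter
console.log(allowedCategory.test('√'));      // true  — Symbol
console.log(allowedCategory.test('-'));      // false — Punctuation
console.log(allowedCategory.test(' '));      // false — Separator
console.log(allowedCategory.test('\u0301')); // false — Mark (combining acute)
```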

There's one other constraint, which is that characters must survive all forms of normalization.

Normalization

This final point is the most difficult to satisfy. Unicode has four "normalization forms": NFD, NFC, NFKD and NFKC. Applying any of these four normalization processes to a Unicode string can cause the sequence of code points to alter, which for our purposes constitutes data corruption. We would like our encoded data to survive all forms of normalization.
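Two standard examples of code points failing this requirement, checkable in any modern JavaScript engine:

```javascript
// U+2126 OHM SIGN has a singleton canonical decomposition to
// U+03A9 GREEK CAPITAL LETTER OMEGA, so it survives no normalization form:
console.log('\u2126'.normalize('NFC'));              // "Ω" (U+03A9)
console.log('\u2126' === '\u2126'.normalize('NFC')); // false

// U+FB01 LATIN SMALL LIGATURE FI survives NFC/NFD but not the
// compatibility forms NFKC/NFKD, which expand it to two code points:
console.log('\uFB01'.normalize('NFC'));  // "ﬁ" — unchanged
console.log('\uFB01'.normalize('NFKC')); // "fi" — corrupted
```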

Unicode Standard Annex #15, Unicode Normalization Forms, gives more information about this, including the stability criterion we rely on below.

What makes a code point stable?

Within the Unicode standard, every single code point has a large number of properties associated with it. These properties are documented, and published in machine-readable form, in the Unicode Character Database.

One of these properties is Canonical_Combining_Class, which describes how, if at all, the character combines with other characters. The majority of characters have the default canonical combining class of Not_Reordered (0).

Four other properties, NFD_Quick_Check, NFKD_Quick_Check, NFC_Quick_Check and NFKC_Quick_Check, are the "Quick Check" properties for each of the Normalization Forms. A value of "Yes" indicates that the character is unchanged by that Normalization Form.

As UAX #15 explains, a code point is considered stable under a Normalization Form if it has a canonical combining class of 0 and a Quick Check value of "Yes". So all we need to do is parse this data and analyse it to get a full list of the safe code points.
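Short of parsing the UCD, JavaScript's String.prototype.normalize gives a rough runtime approximation for the engine's own Unicode version. Note that this tests a code point in isolation only; the Canonical_Combining_Class condition additionally guards against reordering next to other combining characters, which this sketch cannot see:

```javascript
// Approximate check: does this code point, in isolation, survive all
// four normalization forms? (The full criterion also requires
// Canonical_Combining_Class = 0.)
function survivesNormalization(codePoint) {
  const s = String.fromCodePoint(codePoint);
  return ['NFD', 'NFC', 'NFKD', 'NFKC']
    .every(form => s.normalize(form) === s);
}

console.log(survivesNormalization(0x0041)); // true  — "A"
console.log(survivesNormalization(0x2126)); // false — OHM SIGN decomposes
console.log(survivesNormalization(0x00C5)); // false — "Å" decomposes under NFD
```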

What don't we care about?

  • Visible space taken up by the data on the screen. Judicious use of Zalgo-esque diacritics could serve to decrease the physical space the text takes up on the screen, to the extent that an arbitrary amount of data could be crammed into a single character. However, this comes at the expense of code point length, due to the relative scarcity of combining diacritics. It would also make the encoding more complex, and very difficult to harden against normalization.

    • One approach could be to have a single "X" with a Base1-encoded number of combining acute accents above it. E.g. X with 1,217 accents expresses the 2-byte sequence [0x03 0xc0].

  • Humans trying to write the data out by hand on paper, then input the data again. Restricting ourselves only to the characters which would survive a round trip through someone's handwriting, even Base64 would need to be cut down severely due to the visual similarities between, for example, "l", "L" and "1"; "n", "u" and "r"; and "o", "O" and "0". For an example of an encoding designed with this constraint in mind, see Base32.

  • Byte length in any particular encoding. This doesn't affect the "safeness" of any particular code point, although it does constrain which code points we examine.
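For the curious, the 1,217 figure in the Base1 aside above comes from bijective base-256: every byte sequence, of any length, maps to a distinct repetition count. A minimal sketch (not the actual Base1 library):

```javascript
// Bijective base-256: byte sequences of different lengths map to
// distinct non-negative integers ([] → 0, [0x00] → 1, ..., [0xff] → 256,
// then all 2-byte sequences, and so on).
function base1Count(bytes) {
  let n = 0n;
  for (const b of bytes) {
    n = n * 256n + BigInt(b) + 1n;
  }
  return n;
}

console.log(base1Count([0x03, 0xc0])); // 1217n — matching the example above
```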

Code

Here is a small JavaScript project, safe-code-point, which you can use to determine whether a code point is safe.

Numbers

Unicode version  Assigned  Safe     Safe (Letter)  Safe (Letter, other)
4.1.0 237,236 79,607 76,107 75,000
5.0.0 238,605 80,895 77,200 76,038
5.1.0 240,229 82,246 78,155 76,762
5.2.0 246,877 88,537 84,231 82,816
6.0.0 248,965 90,522 85,206 83,772
6.1.0 249,697 90,927 85,554 84,096
6.2.0 249,698 90,928 85,554 84,096
6.3.0 249,703 90,924 85,554 84,096
7.0.0 252,537 93,510 87,260 85,658
8.0.0 260,253 101,064 94,126 92,240
9.0.0 267,753 108,397 101,301 99,264
10.0.0 276,271 116,813 109,628 107,590
11.0.0 276,955 117,422 109,954 107,755
12.0.0 277,509 117,927 110,178 107,957
12.1.0 277,510 117,927 110,178 107,957
13.0.0 283,440 123,813 115,775 113,547

Conclusion

The set of safe code points is gradually expanding with each fresh version of the Unicode standard. As we've shown, and as we'd expect from a well-specified standard, this doesn't have detrimental effects on our existing encodings, and may even eventually enable new, more efficient ones.

Other than this, determining whether a Unicode code point is safe for use in a data encoding is essentially a solved problem at this point.

Discussion (18)

2017-11-10 15:35:38 by Sam:

(*First post!*) Thank you for doing this work. I don't need it at the moment, but knowing that you have done it means that if I do need it, I can just refer to what you've already done. Please post more. We miss you.

2017-11-10 15:37:25 by qntrn:

PS What happened to not letting people post as Sam?

2017-11-10 16:01:02 by qntm:

Only I get the blue highlighting on my name.

2017-11-13 13:55:18 by qոtm:

I can't help but think that we're overcomplicating this. Why can't humanity find a way to transfer binary data safely?

2017-11-14 19:12:57 by Daniel:

We can and do safely transfer binary data. Usually the encoding is just to put it in a file and let the existing protocols wrap and transmit it. BaseX encoding is a workaround for channels that don't transmit arbitrary bit sequences, usually for security or technical reasons. A BaseX virus can't eat your system until someone decodes and executes it. BaseX datastreams won't confuse the channel. ex: ASCII 0000 0100 is End of Transmission. If a file contains that sequence, it could tell the receiver it's done, when it isn't.

2017-11-15 03:10:10 by hus:

Can you provide a txt file with all safe unicode characters?

2017-11-19 09:43:28 by drubber:

hard to make enough titanium punch cards, fake-qntm

2018-01-07 09:54:28 by Samech:

All this verbiage. Means this is a rudimentary programming spec. Some annotated python code and a link to github, surely.

2018-04-16 12:24:21 by Sam:

Really good post

2018-06-06 02:41:47 by David:

Aww. Unicode 11 only adds 684 characters according to the blog post.

2018-06-20 09:17:39 by Simon:

@samech youre right, but it seems as though the author intended it to be exactly that.

2020-03-21 19:15:50 by Joshua:

So, what if we didn't worry about "safe" unicode characters? In the context of a format that is unicode text-based, but not intended for direct-reading, as long as they're outside the restricted sections (such as the high/low surrogate ranges), they should be safe to transmit in most streams, what if there's just a few, like control characters we need to avoid? Even unassigned ones, they should still be able to be stored in strings and files and streams, no? How efficient an encoding could we get then?

2020-09-06 17:57:53 by Unicode:

Could you add Unicode 13?

2020-09-06 21:18:22 by Unicode:

And how hard would it be to add "all" versions, or all versions back to 3.0? 3.0 is the first major release where nothing got removed

2020-09-06 21:22:45 by Unicode:

It would be nice to find out what is the lowest release that has enough safe chars for some base^^

2020-09-06 21:29:25 by qntm:

Those things can probably be done as long as the Unicode Consortium is making the metadata available in the same consistent format. I'll look into it.

2020-09-07 19:42:03 by Unicode:

Thx^^

2020-09-09 05:02:28 by Unicode:

throw new Error('Code point has no East_Asian_Width specified: ' + value) should be throw new Error('Code point has no East_Asian_Width specified: ' + codePoint)
