tl;dr: if you wish to efficiently encode binary data as Tweets,
How much binary data can you fit into the text of a Tweet?
For this piece of work I'll be ignoring the prospect of using automatically-minified URLs. Let's focus on the pure, unmodified text of the Tweet. This is a Unicode string.
Note that this string is subject to normalization when in transit through the guts of Twitter. This means that we can't just use any character we like for this purpose — such normalization may alter the text of the Tweet, which for our purposes constitutes data corruption. We need to stick with characters which do not change when normalized. For a longer discussion of this topic, see What makes a Unicode code point safe?
Additionally, the amount of data which will fit in a Tweet depends on the maximum allowable length of a Tweet, which in turn depends on how Twitter computes the length of the Tweet. There are at least three different metrics of which I am currently aware:
Observing these metrics we quickly spot something interesting: Base64, the most obvious tool for any task involving transmitting binary data through a text system, does not make the most effective possible use of the range of available characters. In the first and third cases, we would be far better off making the best possible use of the entire Unicode space. In the second case, our best option is to make the best possible use of those 0x1100 = 4,352 "light" code points and use none of the remaining 1,109,760 "heavy" code points.
There are better solutions!
Figures have been rounded down to the nearest whole byte. Or, to put it another way, if a cell says something like "17 bytes per Tweet", this means it is possible to express any byte sequence of length 0 to 17 bytes inclusive in a Tweet using this encoding. It may be possible to express some, but not all, longer byte sequences using the same encoding. Or it may not.
Encoding | Implementation | Bytes per Tweet | |||
---|---|---|---|---|---|
v1 | v2 (client) | v2 (server) | |||
Using only lightweight characters | Binary | everywhere | 17 | 35 | 35 |
Hexadecimal | everywhere | 70 | 140 | 140 | |
Base64 | everywhere | 105 | 210 | 210 | |
Base85 | everywhere | 112 | 224 | 224 | |
Base2048 | base2048 |
192 | 385 | 385 | |
Using all of Unicode | Base65536 | base65536 et al. |
280 | 280 | 560 |
Optimising for metric 1 is equivalent to optimising for UTF-32-encoded text. For a separate discussion of this topic, see Efficiently encoding binary data in Unicode. Base65536 was created to meet this need, in the era of shorter Tweets.
Optimising for metric 3 is equivalent to optimising for metric 1, and Base65536 is still the best choice here. However, this involves circumventing a check in Twitter's web client (and, I assume, in other clients), and it's reasonable to assume that the majority of Twitter users are unwilling or unable to circumvent this check. This brings us to metric 2, which required a new piece of work to develop Base2048.
If Twitter changes its metrics again then probably all of this work will end up thrown out, but oh well.
Discussion (6)
2017-12-18 06:51:22 by Satish:
2017-12-18 07:10:58 by hxka:
2017-12-18 09:07:46 by qntm:
2017-12-18 13:04:03 by Voidhawk:
2017-12-19 15:55:19 by Satish:
2017-12-19 15:57:09 by qntm: