tl;dr: if you wish to efficiently encode binary data as Tweets,
- don't use Base64 (210 bytes per Tweet)
- normally use Base2048 (385 bytes per Tweet)
- unless you can circumvent the client-side check, in which case use Base65536 (560 bytes per Tweet)
How much binary data can you fit into the text of a Tweet?
For this piece of work I'll be ignoring the prospect of using automatically-shortened URLs. Let's focus on the pure, unmodified text of the Tweet. This is a Unicode string.
Note that this string is subject to normalization when in transit through the guts of Twitter. This means that we can't just use any character we like for this purpose — such normalization may alter the text of the Tweet, which for our purposes constitutes data corruption. We need to stick with characters which do not change when normalized. For a longer discussion of this topic, see What makes a Unicode code point safe?
Additionally, the amount of data which will fit in a Tweet depends on the maximum allowable length of a Tweet, which in turn depends on how Twitter computes the length of the Tweet. There are at least three different metrics of which I am currently aware:
- v1: maximum Tweet length is 140. Tweet length is measured in Unicode characters. Every character has length 1.
- v2 (client-side): maximum Tweet length is 280. Tweet length is measured in Unicode characters. Every character has length 1, except for characters from U+1100 HANGUL CHOSEONG KIYEOK upwards, which have length 2.
- v2 (server-side): maximum Tweet length is 280. Tweet length is measured in Unicode characters. Every character has length 1.
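The three metrics above can be sketched directly. This is a minimal illustration assuming the weights are exactly as stated: v1 and v2 (server-side) count every code point as 1, while v2 (client-side) counts code points from U+1100 upwards as 2.

```python
def length_v1(text: str) -> int:
    # v1 and v2 (server-side): every code point counts as 1
    # (in Python 3, len() counts code points, including astral ones)
    return len(text)

def length_v2_client(text: str) -> int:
    # v2 (client-side): code points from U+1100 upwards count as 2
    return sum(2 if ord(c) >= 0x1100 else 1 for c in text)

assert length_v1("hello") == 5
assert length_v2_client("hello") == 5
assert length_v2_client("\u1100") == 2  # HANGUL CHOSEONG KIYEOK is "heavy"
```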
Observing these metrics, we quickly spot something interesting: Base64, the most obvious tool for any task involving transmitting binary data through a text system, does not make the most effective possible use of the range of available characters. In the first and third cases, we would be far better off making the best possible use of the entire Unicode space. In the second case, our best option is to make the best possible use of those 0x1100 = 4,352 "light" code points and use none of the remaining 1,109,760 "heavy" code points.
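The capacity arithmetic behind that observation is simple, assuming the character limits quoted in this article. Each encoding carries a fixed number of bits per character: Base64 carries 6, Base2048 carries 11, Base65536 carries 16.

```python
def bytes_per_tweet(bits_per_char: int, max_chars: int) -> int:
    # Round down: only whole bytes count
    return bits_per_char * max_chars // 8

assert bytes_per_tweet(6, 280) == 210    # Base64 under the 280-char metrics
assert bytes_per_tweet(11, 280) == 385   # Base2048, v2 (client-side)
assert bytes_per_tweet(16, 280) == 560   # Base65536, v2 (server-side)
```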
There are better solutions!
Figures have been rounded down to the nearest whole byte. Or, to put it another way, if a cell says something like "17 bytes per Tweet", this means it is possible to express any byte sequence of length 0 to 17 bytes inclusive in a Tweet using this encoding. It may be possible to express some, but not all, longer byte sequences using the same encoding. Or it may not.
| | Encoding | Implementation | Bytes per Tweet (v1) | Bytes per Tweet (v2, client) | Bytes per Tweet (v2, server) |
|---|---|---|---:|---:|---:|
| Using only lightweight characters | Binary | everywhere | 17 | 35 | 35 |
| | Base64 | everywhere | 105 | 210 | 210 |
| | Base2048 | base2048 | 192 | 385 | 385 |
| Using all of Unicode | Base65536 | base65536 | 280 | 280 | 560 |
Optimising for metric 1 is equivalent to optimising for UTF-32-encoded text. For a separate discussion of this topic, see Efficiently encoding binary data in Unicode. Base65536 was created to meet this need, in the era of shorter Tweets.
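The core idea of Base65536 is packing 16 bits into each code point. Here is a toy illustration of that arithmetic only. Note this is NOT the real Base65536 alphabet: the real library draws its 65,536 characters from hand-picked normalization-safe blocks and has a separate scheme for an odd trailing byte, whereas this sketch naively offsets into the astral planes and handles even-length input only.

```python
BASE = 0x10000  # arbitrary demo offset, not what Base65536 actually uses

def toy_encode(data: bytes) -> str:
    if len(data) % 2:
        raise ValueError("this toy handles even-length input only")
    return "".join(
        chr(BASE + (data[i] << 8 | data[i + 1]))
        for i in range(0, len(data), 2)
    )

def toy_decode(text: str) -> bytes:
    out = bytearray()
    for c in text:
        v = ord(c) - BASE
        out.extend((v >> 8, v & 0xFF))
    return bytes(out)

assert toy_decode(toy_encode(b"hiya")) == b"hiya"
assert len(toy_encode(bytes(280))) == 140  # 280 bytes in 140 code points
```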
Optimising for metric 3 is equivalent to optimising for metric 1, and Base65536 is still the best choice here. However, this involves circumventing a check in Twitter's web client (and, I assume, in other clients), and it's reasonable to assume that the majority of Twitter users are unwilling or unable to circumvent this check. This brings us to metric 2, which required a new piece of work: Base2048.
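The ratio Base2048 achieves follows from using 2048 of the "light" code points, i.e. 11 bits per character. The real library uses a curated, normalization-safe repertoire plus a small tail alphabet for leftover bits; this sketch only does the capacity arithmetic.

```python
import math

def chars_needed(n_bytes: int) -> int:
    # 8 bits per byte, 11 bits per Base2048 character, rounded up
    return math.ceil(n_bytes * 8 / 11)

assert chars_needed(385) == 280  # 385 bytes just fit the 280-char limit
assert chars_needed(386) == 281  # one byte more and the Tweet overflows
```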
If Twitter changes its metrics again then probably all of this work will end up thrown out, but oh well.