Efficiently encoding binary data in Tweets

tl;dr: if you wish to efficiently encode binary data as Tweets,

  • don't use Base64 (210 bytes per Tweet)
  • normally use Base2048 (385 bytes per Tweet)
  • unless you can circumvent the client-side check, in which case use Base65536 (560 bytes per Tweet)

So anyway

How much binary data can you fit into the text of a Tweet?

For this piece of work I'll be ignoring the prospect of using automatically-minified URLs. Let's focus on the pure, unmodified text of the Tweet. This is a Unicode string.

Note that this string is subject to normalization when in transit through the guts of Twitter. This means that we can't just use any character we like for this purpose — such normalization may alter the text of the Tweet, which for our purposes constitutes data corruption. We need to stick with characters which do not change when normalized. For a longer discussion of this topic, see What makes a Unicode code point safe?
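As a rough illustration of the "does not change when normalized" requirement, here is a minimal Python sketch. It assumes that checking all four standard Unicode normalization forms is a reasonable, conservative test; the article does not say which form Twitter actually applies, and the function name `survives_normalization` is made up for this example.

```python
import unicodedata

# Assumption: test every standard form, since the exact form applied in
# transit is not stated here. This is stricter than strictly necessary.
NORMALIZATION_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives_normalization(text: str) -> bool:
    """Return True if `text` is unchanged under all four normalization forms."""
    return all(
        unicodedata.normalize(form, text) == text
        for form in NORMALIZATION_FORMS
    )


if __name__ == "__main__":
    for sample in ("abc", "\u00e9", "\ufb01"):
        print(repr(sample), survives_normalization(sample))
```

A string which fails this check could be rewritten in transit, so an encoding alphabet would need to avoid the characters responsible.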

Additionally, the amount of data which will fit in a Tweet depends on the maximum allowable length of a Tweet, which in turn depends on how Twitter computes the length of the Tweet. There are at least three different metrics of which I am currently aware:

  • v1: maximum Tweet length is 140. Tweet length is measured in Unicode characters. Every character has length 1.
  • v2 (client-side): maximum Tweet length is 280. Tweet length is measured in Unicode characters. Every character has length 1, except for code points from U+1100 HANGUL CHOSEONG KIYEOK upwards, which have length 2.
  • v2 (server-side): maximum Tweet length is 280. Tweet length is measured in Unicode characters. Every character has length 1.
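The three metrics above can be sketched in a few lines of Python. This is only an illustration of the rules as described in the list, not Twitter's actual code; the metric labels, `weighted_length` and `fits` are names invented for this example, and "Unicode character" is taken to mean one code point.

```python
# First code point that counts double under the v2 client-side metric.
HEAVY_START = 0x1100

# Maximum Tweet length under each metric, as listed above.
MAX_LENGTH = {"v1": 140, "v2-client": 280, "v2-server": 280}


def weighted_length(text: str, metric: str) -> int:
    """Length of `text` under the named metric (labels are hypothetical)."""
    if metric in ("v1", "v2-server"):
        # Every code point has length 1.
        return len(text)
    if metric == "v2-client":
        # Code points from U+1100 upwards have length 2.
        return sum(2 if ord(ch) >= HEAVY_START else 1 for ch in text)
    raise ValueError(f"unknown metric: {metric}")


def fits(text: str, metric: str) -> bool:
    """True if `text` is within the maximum length for the named metric."""
    return weighted_length(text, metric) <= MAX_LENGTH[metric]
```

For example, a string of 280 code points drawn from above U+1100 fits under the v2 server-side metric but has length 560 under the v2 client-side metric, so the client rejects it.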

Observing these metrics we quickly spot something interesting: Base64, the most obvious tool for any task involving transmitting binary data through a text system, does not make the most effective possible use of the range of available characters. In the first and third cases, we would be far better off making the best possible use of the entire Unicode space. In the second case, our best option is to make the best possible use of those 0x1100 = 4,352 "light" code points and use none of the remaining 1,109,760 "heavy" code points.
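The code point counts above, and the reason a larger alphabet helps, can be checked with a little arithmetic. The sketch below computes an idealised upper bound only: it assumes every code point in the alphabet is usable, which ignores the normalization constraint discussed earlier, so real encodings will land below it. The name `upper_bound_bytes` is made up for this example.

```python
import math

UNICODE_CODE_POINTS = 0x110000  # U+0000 through U+10FFFF
LIGHT_CODE_POINTS = 0x1100      # U+0000 through U+10FF, length 1 each
HEAVY_CODE_POINTS = UNICODE_CODE_POINTS - LIGHT_CODE_POINTS


def upper_bound_bytes(alphabet_size: int, characters: int) -> int:
    """Most whole bytes that `characters` symbols from an alphabet of
    `alphabet_size` distinct symbols could carry (idealised bound)."""
    return math.floor(characters * math.log2(alphabet_size) / 8)


if __name__ == "__main__":
    print(LIGHT_CODE_POINTS, HEAVY_CODE_POINTS)
    print(upper_bound_bytes(64, 280))    # a 64-symbol alphabet
    print(upper_bound_bytes(2048, 280))  # a 2048-symbol alphabet
```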

There are better solutions!

Comparison

Figures have been rounded down to the nearest whole byte. Or, to put it another way, if a cell says something like "17 bytes per Tweet", this means it is possible to express any byte sequence of length 0 to 17 bytes inclusive in a Tweet using this encoding. It may be possible to express some, but not all, longer byte sequences using the same encoding. Or it may not.

Encoding      Implementation      Bytes per Tweet
                                  v1     v2 (client)   v2 (server)

Using only lightweight characters:
Binary        everywhere          17     35            35
Hexadecimal   everywhere          70     140           140
Base64        everywhere          105    210           210
Base85        everywhere          112    224           224
Base2048      base2048            192    385           385

Using all of Unicode:
Base65536     base65536 et al.    280    280           560
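The figures in this table can be reproduced with a short calculation. The sketch below is an idealised model, not any real encoder: it assumes each encoding packs a fixed number of bits into a fixed number of characters (for example, Base85 packs 32 bits into 5 characters) and ignores padding details. The dictionary layout and the name `bytes_per_tweet` are invented for this example.

```python
# encoding -> (bits per group, characters per group, uses heavy characters)
ENCODINGS = {
    "Binary": (1, 1, False),
    "Hexadecimal": (4, 1, False),
    "Base64": (6, 1, False),
    "Base85": (32, 5, False),
    "Base2048": (11, 1, False),
    "Base65536": (16, 1, True),
}

# metric -> (maximum length, whether heavy characters count double)
METRICS = {
    "v1": (140, False),
    "v2 (client)": (280, True),
    "v2 (server)": (280, False),
}


def bytes_per_tweet(encoding: str, metric: str) -> int:
    """Whole bytes that fit in one Tweet, rounded down."""
    bits, chars, heavy = ENCODINGS[encoding]
    max_length, heavy_counts_double = METRICS[metric]
    # Heavy characters use up two units of length under the client metric.
    weight = 2 if (heavy and heavy_counts_double) else 1
    usable_chars = max_length // weight
    total_bits = usable_chars * bits // chars
    return total_bits // 8


if __name__ == "__main__":
    for name in ENCODINGS:
        print(name, [bytes_per_tweet(name, m) for m in METRICS])
```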

Observations

Optimising for the first metric (v1) is equivalent to optimising for UTF-32-encoded text. For a separate discussion of this topic, see Efficiently encoding binary data in Unicode. Base65536 was created to meet this need, in the era of shorter Tweets.

Optimising for the third metric (v2 server-side) is equivalent to optimising for the first (v1), and Base65536 is still the best choice here. However, this involves circumventing a check in Twitter's web client (and, I assume, in other clients), and it's reasonable to assume that the majority of Twitter users are unwilling or unable to circumvent this check. This brings us to the second metric (v2 client-side), for which Base2048 had to be developed as a new piece of work.

If Twitter changes its metrics again then probably all of this work will end up thrown out, but oh well.

Discussion (6)

2017-12-18 06:51:22 by Satish:

Twitter has recently changed their maximum length from 140 to 280 characters. Probably you should update the article (double those numbers).

2017-12-18 07:10:58 by hxka:

Satish: that update is the only reason this article has been written.

2017-12-18 09:07:46 by qntm:

Satish: probably you should read it...

2017-12-18 13:04:03 by Voidhawk:

What's the largest size image twitter will transmit without compression?

2017-12-19 15:55:19 by Satish:

hxka, qntm: I indeed read the article. You say that there are three different metrics (v1, v2 server, v2 client) but certainly v1 is obsolete. So, why bother with 140 ? Is it just for completeness ? Then why not v1 server and v1 client as well -- this is not explained in TFA.

2017-12-19 15:57:09 by qntm:

In earlier versions of Twitter the client-side and server-side length checks were identical.
