Perform bit shuffling operations in Python

Here's another one from the "uncertain if anybody has any use for a thing like this" department.

Let's say you were trying to decode a four-byte UTF-8 character to its Unicode code point, as a 32-bit integer. You might express that conversion in the form of a neat, short string as follows:

11110aaa 10bbbbbb 10cccccc 10dddddd -> 00000000 000aaabb bbbbcccc ccdddddd

But manual bit-shuffling is tedious at the best of times. shuffle.py is a Python 3 script which takes this exact string as input and returns you some sample bit-shuffling code as output.

C:\>python shuffle.py "11110aaa 10bbbbbb 10cccccc 10dddddd -> 00000000 000aaabb bbbbcccc ccdddddd"

def shuffle(input, index):
        '''
                11110aaa 10bbbbbb 10cccccc 10dddddd
                ->
                00000000 000aaabb bbbbcccc ccdddddd
        '''
        assert index + 4 <= len(input)
        assert input[index] & 0b11111000 == 0b11110000
        assert input[index + 1] & 0b11000000 == 0b10000000
        assert input[index + 2] & 0b11000000 == 0b10000000
        assert input[index + 3] & 0b11000000 == 0b10000000
        return [
                0b00000000,
                (input[index] & 0b00000111) << 2 | (input[index + 1] & 0b00110000) >> 4,
                (input[index + 1] & 0b00001111) << 4 | (input[index + 2] & 0b00111100) >> 2,
                (input[index + 2] & 0b00000011) << 6 | input[index + 3] & 0b00111111,
        ], index + 4

The resulting sample code is perfectly valid and readable, and also ideal for refactoring into whatever purpose you like.

Possible improvements

There is a huge list of possible improvements which I may embark upon if it turns out that anybody has the slightest use for them.

Back to Code
Back to Things Of Interest
Facebook Twitter Reddit Email Hacker News StumbleUpon

Discussion (14)

2012-08-08 01:02:47 by Artanis:

Heh, irony. "LOOK OUT FOR THE MAGIC NUMBER" followed by "Output is a string expressing a function and 19 exec calls.

...

Oh god.

2012-08-08 01:03:33 by Sam:

Those exec() calls are unit tests.

2012-08-08 01:55:33 by Artanis:

Oh, I'm well aware of that.

Just, seeing exec() raises tons of red flags, because it usually means something incredibly wrong is going on. For example, using string interpolation to build a function.

2012-08-08 09:20:50 by Sam:

How else does one build a function?

2012-08-08 13:05:40 by Andrew:

I suppose the non-eval way of doing things would be to use closures.

Here's a dumb example:

https://gist.github.com/3294449

In the example, the new function has a list of bits to take from the input.

In your case you'd generate a list for each output bit that would say whether the resulting bit is 1, 0 or copied from a bit on the input. The function you return then uses this list against its input to generate the output. No eval required.

2012-08-08 14:20:33 by Sam:

That's all great, but how do you get the code of the function in the form of a string?

2012-08-08 15:01:41 by Andrew:

Brainfart!

I'd been looking at this as something you'd use to produce and run functions within a program, rather than a code generation tool.

So eval/exec is fine in this case.

I'll get my coat ...

2012-08-08 15:49:20 by ejl:

Actually, Perl has some appropriately mind-boggingly horrifying features which enable the existence of modules (e.g. Data::Dumper::Streamer) which can actually dump syntactically valid code (in plain text) that represents the exact behaviour of a sub/closure -- it's generally *almost* what you wrote in the first place. I've never used it for anything other than debugging though.

2012-08-18 12:50:07 by OvermindDL:

For note, the Erlang language, designed for network communication and distribution, has most excellent bit-twiddling features. For example, to do the above conversion of:

11110aaa 10bbbbbb 10cccccc 10dddddd -> 00000000 000aaabb bbbbcccc ccdddddd


From the ugly Python above, in Erlang would be (using the same structure as the Python code in the post):

shuffle(<<_Skipping:(Index*8),Input:32/binary,_Rest/binary>>, Index) -> {shuffle(Input), Index+4}.
shuffle(<<2#11110:5, A:3, 2#10:2, B:6, 2#10:2, C:6, 2#10:2, D:6>>) ->
<<0:11, A:3, B:6, C:6, D:6>>.

Or if you wanted to return a 32-bit integer of the Unicode value instead of a binary, replace the last line above with this for example (made for clarity):
<<Returning:32/integer-native-endian>> = <<0:11, A:3, B:6, C:6, D:6>>,
Returning.

Basically, Erlang has pattern matching, including in a function definition, not required for this example but it simplified the necessary code and is more 'Erlangish'.
The <<>> denote a binary, and in the first function call it is matching what is being passed in. The <<>> takes a list of parameters separated by commas. Each parameter is of the form Value:Size/Type, where Size is a bit size for most Types (byte for the 'binary' type as 'binary' works on bytes only) and Type defaults to integer-big-endian (perfect for this use). The Value or Size can be variables passed in and will match to bind to a variable when used in a match context, such as above. There are actually more 'pretty' and more compact ways to do this, but I wanted it to be more clear for non-erlang people. The matching contexts do the same as the asserts that the Python code had and if the, for example, first 5 bits do not match 11110 then the match context will fail, thus throwing an exception.

Different languages for different purposes. Erlang was designed for networking, as you can imagine it is quite powerful for it and especially for packet management. You could practically define an entire packet inline, even with varlength fields such as: <<Size:8,String:Size/binary,_OtherStuff/binary>>
That will parse an 8 bit size field, then parse out a string field with the length specified by the size field.

2012-08-18 12:51:01 by OvermindDL:

And apparently the post ate my whitespacing, I apologize, pretend that the code is properly formatted and spaced like I originally put...

2012-08-18 13:44:40 by OvermindDL:

Oh, and yes, Erlang has something called a Parse Transform that can mutate code at compile time, so you could quite literally make a descriptor to handle something like "11110aaa 10bbbbbb 10cccccc 10dddddd -> 00000000 000aaabb bbbbcccc ccdddddd" and have it generate valid code, all at compile time. And Erlang can also dump its code as well if debug info is enabled. But as you can see, making a parser in it would also be very simple in Erlang, I'd probably start with this (untested but should be working if not close):
"""
-module(shuffle).

-export([get_code/2]).

-record(options, {
func_name = "shuffle",
accept_index = True,
return_index = True,
doc_string = True, % Too lazy to make a doc string, but one just be an extra function
check_length = True, % Useless as it always does here since Erlang does
check_input = True, % Useless as it always does here since Erlang does
return_type = "list"}).

% Not testing that the input is well formed, although by adding a counting
% param to the parse_* functions that throws if too high would fix that easily.

get_code(InputString, Opt) ->
[Opt#options.func_name,
"(<<",
get_index_skip(Opt),
"Input:32/binary,_Rest/binary>>",
get_index_param(Opt),
") ->~n ",
get_return_val(Opt),
Opt#options.func_name,
"(<<",
parse_input(InputString)]

get_index_skip(#options{accept_index=True}) -> "_Skipping:(Index*8),";
get_index_skip(#options{accept_index=False}) -> "".

get_index_param(#options{accept_index=True}) -> ", Index";
get_index_param(#options{accept_index=False}) -> "".

get_return_val(#options{accept_index=True, return_index=True}) -> "{shuffle(Input), Index+4}.~n";
get_return_val(#options{accept_index=False}) -> "shuffle(Input).~n";
get_return_val(#options{return_index=False}) -> "shuffle(Input).~n".

parse_input([$\s |In]) -> parse_input(In); % Skip spaces!
parse_input([V |In]) when V>=$0 andalso V<=$9 -> ["2#", V, parse_input(In)];
parse_input([V |In]) when V>=$a andalso V<=$z -> parse_variable(In, V, 1);
parse_input([$-, $> |In]) -> [">>) ->~n <<", parse_output(In), ">>."].

parse_variable([Var|In], Var, Count) -> parse_variable(In, Var, Count+1);
parse_variable(In, Var, Count) -> [Var, ":", Count].

parse_output([$\s |In]) -> parse_output(In); % Skip spaces!
parse_input([V |In]) when V>=$0 andalso V<=$9 -> ["2#", V, parse_output(In)];
parse_input([V |In]) when V>=$a andalso V<=$z -> parse_variable(In, V, 1);
parse_input([]) -> [].
"""

This should return a string (in Erlang a nested list of lists and string and binaries can be flattened to a single string by any and every IO function) that contains the code to parse the input, code like what I posted in my first comment. Just in case it gets too mangled here, I also put it on http://ideone.com/SlfkH but it is still untested right now, if something is wrong it would be an easy fix though, it is a simple language, pattern matching is an absolute bliss in coding, and this language well outperforms Python.

2012-08-18 13:46:43 by OvermindDL:

Er, wrong link, this one: http://ideone.com/7rz9A

2012-08-18 13:48:50 by OvermindDL:

I apologize for the comment count, lack of edit capabilities...

For note, Erlang also has a few libraries (such as PropEr) that you tell it a basis matching of types of inputs to a function to how output should match and it can generate test cases run thousands of them with input you never would have thought of, all in an attempt to break it. It is quite thorough and exceedingly useful on functional code such as this.

2012-08-18 13:51:03 by OvermindDL:

Er, also just realized I hardcoded mine to a length of 4 bytes as that is the problem I was trying to solve, it would be easy to make it arbitrary length however (at a bit level even, not just a byte level), mostly by just changing the get_code function and adding a counting length param in the rest of the functions.

(Really want an edit ability...)

add comment