Regular expression parsing in Python

Previously.

Okay, part two of three. The aim with this collection of modules was provide a way to read a regular expression, as provided in the form of a string, into a manipulable nested data structure, and provide various methods for manipulating that data structure while keeping the structure as simple as possible. Note that this is an entirely different concept from that of simply creating and using those regexes, functionality which is present in basically every programming language in the world, Python included.

It's been proven that there is no canonical regular expression for any specific regular language. Many regular expressions describe the precise same language and many of those regexes are incomprehensibly long. It's also been proven that the process of simplifying a regular expression to any significant extent is seriously computationally challenging. In fact, it's a really interesting subject to me because you can start with some very basic methods and continually build up really complex pattern recognition routines without negating anything that went below. You can bolt on more clever stuff.

(Once again, the standard disclaimer: I'm certain that this has already been done, inside the guts of Python if nowhere else. I did it anyway because it was a learning experience and apparently I just have a thing for writing parsers.)

lego.py

I spent a long time trying to find an appropriate metaphor for what I was trying to do: "I need an X such that lots of Xs go together to make a Y, but lots of Ys go together to make an X". Unfortunately the real world doesn't seem to be recursive in this way so I plumped for "lego" as a basic catchall term for the various components that go together to make up a data structure.

charclass is the most interesting of these "lego bricks". In a regex, a character class is either a single character or a range of characters wrapped in square brackets. The interesting thing here is that a character class can be negated, e.g. [^abc] means "any character other than a, b or c". The use of a set or frozenset for this purpose was obvious, and if I was working with a global character set of, say, 256 single-byte characters, then I could simply have charclass interpret [^abc] as the 253-character set of bytes which aren't a, b or c. Unfortunately this is already pretty unwieldy, and the fact that I'm working in Python means that the actual character set in question is Unicode, which contains 1,114,112 distinct characters.

To work around this I included a "negated" flag. [^abc] is created using charclass("abc", negateMe=True). It has three elements, but is marked as a negated class. Applying conventional set operations to a charclass can have undesirable results, but more on this shortly.

To express the dot metacharacter, ., simply create a negated empty charclass: charclass(negateMe=True).

mult is a class combining a multiplicand (a charclass) with a multiplier consisting of a finite minimum and a finite maximum (or None for infinity). This permits the creation of things like a*, b{2,5} and c which is implicitly c{1,1}.

conc permits the concatenation of arbitrarily many mults. For example, a*b{2,5}[^def]+g? is a conc. To express an empty string, use an empty conc: conc(()).

Finally, a full pattern permits the alternation between arbitrarily many concs. In a regex, alternation is represented using a pipe, |. So, abc|def would be an alt consisting of two concs: abc and def. Since regular expressions may nest inside one another in the form of subpatterns, a pattern can also be used as the multiplicand in a mult, instead of a charclass. For example, (abc|def)*.

All four "lego bricks" have a getString() method, making them printable. There are also some useful constants such as the pre-built \w, \d and \s character classes.

legoops.py

As mentioned above, once negated character classes are thrown into the mix, carrying out simple set operations like union on character classes becomes non-trivial. For example, unifying [^abc] and [^cde] will result in a set containing abcde whereas the desired result would be a negated charclass, [^c]. Such things become quite important when users start using ridiculous yet legal regex strings like [^\W\D].

To this end, legoops.py contains the following functions for carrying out these operations on possibly-negated character classes: classnegate(), classunion(), classdifference(), classintersection() and classissubset(). There are unit tests for all of these functions. In fact, unlike my finite state machine work this library is very extensively unit tested.

Three more functions, legoconcatenate(), legoalternate() and legomultiply(), provide capability to take any arbitrary collections of "lego bricks" and concatenate, alternate between, or multiply them relatively intelligently, without resulting in data structures of unnecessary complexity. For example, concatenating a and a* should result in a+, not aa*; alternating between a and ab should result in ab?, not a|ab, and multiplying a+ by {2,3} should result in a{2,}, not (a+){2,3}. As mentioned above, these functions can be arbitraily complex and advanced, and are limited only by computational power and programmer willpower. I have a few improvements which I may one day make but I've lost enthusiasm after getting to what I feel is the "good enough" stage. These functions are already very sophisticated in my opinion, making use of several highly complex helper functions. Again, almost all of this is thankfully unit tested. The input to these functions need NOT be full-fledged patterns: concs, mults and charclasses are also acceptable. The results will also be reduced as far as possible. For example, if the result is a then that will be a charclass, nothing more complex than necessary.

Finally there is the legobuild() function, which constructs an initial pattern object from an arbitrary initial string. This is not up to the standard of sre_parse and quite a lot of the full Python regular expression syntax was omitted for the sake of my sanity and/or because this particular project did not require it. Nevertheless the majority of the features of a normal POSIX regex are present. Building this parser was eye-opening because of the enormous wealth of preposterous formations which a regex can take. For example, if a ] is the first character in a regex, is that unambiguously a "right bracket" character, or does it occur as a "close character class" character and raise a syntax error? What happens when people backslash-escape things that technically don't need to be backslash-escaped? Does the backslash get counted as a character, or ignored, or is that a syntax error? How do you do a character class containing only a caret? Can you have an empty character class? (Answer: no, you can't.) Is it possible for a regular expression to match nothing? (Answer: no. Even the empty regex, "", matches one string, the empty string. There is no regex which matches no strings.)

Back to Code
Back to Things Of Interest
StumbleUpon Twitter Hacker News Facebook Reddit Digg del.icio.us Email

Discussion (9)

2010-08-01 20:35:27 by Ross:

So the impossible regex "a^" matches the empty string? That seems wrong.

2010-08-01 20:58:46 by frymaster:

you could rephrase as "does any valid regexp match no strings" then :P

(also, when did the "prove you're human" thing change? I actually had to wiki that one...)

2010-08-01 22:00:47 by Sam:

There's two ways of looking at it. Either "^" is a special character and every string begins with it, in which case theoretically you can just have a string with a special "^" in the middle of it. Or it's a specific instruction to your regex parser and you're no longer dealing with regular-regular languages.

2010-08-02 19:52:38 by Ross:

All right, I'll buy that explanation, in which case your "regular languages" problem domain (I was going to call it a "toy" domain, but that's overstating it) becomes slightly less useful. Anchored regexes are nearly as essential to practical regex use as character classes.

2010-08-02 19:55:02 by Ross:

@frymaster: "a^" is a perfectly valid and legal regexp in every practical regular expression library you'll ever find: Perl, javascript, .NET, even Python. It just happens to match nothing, and is therefore not a useful regex, but it's valid.

Another case is "$^", where you can argue about whether it should match the empty string or not.

2010-08-02 23:11:16 by Sam:

Regex anchoring is just a shortcut in any case. It's trivial to just consider all regexes to automatically anchor at the beginning and end of a string; if you want to NOT do that, add ".*" at the beginning and end.

The truly irregular regex formations are things like back-references.

2010-08-08 03:22:32 by OvermindDL:

Have you ever looked at PEG expression parsing? Unlike regex it is deterministic, and parsers for it are even more simple and run a great deal faster. Boost.Spirit is a popular C++ PEG parser framework for example.

2011-09-14 13:50:39 by mattias:

What about "[^\w\W]"? i stumbled up on those kind expressions when converting regex into automatons

2011-09-14 14:00:01 by Sam:

Ooh, interesting.

add comment