Regular expression parsing in Python

Update 2014-09-10

Thank you for reading "Regular expression parsing in Python". This project now lives on GitHub.

This module provides methods for reading a regular expression, as provided in the form of a string, into a manipulable nested data structure, and for manipulating that data structure.

Note that this is an entirely different concept from that of simply creating and using those regexes, functionality which is present in basically every programming language in the world, Python included.

Critically, this module provides three relatively rare and advanced operations:

  • reduce(), a method for the simplification of regular expressions, and
  • __and__() (written in code as foo = bar & baz), a method for taking two regular expressions and computing their intersection. Elementary automata theory demonstrates that regular languages are indeed closed under this operation, but performing it is much more difficult in the general case than the other three operations under which regular languages are closed: concatenation, alternation and Kleene star closure. (All of these are provided, naturally.)
  • strings(), which returns a generator of all the strings that this regular expression accepts.

It's been proven that there is no canonical regular expression for any specific regular language. Many regular expressions describe the precise same language and many of those regexes are incomprehensibly long. It's also been proven that the process of simplifying a regular expression to any significant extent is seriously computationally challenging. Therefore, it is impossible for lego.py to do all of this perfectly, but it has some capabilities. Some automatic reductions that it can perform are:

  • (ab|cd|ef|)g to (ab|cd|ef)?g
  • ([ab])* to [ab]*
  • ab?b?c to ab{0,2}c
  • a(d(ab|a*c)) to ad(ab|a*c)
  • 0|[2-9] to [02-9]
  • abc|ade to a(bc|de)
  • xyz|stz to (xy|st)z

More, less obvious reductions could be bolted on to each of the various reduce() methods that are available.

Using lego.py

Use the lego.parse() function to turn a string into a lego object, which is a representation of a regular expression. From there you can use __add__() (the + operator), __or__() (the | operator) and __and__() (the & operator) to combine regular expression pieces into new ones. You may also multiply any lego object by a multiplier object, using __mul__() (the * operator), as in:

x = lego.parse("abc")
x = x * multiplier(bound(0), inf)
print(x) # "(abc)*"

Some useful available constants are:

  • the bound object inf
  • multiplier qm (multiplier(bound(0), bound(1)))
  • multiplier star (multiplier(bound(0), inf))
  • multiplier plus (multiplier(bound(1), inf))
  • the empty string emptystring
  • the character classes w, W, s, S, d, D and dot.

The lego object will maintain a reduced internal form throughout these processes.

Regex syntax

The regex syntax supported by lego.parse() is not up to the standard of sre_parse and quite a lot of the full Python regular expression syntax was omitted. Partially this was for the sake of my sanity; partially this was because some of Python's regular expression features are not, strictly speaking, regular. Nevertheless the majority of the features of a normal POSIX regex are present.

Edge cases in the pattern.parse() method

Building this parser was eye-opening because of the enormous wealth of preposterous formations which a regex can take. In general, I have aimed for rigorous simplicity. For example, a hyphen must be escaped in a character class even if it appears first or last. [-abc] is a syntax error, write [\-abc]. Likewise, escaping something which doesn't need it is a syntax error too: [\ab] resolves to neither [\\ab] nor [ab]. An empty charclass [] is legal and matches no characters: when used in a regex, the regex may match no strings.

Name

I spent a long time trying to find an appropriate metaphor for what I was trying to do: "I need an X such that lots of Xs go together to make a Y, but lots of Ys go together to make an X". Unfortunately the real world doesn't seem to be recursive in this way so I plumped for "lego" as a basic catchall term for the various components that go together to make up a data structure.

Discussion (10)

2010-08-01 20:35:27 by Ross:

So the impossible regex "a^" matches the empty string? That seems wrong.

2010-08-01 20:58:46 by frymaster:

you could rephrase as "does any valid regexp match no strings" then :P (also, when did the "prove you're human" thing change? I actually had to wiki that one...)

2010-08-01 22:00:47 by qntm:

There's two ways of looking at it. Either "^" is a special character and every string begins with it, in which case theoretically you can just have a string with a special "^" in the middle of it. Or it's a specific instruction to your regex parser and you're no longer dealing with regular-regular languages.

2010-08-02 19:52:38 by Ross:

All right, I'll buy that explanation, in which case your "regular languages" problem domain (I was going to call it a "toy" domain, but that's overstating it) becomes slightly less useful. Anchored regexes are nearly as essential to practical regex use as character classes.

2010-08-02 19:55:02 by Ross:

@frymaster: "a^" is a perfectly valid and legal regexp in every practical regular expression library you'll ever find: Perl, javascript, .NET, even Python. It just happens to match nothing, and is therefore not a useful regex, but it's valid. Another case is "$^", where you can argue about whether it should match the empty string or not.

2010-08-02 23:11:16 by qntm:

Regex anchoring is just a shortcut in any case. It's trivial to just consider all regexes to automatically anchor at the beginning and end of a string; if you want to NOT do that, add ".*" at the beginning and end. The truly irregular regex formations are things like back-references.

2010-08-08 03:22:32 by OvermindDL:

Have you ever looked at PEG expression parsing? Unlike regex it is deterministic, and parsers for it are even more simple and run a great deal faster. Boost.Spirit is a popular C++ PEG parser framework for example.

2011-09-14 13:50:39 by mattias:

What about "[^\w\W]"? i stumbled up on those kind expressions when converting regex into automatons

2011-09-14 14:00:01 by qntm:

Ooh, interesting.

2012-09-10 18:27:19 by qntm:

I've decided that [], representing an empty character class, is an ideal representation of a regular expression which matches no strings.

New comment by :

Plain text only. Line breaks become <br/>
The square root of minus one: