Thank you for reading "Regular expression parsing in Python". This project now lives on GitHub.
This module provides methods for reading a regular expression, as provided in the form of a string, into a manipulable nested data structure, and for manipulating that data structure.
Note that this is an entirely different concept from that of simply creating and using those regexes, functionality which is present in basically every programming language in the world, Python included.
Critically, this module provides three relatively rare and advanced operations:
reduce(), a method for the simplification of regular expressions, and
__and__()(written in code as
foo = bar & baz), a method for taking two regular expressions and computing their intersection. Elementary automata theory demonstrates that regular languages are indeed closed under this operation, but performing it is much more difficult in the general case than the other three operations under which regular languages are closed: concatenation, alternation and Kleene star closure. (All of these are provided, naturally.)
strings(), which returns a generator of all the strings that this regular expression accepts.
It's been proven that there is no canonical regular expression for any specific regular language. Many regular expressions describe the precise same language and many of those regexes are incomprehensibly long. It's also been proven that the process of simplifying a regular expression to any significant extent is seriously computationally challenging. Therefore, it is impossible for
lego.py to do all of this perfectly, but it has some capabilities. Some automatic reductions that it can perform are:
More, less obvious reductions could be bolted on to each of the various
reduce() methods that are available.
lego.parse() function to turn a string into a
lego object, which is a representation of a regular expression. From there you can use
| operator) and
& operator) to combine regular expression pieces into new ones. You may also multiply any
lego object by a
multiplier object, using
* operator), as in:
x = lego.parse("abc") x = x * multiplier(bound(0), inf) print(x) # "(abc)*"
Some useful available constants are:
- the empty string
- the character classes
lego object will maintain a reduced internal form throughout these processes.
The regex syntax supported by
lego.parse() is not up to the standard of sre_parse and quite a lot of the full Python regular expression syntax was omitted. Partially this was for the sake of my sanity; partially this was because some of Python's regular expression features are not, strictly speaking, regular. Nevertheless the majority of the features of a normal POSIX regex are present.
Edge cases in the
Building this parser was eye-opening because of the enormous wealth of preposterous formations which a regex can take. In general, I have aimed for rigorous simplicity. For example, a hyphen must be escaped in a character class even if it appears first or last.
[-abc] is a syntax error, write
[\-abc]. Likewise, escaping something which doesn't need it is a syntax error too:
[\ab] resolves to neither
[ab]. An empty charclass
 is legal and matches no characters: when used in a regex, the regex may match no strings.
I spent a long time trying to find an appropriate metaphor for what I was trying to do: "I need an X such that lots of Xs go together to make a Y, but lots of Ys go together to make an X". Unfortunately the real world doesn't seem to be recursive in this way so I plumped for "lego" as a basic catchall term for the various components that go together to make up a data structure.