Update 2014-09-10
Thank you for reading "Regular expression parsing in Python". This project now lives on GitHub.
This module provides methods for reading a regular expression, as provided in the form of a string, into a manipulable nested data structure, and for manipulating that data structure.
Note that this is an entirely different concept from that of simply creating and using those regexes, functionality which is present in basically every programming language in the world, Python included.
Critically, this module provides three relatively rare and advanced operations:
reduce()
, a method for the simplification of regular expressions, and__and__()
(written in code as foo = bar & baz
), a method for taking two regular expressions and computing their intersection. Elementary automata theory demonstrates that regular languages are indeed closed under this operation, but performing it is much more difficult in the general case than the other three operations under which regular languages are closed: concatenation, alternation and Kleene star closure. (All of these are provided, naturally.)strings()
, which returns a generator of all the strings that this regular expression accepts.It's been proven that there is no canonical regular expression for any specific regular language. Many regular expressions describe the precise same language and many of those regexes are incomprehensibly long. It's also been proven that the process of simplifying a regular expression to any significant extent is seriously computationally challenging. Therefore, it is impossible for lego.py
to do all of this perfectly, but it has some capabilities. Some automatic reductions that it can perform are:
(ab|cd|ef|)g
to (ab|cd|ef)?g
([ab])*
to [ab]*
ab?b?c
to ab{0,2}c
a(d(ab|a*c))
to ad(ab|a*c)
0|[2-9]
to [02-9]
abc|ade
to a(bc|de)
xyz|stz
to (xy|st)z
More, less obvious reductions could be bolted on to each of the various reduce()
methods that are available.
Use the lego.parse()
function to turn a string into a lego
object, which is a representation of a regular expression. From there you can use __add__()
(the +
operator), __or__()
(the |
operator) and __and__()
(the &
operator) to combine regular expression pieces into new ones. You may also multiply any lego
object by a multiplier
object, using __mul__()
(the *
operator), as in:
x = lego.parse("abc") x = x * multiplier(bound(0), inf) print(x) # "(abc)*"
Some useful available constants are:
bound
object inf
qm
(multiplier(bound(0), bound(1))
)star
(multiplier(bound(0), inf)
)plus
(multiplier(bound(1), inf)
)emptystring
w
, W
, s
, S
, d
, D
and dot
.The lego
object will maintain a reduced internal form throughout these processes.
The regex syntax supported by lego.parse()
is not up to the standard of sre_parse and quite a lot of the full Python regular expression syntax was omitted. Partially this was for the sake of my sanity; partially this was because some of Python's regular expression features are not, strictly speaking, regular. Nevertheless the majority of the features of a normal POSIX regex are present.
pattern.parse()
methodBuilding this parser was eye-opening because of the enormous wealth of preposterous formations which a regex can take. In general, I have aimed for rigorous simplicity. For example, a hyphen must be escaped in a character class even if it appears first or last. [-abc]
is a syntax error, write [\-abc]
. Likewise, escaping something which doesn't need it is a syntax error too: [\ab]
resolves to neither [\\ab]
nor [ab]
. An empty charclass []
is legal and matches no characters: when used in a regex, the regex may match no strings.
I spent a long time trying to find an appropriate metaphor for what I was trying to do: "I need an X such that lots of Xs go together to make a Y, but lots of Ys go together to make an X". Unfortunately the real world doesn't seem to be recursive in this way so I plumped for "lego" as a basic catchall term for the various components that go together to make up a data structure.
Discussion (10)
2010-08-01 20:35:27 by Ross:
2010-08-01 20:58:46 by frymaster:
2010-08-01 22:00:47 by qntm:
2010-08-02 19:52:38 by Ross:
2010-08-02 19:55:02 by Ross:
2010-08-02 23:11:16 by qntm:
2010-08-08 03:22:32 by OvermindDL:
2011-09-14 13:50:39 by mattias:
2011-09-14 14:00:01 by qntm:
2012-09-10 18:27:19 by qntm: