In part one I made a Python-based library for creating and manipulating finite state machines in Python, using fsm.py and fsmops.py.
In part two I made a Python-based library for parsing a regular expression into a data structure and manipulating them, using lego.py and legoops.py.
Most recently I posted a semi-original algorithm for converting a finite state machine back into a regular expression and promised a Python implementation. So here we are!
The FSM and regex libraries are prerequisites for these to work correctly.
mainops.py
This library contains two basic functions.
As you saw before, legoops.py contains a legobuild() function which turns a regular expression from a string into a pattern object. Here, fsmbuild() takes a pattern object and turns it into a finite state machine object.
The clever bit, however, is regexbuild() which takes a finite state machine object and, using the aforementioned algorithm, converts it back into a pattern (and then returns it as a string).
main.py
This script computes the intersection of any (*see Hints) combination of regular expressions. Sample usage:
> python main.py "[bc]*[ab]*" "[ab]*[bc]*"
([ab]*a|[bc]*c)?b*
> python main.py "abc..." "...def"
abcdef
> python main.py "\d{4}-\d{2}-\d{2}" "19.*"
19\d\d-\d\d-\d\d
> python main.py "\W*" "[a-g0-8$%\^]+" "[^d]{2,8}"
[$%\^]{2,8}
> python main.py "a*" "b*"
> python main.py "a" "b"
None
Note: in the penultimate example, the program has returned an empty string, since only the empty string "" is in both of the regular languages a* and b*. In the final example, there is no intersection at all - this is different from only containing an empty string.
This can also be used to rapidly simplify insanely complicated regexes (when such simplification is possible):
> python main.py (aa|bb*aa)a*|((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*(aa|bb*aa)a*|((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*
[ab]*a[ab]
> python main.py [ab]*a?b*|[ab]*b?a*
[ab]*
> python main.py [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
[\dA-Fa-f]{5}
I would provide more examples if I could think of any. To be honest I have no idea how useful this is. This was mainly (1) a toy project for me to learn Python and (2) the most algorithmically complex thing I've ever implemented which made it incredibly good fun to create. If you can get use out of it, good luck to you!
Hints and limitations
What can I use?
The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,n}, (), |, [], ^ (within [] character ranges), - (ditto), \ to escape any of the preceding characters.
These character escapes are possible: \t, \r, \n, \f, \v.
These predefined character sets also have their usual meanings: \w, \d, \s and their negations \W, \D, \S. These also work when used within character ranges e.g. [^\D] will evaluate to \d.
In other words, the most common formations are generally okay.
Don't use the ^$ regex metacharacters.
By default, the program assumes that all regexes put into it are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves.
If you want to not anchor at the start or end of the string, put .* at the start or end of your regex respectively.
This is because computing the intersection between .*a.* and .*b.* is (1) largely pointless and (2) usually results in gibberish coming out of the program.
Don't use back-references.
Strictly speaking, regular-in-the-mathematical-sense languages cannot include back-references. The inclusion of back-references moves us out of the regular languages and into what Wikipedia seems to call the "language of squares". For example, the "regular expression" (.*)\1 describes the set of repeating strings: "aa", "papa", "wikiwiki", and so on. This is not a regular language. Because it is not a regular language, it does not have an equivalent finite state machine and the algorithms employed by this program cannot be applied.
In fact, "square" languages require a full-blown Turing machine to handle.
Turing machines can of course be combined, but turning a Turing machine back into a regular expression with backreferences is basically impossible.
Don't use (?...) constructs.
Python's regular expression syntax is astoundingly powerful and I didn't bother to implement several of its features just to maintain my sanity.
Sorry, specific group references using () are not preserved.
Parentheses are used to alternate between multiple possibilities e.g. (a|bc) only. The use of parentheses to identify specific groups of characters does not alter the content of the regular language which a regex describes; it only takes effect in an actual match operation, which the program doesn't perform.
A simple example to show how placing parentheses afterwards would be impossible:
> python main.py "(ab)c" "a(bc)" abc
Don't use the greedy operators *?, +?, ??, {m,n}?.
A regular expression describes a regular language, which is simply a list of finite strings. As with parentheses above, greedy/non-greedy operators do not actually modify the list of strings, they only come into use for actual match operations.
Discussion (11)
2010-10-10 00:18:50 by AidanDelaney:
2010-10-10 02:13:50 by JeremyBowers:
2010-10-10 02:14:59 by JeremyBowers:
2010-10-11 16:32:59 by Abdiel:
2010-10-11 18:20:07 by Sam:
2010-10-12 17:31:50 by Abdiel:
2010-10-12 17:35:24 by Abdiel:
2010-10-12 17:39:23 by Sam:
2010-10-22 18:36:06 by pozorvlak:
2011-07-13 16:33:51 by Jacopo:
2011-07-13 17:14:05 by Sam:
add comment