Update 2014-09-10
Thank you for reading "Compute the intersection of two regular expressions in Python 3". This project now lives on GitHub.
I undertook a project to make it possible to compute the intersection between two regular expressions in Python 3.
Elementary automata theory tells us that the intersection of any two regular languages is a regular language, but carrying out this operation on actual regular expressions to generate a third regular expression afterwards is much harder than doing so for the other operations under which the regular languages are closed (concatenation, alternation, Kleene star closure).
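To make the closure under intersection concrete, here is a minimal sketch (pure Python for illustration; not this project's actual code) of the textbook "product construction": the intersection of two DFAs is a DFA whose states are *pairs* of states, one from each machine, accepting only when both halves accept.

```python
# Sketch of the product construction. Each DFA is a tuple
# (start, accepting_set, transitions) where transitions[state][symbol]
# gives the next state.
def intersect_dfa(d1, d2):
    start1, accept1, t1 = d1
    start2, accept2, t2 = d2
    start = (start1, start2)
    transitions = {}
    accepting = set()
    todo = [start]
    seen = {start}
    while todo:
        (a, b) = todo.pop()
        # A pair state accepts iff both component states accept.
        if a in accept1 and b in accept2:
            accepting.add((a, b))
        transitions[(a, b)] = {}
        for sym in t1[a]:
            if sym in t2[b]:
                nxt = (t1[a][sym], t2[b][sym])
                transitions[(a, b)][sym] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return start, accepting, transitions

def accepts(dfa, s):
    """Run the DFA over the string s; a missing transition means reject."""
    start, accepting, transitions = dfa
    state = start
    for ch in s:
        if ch not in transitions[state]:
            return False
        state = transitions[state][ch]
    return state in accepting
```

For example, intersecting a DFA for "an even number of a's" with one for "contains at least one b" yields a machine accepting exactly the strings with both properties.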
For the purposes of this project I developed two modules:

- fsm.py (master), a Python 3 library for creating and manipulating finite state machines.
- lego.py (master), a Python 3 library for creating and manipulating regular expressions as data structures.

The important features of these modules are:

- lego.parse(), which takes a regular expression in the form of a string and turns it into a lego object, a representation of it which can be operated upon.
- lego.__str__() (written str(lego1)), which is the reverse: it returns a string regular expression for the present lego object.
- lego.fsm(alphabet), which turns a lego object into an fsm object, a finite state machine which recognises exactly the strings that the original regular expression can match.
- fsm.__and__() (written result = fsm1 & fsm2), which computes the intersection between two fsm objects and returns a new fsm which is only capable of recognising strings which both fsm1 and fsm2 could recognise.
- fsm.lego(), which uses the Brzozowski algebraic method to convert an fsm back into a lego object.
- lego.__and__() (written result = lego1 & lego2), which uses the preceding three methods to compute the intersection between two lego objects in the form of a third lego object.

Some examples:

```
>>> from greenery.lego import parse
>>> print(parse("abc...") & parse("...def"))
abcdef
>>> print(parse("\d{4}-\d{2}-\d{2}") & parse("19.*"))
19\d\d-\d\d-\d\d
>>> print(parse("\W*") & parse("[a-g0-8$%\^]+") & parse("[^d]{2,8}"))
[$%\^]{2,8}
>>> print(parse("[bc]*[ab]*") & parse("[ab]*[bc]*"))
([ab]*a|[bc]*c)?b*
>>> print(parse("a*") & parse("b*"))

>>> print(parse("a") & parse("b"))
[]
```
In the penultimate example, the empty string "" is returned, because only the empty string is in both of the languages a* and b*.
In the final example, an empty character class has been returned. An empty character class can never match anything, which means that this is the smallest representation of a regular expression which matches no strings at all. (Note that this is different from only matching the empty string.)
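Python's own re module can't even write an empty character class ([] is a syntax error there), but the distinction between "matches nothing at all" and "matches only the empty string" can still be illustrated with it, using (?!) (a lookahead that always fails) as the empty language:

```python
import re

# (?!) is a negative lookahead with an empty body: it fails on every
# input, so the pattern's language is the empty set.
assert re.fullmatch("(?!)", "") is None   # not even the empty string
assert re.fullmatch("(?!)", "x") is None

# By contrast, the empty *pattern* matches exactly the empty string:
assert re.fullmatch("", "") is not None
assert re.fullmatch("", "x") is None
```

(These are illustrations in Python's re syntax only; (?!) is not part of this program's supported syntax.)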
I would provide more examples if I could think of any. To be honest I have no idea how useful this is. This was mainly (1) a toy project for me to learn Python and (2) the most algorithmically complex thing I've ever implemented which made it incredibly good fun to create. If you can get use out of it, good luck to you!
With all of these pieces in place, a full script which can compute the intersection of several regular expressions is only a few lines long.
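A sketch of what such a script might look like, assuming the greenery.lego.parse API shown above (the fold itself is ordinary Python; this is not necessarily the project's actual main.py):

```python
# Hypothetical main.py: intersect every regex given on the command line.
import sys
from functools import reduce

def intersect_all(patterns, parse):
    """Parse each pattern, then fold the results together with &."""
    return reduce(lambda a, b: a & b, (parse(p) for p in patterns))

# With greenery installed, the whole script is essentially:
#     from greenery.lego import parse
#     print(intersect_all(sys.argv[1:], parse))
```

The fold works for anything supporting the & operator, which is why lego objects slot straight in.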
The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,}, {m,n}, (), |, [], ^ (within [] character ranges), - (ditto), \ to escape any of the preceding characters.
These character escapes are possible: \t, \r, \n, \f, \v.
These predefined character sets also have their usual meanings: \w, \d, \s and their negations \W, \D, \S.
In other words, the most common formations are generally okay.
By default, the program assumes that all regexes put into it are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves.
If you don't want your regex anchored at the start or end of the string, put .* at the start or end of it respectively.
This is because computing the intersection between .*a.* and .*b.* (1) is largely pointless and (2) usually results in gibberish coming out of the program.
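The anchored semantics can be illustrated with Python's re module (not what this program uses internally, but it draws the same distinction between whole-string and substring matching):

```python
import re

# Anchored matching: the whole input must match the pattern, as with
# re.fullmatch, rather than re.search finding the pattern anywhere.
assert re.fullmatch("abc", "abc") is not None
assert re.fullmatch("abc", "xabcx") is None

# To recover unanchored behaviour, pad the pattern with .* yourself:
assert re.fullmatch(".*abc.*", "xabcx") is not None
```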
Strictly speaking, regular-in-the-mathematical-sense languages cannot include back-references. The inclusion of back-references moves us out of the regular languages and into what Wikipedia seems to call the "language of squares". For example, the "regular expression" (.*)\1 describes the set of repeating strings: "aa", "papa", "wikiwiki", and so on. This is not a regular language. Because it is not a regular language, it does not have an equivalent finite state machine and the algorithms employed by this program cannot be applied.
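Python's re supports back-references, so the "language of squares" example can be checked directly:

```python
import re

# (.*)\1: the backreference \1 must repeat whatever group 1 captured,
# so the language is exactly the set of repeated ("square") strings.
square = re.compile(r"(.*)\1")

assert square.fullmatch("papa") is not None
assert square.fullmatch("wikiwiki") is not None
assert square.fullmatch("") is not None   # the empty string is trivially a square
assert square.fullmatch("pap") is None    # not a repetition of anything
```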
In fact, "square" languages are not even context-free: no pushdown automaton can recognise them, and you need something as powerful as a linear bounded automaton (or a full-blown Turing machine) to handle them. Such machines can of course be combined, but turning the combination back into a regular expression with back-references is basically impossible.
Python's regular expression syntax is astoundingly powerful and I didn't bother to implement several of its features just to maintain my sanity.
Parentheses are used for grouping and alternation only, e.g. (a|bc). Using parentheses to capture specific groups of characters does not alter the regular language which a regex describes; capturing only takes effect in an actual match operation, which this program doesn't perform.
The simplest example of why there would be no sensible place to put parentheses in the result:

```
> python main.py "(ab)c" "a(bc)"
abc
```
A regular expression describes a regular language, which is simply a set of finite strings (possibly an infinite set of them). As with parentheses above, greedy/non-greedy operators do not modify that set of strings; they only come into play during actual match operations.
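Python's re illustrates this: greediness changes which substring a match operation selects, but not which whole strings the pattern accepts.

```python
import re

# Greedy vs non-greedy changes the substring a search picks out...
assert re.search("a+", "caaat").group() == "aaa"   # greedy: longest run
assert re.search("a+?", "caaat").group() == "a"    # non-greedy: shortest

# ...but as whole-string recognisers the two patterns accept exactly
# the same language:
for s in ["", "a", "aaa", "b"]:
    assert bool(re.fullmatch("a+", s)) == bool(re.fullmatch("a+?", s))
```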