Compute the intersection of two regular expressions in Python 3

Update 2014-09-10

Thank you for reading "Compute the intersection of two regular expressions in Python 3". This project now lives on GitHub.

I undertook a project to make it possible to compute the intersection between two regular expressions in Python 3.

Elementary automata theory tells us that the intersection of any two regular languages is a regular language, but carrying out this operation on actual regular expressions to generate a third regular expression afterwards is much harder than doing so for the other operations under which the regular languages are closed (concatenation, alternation, Kleene star closure).

Code

For the purposes of this project I developed two modules.

  1. First I developed fsm.py (master), a Python 3 library for creating and manipulating finite state machines in Python.
  2. Next, lego.py (master), a Python 3 library for creating and manipulating regular expressions as data structures.

The important features of these modules are:

  • lego.parse(), which takes a regular expression from the form of a string and turns it into a lego object, a representation of it which can be operated upon.
  • lego.__str__(), (written str(lego1)) which is the reverse: it returns a string regular expression for the present lego object.
  • lego.fsm(alphabet), which turns a lego object into a fsm object, a finite state machine which recognises exactly the strings that the original regular expression can match.
  • fsm.__and__() (written result = fsm1 & fsm2), which computes the intersection between two fsm objects and returns a new fsm which is only capable of recognising strings which both fsm1 and fsm2 could recognise.
  • fsm.lego() which uses the Brzozowski algebraic method to convert an fsm back into a lego object.
  • lego.__and__() (written using result = lego1 & lego2), which uses the preceding three methods to compute the intersection between two lego objects in the form of a third lego object.

Usage

>>> from greenery.lego import parse
>>> print(parse("abc...") & parse("...def"))
abcdef
>>> print(parse("\d{4}-\d{2}-\d{2}") & parse("19.*"))
19\d\d-\d\d-\d\d
>>> print(parse("\W*") & parse("[a-g0-8$%\^]+") & parse("[^d]{2,8}"))
[$%\^]{2,8}
>>> print(parse("[bc]*[ab]*") & parse("[ab]*[bc]*"))
([ab]*a|[bc]*c)?b*
>>> print(parse("a*") & parse("b*"))

>>> print(parse("a") & parse("b"))
[]

In the penultimate example, the empty string "" is returned, because only the empty string is in both of the languages a* and b*.

In the final example, an empty character class has been returned. An empty character class can never match anything, which means that this is the smallest representation of a regular expression which matches no strings at all. (Note that this is different from only matching the empty string.)

I would provide more examples if I could think of any. To be honest I have no idea how useful this is. This was mainly (1) a toy project for me to learn Python and (2) the most algorithmically complex thing I've ever implemented which made it incredibly good fun to create. If you can get use out of it, good luck to you!

With all of these pieces in place, a full script which can compute the intersection of several regular expressions is as short as this.

Hints and limitations

What can I use?

The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,}, {m,n}, (), |, [], ^ (within [] character ranges), - (ditto), \ to escape any of the preceding characters.

These character escapes are possible: \t, \r, \n, \f, \v.

These predefined character sets also have their usual meanings: \w, \d, \s and their negations \W, \D, \S.

In other words, the most common formations are generally okay.

Don't use the ^$ regex metacharacters.

By default, the program assumes that all regexes put into it are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves.

If you want to not anchor at the start or end of the string, put .* at the start or end of your regex respectively.

This is because computing the intersection between .*a.* and .*b.* (1) is largely pointless and (2) usually results in gibberish coming out of the program.

Don't use back-references.

Strictly speaking, regular-in-the-mathematical-sense languages cannot include back-references. The inclusion of back-references moves us out of the regular languages and into what Wikipedia seems to call the "language of squares". For example, the "regular expression" (.*)\1 describes the set of repeating strings: "aa", "papa", "wikiwiki", and so on. This is not a regular language. Because it is not a regular language, it does not have an equivalent finite state machine and the algorithms employed by this program cannot be applied.

In fact, "square" languages require a full-blown Turing machine to handle.

Turing machines can of course be combined, but turning a Turing machine back into a regular expression with backreferences is basically impossible.

Don't use (?...) constructs.

Python's regular expression syntax is astoundingly powerful and I didn't bother to implement several of its features just to maintain my sanity.

Sorry, specific group references using () are not preserved.

Parentheses are used to alternate between multiple possibilities e.g. (a|bc) only. The use of parentheses to identify specific groups of characters does not alter the content of the regular language which a regex describes; it only takes effect in an actual match operation, which the program doesn't perform.

The simplest example of how placing parentheses afterwards would be impossible:

> python main.py "(ab)c" "a(bc)"
abc

Don't use the greedy operators *?, +?, ??, {m,n}?.

A regular expression describes a regular language, which is simply a list of finite strings. As with parentheses above, greedy/non-greedy operators do not actually modify the list of strings, they only come into use for actual match operations.

Discussion (14)

2010-10-10 00:18:50 by AidanDelaney:

You may (or may not) be interested in Mira, a Haskell regular language library I inherited from Prof. Thompson at Kent - available at http://patch-tag.com/r/balor/Mira/home. And also AMoRE, which I consider bitrotted beyond redemption - available at https://code.launchpad.net/~a-j-delaney/libamore/cunit-dev. I think this is the latest publicly available branch of AMoRE, though I haven't touched it in over a year.

2010-10-10 02:13:50 by JeremyBowers:

I think TM-recognized languages would have to be complete under intersection; the contrapositive would be that there are two Turing-recognized languages that when added to each other produce a non-Turing complete language and while I'm not sketching a full proof that does not sound like a likely outcome. Seems like (expression1)|(expression2) would do the trick; it may not be minimal but it would be correct.

2010-10-10 02:14:59 by JeremyBowers:

Sorry, I am a moron, intersection, not union. Still, obviously writing the Turing machine intersection manually would not be hard, on the level of a homework assignment.

2010-10-11 16:32:59 by Abdiel:

Turing languages are indeed closed under intersection; simulate the run of two Turing machines concurrently on two identical (in the beginning) tapes, accept only if both TMs are in an accepting state. The details of the construction will be exactly the same as in the proof that regular languages are closed under any set operation. I am sure you can find the latter anywhere on the Internet.

2010-10-11 18:20:07 by qntm:

Of course Turing-complete languages are closed under intersection. That much I already stated. My theory is that (1) regexes-plus-backreferences form a slightly weaker set than Turing-complete languages, (2) a regex-plus-backreferences can be represented using a less sophisticated automaton than a full Turing machine, and (3) these less sophisticated automata can be combined and turned back into regexes-plus-backreferences.

2010-10-12 17:31:50 by Abdiel:

A stack machine with two stacks should be able to handle backreferences. Keep all already encountered matched referenced blocks on one stack, separated by a special symbol. When you come across a reference, shuffle through the stack (using the second stack as temporary storage) and find the reference you need. Match that against the word, and shuffle the stacks back. I didn't attempt a formal proof, but I think this idea should work.

2010-10-12 17:35:24 by Abdiel:

And I just searched a bit more, and found out (and figured out at the same time) that a stack machine with two stacks is equiivalent to a Turing machine. Disregard the last comment. :-)

2010-10-12 17:39:23 by qntm:

But the question is this: can such a two-stack automaton be converted back into a regular expression complete with backreferences? And further, can two two-stack automata be combined into one two-stack automaton? The first question is really the critical point.

2010-10-22 18:36:06 by pozorvlak:

The answer to the second question is clearly "yes": since two-stack automata and Turing machines are equivalent, we can convert both 2SAs to TMs, combine the TMs into a single TM, and then convert that back into a 2SA. As to the first question, <a href="http://www.perlmonks.org/?node_id=809842">this author</a> thinks that PCRE-recognisable languages are a subset of the context-sensitive languages (and thus recognisable by linear bounded automata). It's actually pretty clear that pre-5.10, non-recursive regexes are a strict subset of the recursively enumerable languages, because they can't be used to recognise recursive languages like HTML. So no, you can't expect to convert an arbitrary 2SA into a regex-with-backrefs, even a post-Perl 5.10 recursive one.

2011-07-13 16:33:51 by Jacopo:

Thank for your solution, it is very useful to me. I was trying to compute the intersection of the following regular expressions: a.* .*b The expected result was a.*b, but the script returns a+([^a]a+|[^ab]|b)*b. I think the two results equivalent and that the difference is due to the FSA to regular expression conversion algorithm. Do you know if it is possible to simplify the script result in order to obtain the expected one?

2011-07-13 17:14:05 by qntm:

There is no canonical form for a regular expression, and the simplification of regular expressions is actually very computationally difficult as far as I am aware. Such simplification as is performed can be seen in legoops.py. The subroutines legoconcatenate(), legoalternate(), legomultiply(), _legoreduce(), _concprefix(), _concsuffix(), allSingletons(), allCharclasses(), qmIfy(), getCommonMultiplier(), unifyMultiplicands(), _leftover() and _commonmultpart() all factor into this. As you can see, the procedure is already insanely complicated and I was beginning to lose my grip on it at the time when I abandoned the task. In other words I did my best but I'm afraid I'm stumped as to how to proceed further. Anybody who wants to throw in some ideas is welcome to.

2012-08-04 00:58:03 by qntm:

After a great deal of effort I refactored almost all of lego.py so that all of the reduction routines are stored in charclass.reduce(), mult.reduce(), conc.reduce() and pattern.reduce() where they're clearly delimited and simple to understand. This is a big step up from what I had before (which included, for example, a function called "_leftovers" with the useful comment "this is basically magic"). It also means the reduction procedures are easily extensible now.

2016-09-27 15:06:01 by Anonymous:

Thanks for a great python package. It was exactly what I need for determining overlapping lex-elements.

2018-03-10 16:49:28 by Alex:

Thanks for your awesome library! print(parse(".*\*/.*").everythingbut()) This snippet outputs regex matching all strings not containing the end of comment token, which is exactly what I need!

New comment by :

Plain text only. Line breaks become <br/>
The square root of minus one: