Turning vaguely reassuring finite-state machines into regular expressions

There's a Twitter account I like called @happyautomata, which periodically posts randomly-generated finite-state machines. Followers of this account make a hobby of turning the FSMs into regular expressions. This is always possible, because (strictly regular) regular expressions and finite-state machines are exactly equivalent to one another. And sometimes this is easy enough that you can do it in your head.

But sometimes it's very intractable, and you have to work it out by hand. But how?

Here is a worked example. Here is our original state machine:

A relatively simple four-state machine whose full structure is discussed below

Note: @happyautomata uses Firefox emojis in its images, which may disagree slightly with what you see rendered in the text below.

Note that I have added labels to the four states in our state machine: A, B, C and D. The initial state is A. Each state has transitions to other states. For example,

  • From state A, if the machine consumes a 🌹, then we jump to state B.
  • From state B, if the machine consumes a 🌡, then we jump to state B again.
  • From state D, there are no transitions.

Additionally, states with two circles are final. If, after consuming all the input, we are at state B or state D, the machine accepts the input. For example, this machine accepts the string "🌹🌼🌼", or the string "🌹", but not "🌹🌼🌳". If a transition is missing, the machine definitely doesn't accept the string. So, "🌹🌼🌹" is not accepted, and "🌡🌡🌡" definitely isn't; we just fall out of the machine if we do that.

We can represent all of this in the form of a set of simultaneous equations.

A = 🌱A | 🌹B | 🌲C | 🌡D
B = Ρ | 🌡B | 🌼C
C = 🌳C | 🌼D
D = Ξ΅

These aren't equations in the usual sense.

  • A, B, C and D aren't numbers, they are sets of strings. For example, A means "the set of accepted strings if we start at A".

  • An expression such as "🌹B" is also a set of strings: it means, "the set you get when you take every string in B and put a 🌹 on the front". An expression like "C = 🌼D" means "every string c in C has the form 🌼d for some string d in D".

  • Pipes mean alternation; you just combine the two sets.

  • Ξ΅ means the empty string. For example, because B and D are final states, the empty string is one of the strings they accept.

We need to solve for A. Let's do this from the bottom up. This technique is called the Brzozowski algebraic method. First, let's substitute the equation for D into all the earlier equations, eliminating it:

A = 🌱A | 🌹B | 🌲C | 🌡
B = Ρ | 🌡B | 🌼C
C = 🌳C | 🌼

We see for example that the string 🌡 is accepted by A.

Now let's look at the equation for C. Notice that C has a self-transition, 🌳.

C = 🌳C | 🌼

This expression means, the string 🌼 is in C, and also, for every string c in C, the string 🌳c is also in C. So, C is an infinite set of strings: 🌼, 🌳🌼, 🌳🌳🌼, 🌳🌳🌳🌼, 🌳🌳🌳🌳🌼, and so on.

Self-transitions are important and have to be dealt with first. We can do this using the Kleene star:

A = 🌱A | 🌹B | 🌲C | 🌡
B = Ρ | 🌡B | 🌼C
C = 🌳*🌼

Then, we can substitute the equation for C into all of the earlier equations as well:

A = 🌱A | 🌹B | 🌲🌳*🌼 | 🌡
B = Ρ | 🌡B | 🌼🌳*🌼

Notice how in the equations for both A and B, we have two "static" terms - pure regular expressions with no reference to other states. Let's group those together. Notice how we use the question mark metacharacter from regular expression syntax, which means "this, or the empty string":

A = 🌱A | 🌹B | 🌲🌳*🌼|🌡
B = 🌡B | (🌼🌳*🌼)?

Now let's look at the equation for B. This has a self-transition too. So let's resolve that now. (We couldn't resolve it before this.)

A = 🌱A | 🌹B | 🌲🌳*🌼|🌡
B = 🌡*(🌼🌳*🌼)?

Now let's substitute the equation for B into the equation for A:

A = 🌱A | 🌹🌡*(🌼🌳*🌼)?|🌲🌳*🌼|🌡

Finally, we can resolve that self-transition on A:

A = 🌱*(🌹🌡*(🌼🌳*🌼)?|🌲🌳*🌼|🌡)

And we're done! We have a single regular expression for A. This is a strictly regular regular expression. It features no advanced syntax such as backreferences or lookaround assertions (which are not strictly regular), and it is implicitly anchored at the start and end of the input string. This is the correct answer.

Well, it's one of the many possible correct answers. There are usually many ways to represent any particular finite-state machine as a regular expression. You can see above that there are other ways in which we could have chosen to simplify our expressions. We need not have started our substitution with D, though we do have to finish with A.

Furthermore, figuring out which of the many equivalent regular expressions is the smallest is also very computationally difficult.

Generally, regular expressions for finite-state machines can be very complex and difficult to read, even by regular expression standards.

Automating this process

A while back I created a Python package called greenery which, among other things, is designed to be able to turn finite-state machines into regular expressions. All you have to do is make sure you set up the state machine and all of its transitions correctly, manually, because it can't visually parse and interpret the delightful image up there:

from greenery import fsm, lego

plants = fsm.fsm(
    alphabet = {"🌱", "🌹", "🌡", "🌲", "🌳", "🌼"},
    states = {"A", "B", "C", "D"},
    initial = "A",
    finals = {"B", "D"},
    map = {
        "A": { "🌱": "A", "🌹": "B", "🌲": "C", "🌡": "D" },
        "B": { "🌡": "B", "🌼": "C" },
        "C": { "🌳": "C", "🌼": "D" }
    }
)

computed_regex = lego.from_fsm(plants)
print(computed_regex)

The output from this program is

🌱*((🌲|🌹🌡*🌼)🌳*🌼|🌡|🌹🌡*)

Compare with our hand-derived expression:

🌱*(🌹🌡*(🌼🌳*🌼)?|🌲🌳*🌼|🌡)

They look roughly equivalent, although it's not immediately obvious whether they actually are equivalent. Is there a way to be sure? Cunningly, there is! greenery also provides an API for determining whether two regular expressions are equivalent:

handmade_regex = lego.parse("🌱*(🌹🌡*(🌼🌳*🌼)?|🌲🌳*🌼|🌡)")
print(computed_regex.equivalent(handmade_regex))

This outputs:

True

So, there we go!

Bigger fish

Alright then, shall we try something nastier?

An alarmingly complex four-state machine whose full structure is discussed below

This is rather more complicated because when state B consumes a πŸ’™ it simultaneously transitions to both states C and D.

In effect, what this does is introduce a new "compound" state where the state machine is in both C and D at once. Let's call this new state C+D. C+D's transitions are simply C and D's transitions combined... but as we follow them, we discover more and more compound states.

This kind of thing can get very complex; theoretically, for a finite-state machine with N states and multitransitions of this kind, there can be up to 2N - 1 possible "compound" states. In this case, 4 states could have meant as many as 15 compound states after that conversion.

We also have to keep in mind that if any substate of a compound state is final, then that compound state is itself final. For example, this finite-state machine accepts the two-character string "πŸ€πŸ’™", which leaves it in state C+D, which is hence final.

The hard part, then, is actually writing the transitions out correctly.

hearts = fsm.fsm(
    alphabet = {"🀍", "πŸ’™", "❀️"},
    states = {"A", "B", "C", "D", "C+D", "A+D", "A+C", "A+B"},
    initial = "A",
    finals = {"B", "D", "C+D", "A+D", "A+B"},
    map = {
        "A": {"🀍": "B", "πŸ’™": "C"},
        "B": {"πŸ’™": "C+D", "❀️": "B"},
        "C": {"🀍": "A", "πŸ’™": "D"},
        "D": {"πŸ’™": "A"},
        "C+D": {"🀍": "A", "πŸ’™": "A+D"},
        "A+D": {"🀍": "B", "πŸ’™": "A+C"},
        "A+C": {"🀍": "A+B", "πŸ’™": "C+D"},
        "A+B": {"🀍": "B", "πŸ’™": "C+D", "❀️": "B"},
    }
)
computed_regex = lego.from_fsm(hearts)
print(computed_regex)

This outputs:

(πŸ’™(πŸ’™{2}|🀍)|🀍(❀️|πŸ’™(πŸ’™{2}🀍?πŸ’™)*πŸ’™(πŸ’™πŸ€[❀️🀍]|🀍))*πŸ’™(πŸ’™{2}🀍?πŸ’™)*🀍)*(πŸ’™{2}|🀍(❀️|πŸ’™(πŸ’™{2}🀍?πŸ’™)*πŸ’™(πŸ’™πŸ€[❀️🀍]|🀍))*(πŸ’™(πŸ’™{2}🀍?πŸ’™)*(πŸ’™(πŸ’™πŸ€)?)?)?)

Absolutely ghastly!

And what about by hand? Well, this one looks very nasty to approach by hand, but here is an equally ghastly proposed solution which may or may not have been calculated by hand. Guessing by visual inspection whether these are equivalent is likely fruitless. So let's go to the Python and see...

handmade_regex = lego.parse("(πŸ’™(🀍|πŸ’™πŸ’™)|🀍(❀️|πŸ’™(πŸ’™πŸ’™πŸ€?πŸ’™)*πŸ’™(🀍|πŸ’™πŸ€(🀍|❀️)))*πŸ’™πŸ€)*(πŸ’™πŸ’™|🀍(❀️|πŸ’™(πŸ’™πŸ’™πŸ€?πŸ’™)*πŸ’™(🀍|πŸ’™πŸ€(🀍|❀️)))*(πŸ’™(πŸ’™πŸ’™πŸ€?πŸ’™)*(πŸ’™(πŸ’™πŸ€)?)?)?)")
print(computed_regex.equivalent(handmade_regex))

This outputs:

False

Which indicates that there is a disagreement. How to find out what the discrepancy could be? Well, first we can compute the symmetric difference of these two regular expressions as a new regular expression (this is how .equivalent works internally, in fact), and then we can iterate over this new regular expression's strings:

symmetric_difference = computed_regex.symmetric_difference(handmade_regex)
string = next(symmetric_difference.strings())
print(string)
print(computed_regex.accepts(string))
print(handmade_regex.accepts(string))

This outputs:

πŸ€πŸ’™πŸ’™πŸ’™πŸ’™πŸ€πŸ€
True
False

Yes, we can see here that the string "πŸ€πŸ’™πŸ’™πŸ’™πŸ’™πŸ€πŸ€" is accepted by our computed regular expression (and by the original state machine), but, if we look closely, not by the handmade one.

In conclusion, I find all of this quite delightful and entertaining, and I hope you do too.

Discussion (8)

2020-12-06 03:49:23 by Dystopianist:

*sad non-programmer non-mathematician noises*

2020-12-06 13:04:01 by Shieldfoss:

I am suddenly reminded that it was from QNTM I actually learned regex originally. Huh. Time flies.

2020-12-06 15:37:39 by Emmett Brown:

Now that this has been added in Code, Sam has finally finished adding new entries to every section of the site!

2020-12-07 02:27:46 by Maria Szegedy:

I got a lot further than I thought I would inspecting the computed and handmade regexes to see how similar they are. A lot of it is the same up to πŸ’™{2} corresponding to πŸ’™πŸ’™ and the pipe expressions being ordered differently.

2020-12-07 08:29:39 by chridd:

I don't think combining the states in the second example is actually necessary, since regular expressions are nondeterministic; you should just be able to leave it as B = ❀️B | πŸ’™C | πŸ’™D | Ξ΅. Anyways, I got (♑❀️*πŸ’™(πŸ’™πŸ’™?|β™‘)|πŸ’™(πŸ’™πŸ’™|β™‘))*(♑❀️*(πŸ’™?πŸ’™?)|πŸ’™πŸ’™)

2021-02-13 06:28:39 by Bakkot:

I have an online tool for checking equivalence of regular expressions and finding counterexamples, in case you don't feel like firing up python: https://bakkot.github.io/dfa-lib/regeq.html (Though it doesn't support the `{2}` syntax.) It works on exactly the same principle. It was very useful for automating grading of an intro CS class when I TA'd in 2015. It's probably still used for that, come to that; I don't expect the TAs have wanted to go back to doing the equivalence checks by hand.

2021-03-18 08:45:55 by Rawling:

Codewars has 3 challenges to write regexes for "is this binary string divisible by 3/5/7", which nicely combine the state-machine-to-regex problem above with actually coming up with the state machine in the first place. I think by the time I got to 7 I'd just written a program to do it automatically - repeatedly, randomly picking how to simplify at each step to try to find the shortest overall expression, although no doubt there's a systematic way to do it.

2021-05-23 05:06:42 by kaaro:

Thank you for introducing me to competition that I did know I needed to win. See you in the reply section on Twitter.

New comment by :

Plain text only. Line breaks become <br/>
The square root of minus one: