There's a Twitter account I like called @happyautomata, which periodically posts randomly-generated finite-state machines. Followers of this account make a hobby of turning the FSMs into regular expressions. This is always possible, because (strictly regular) regular expressions and finite-state machines are exactly equivalent to one another. And sometimes this is easy enough that you can do it in your head.
But sometimes it's very intractable, and you have to work it out by hand. But how?
Here is a worked example. Here is our original state machine:
Note: @happyautomata uses Firefox emojis in its images, which may disagree slightly with what you see rendered in the text below.
Note that I have added labels to the four states in our state machine: A, B, C and D. The initial state is A. Each state has transitions to other states. For example,
Additionally, states with two circles are final. If, after consuming all the input, we are at state B or state D, the machine accepts the input. For example, this machine accepts the string "πΉπΌπΌ", or the string "πΉ", but not "πΉπΌπ³". If a transition is missing, the machine definitely doesn't accept the string. So, "πΉπΌπΉ" is not accepted, and "π΅π΅π΅" definitely isn't; we just fall out of the machine if we do that.
We can represent all of this in the form of a set of simultaneous equations.
A = π±A | πΉB | π²C | π΅D
B = Ξ΅ | π΅B | πΌC
C = π³C | πΌD
D = Ξ΅
These aren't equations in the usual sense.
A, B, C and D aren't numbers, they are sets of strings. For example, A means "the set of accepted strings if we start at A".
An expression such as "πΉB" is also a set of strings: it means, "the set you get when you take every string in B and put a πΉ on the front". An expression like "C = πΌD" means "every string c in C has the form πΌd for some string d in D".
Pipes mean alternation; you just combine the two sets.
Ξ΅ means the empty string. For example, because B and D are final states, the empty string is one of the strings they accept.
We need to solve for A. Let's do this from the bottom up. This technique is called the Brzozowski algebraic method. First, let's substitute the equation for D into all the earlier equations, eliminating it:
A = π±A | πΉB | π²C | π΅
B = Ξ΅ | π΅B | πΌC
C = π³C | πΌ
We see for example that the string π΅ is accepted by A.
Now let's look at the equation for C. Notice that C has a self-transition, π³.
C = π³C | πΌ
This expression means, the string πΌ is in C, and also, for every string c in C, the string π³c is also in C. So, C is an infinite set of strings: πΌ, π³πΌ, π³π³πΌ, π³π³π³πΌ, π³π³π³π³πΌ, and so on.
Self-transitions are important and have to be dealt with first. We can do this using the Kleene star:
A = π±A | πΉB | π²C | π΅
B = Ξ΅ | π΅B | πΌC
C = π³*πΌ
Then, we can substitute the equation for C into all of the earlier equations as well:
A = π±A | πΉB | π²π³*πΌ | π΅
B = Ξ΅ | π΅B | πΌπ³*πΌ
Notice how in the equations for both A and B, we have two "static" terms - pure regular expressions with no reference to other states. Let's group those together. Notice how we use the question mark metacharacter from regular expression syntax, which means "this, or the empty string":
A = π±A | πΉB | π²π³*πΌ|π΅
B = π΅B | (πΌπ³*πΌ)?
Now let's look at the equation for B. This has a self-transition too. So let's resolve that now. (We couldn't resolve it before this.)
A = π±A | πΉB | π²π³*πΌ|π΅
B = π΅*(πΌπ³*πΌ)?
Now let's substitute the equation for B into the equation for A:
A = π±A | πΉπ΅*(πΌπ³*πΌ)?|π²π³*πΌ|π΅
Finally, we can resolve that self-transition on A:
A = π±*(πΉπ΅*(πΌπ³*πΌ)?|π²π³*πΌ|π΅)
And we're done! We have a single regular expression for A. This is a strictly regular regular expression. It features no advanced syntax such as backreferences or lookaround assertions (which are not strictly regular), and it is implicitly anchored at the start and end of the input string. This is the correct answer.
Well, it's one of the many possible correct answers. There are usually many ways to represent any particular finite-state machine as a regular expression. You can see above that there are other ways in which we could have chosen to simplify our expressions. We need not have started our substitution with D, though we do have to finish with A.
Furthermore, figuring out which of the many equivalent regular expressions is the smallest is also very computationally difficult.
Generally, regular expressions for finite-state machines can be very complex and difficult to read, even by regular expression standards.
A while back I created a Python package called greenery
which, among other things, is designed to be able to turn finite-state machines into regular expressions. All you have to do is make sure you set up the state machine and all of its transitions correctly, manually, because it can't visually parse and interpret the delightful image up there:
from greenery import fsm, lego plants = fsm.fsm( alphabet = {"π±", "πΉ", "π΅", "π²", "π³", "πΌ"}, states = {"A", "B", "C", "D"}, initial = "A", finals = {"B", "D"}, map = { "A": { "π±": "A", "πΉ": "B", "π²": "C", "π΅": "D" }, "B": { "π΅": "B", "πΌ": "C" }, "C": { "π³": "C", "πΌ": "D" } } ) computed_regex = lego.from_fsm(plants) print(computed_regex)
The output from this program is
π±*((π²|πΉπ΅*πΌ)π³*πΌ|π΅|πΉπ΅*)
Compare with our hand-derived expression:
π±*(πΉπ΅*(πΌπ³*πΌ)?|π²π³*πΌ|π΅)
They look roughly equivalent, although it's not immediately obvious whether they actually are equivalent. Is there a way to be sure? Cunningly, there is! greenery
also provides an API for determining whether two regular expressions are equivalent:
handmade_regex = lego.parse("π±*(πΉπ΅*(πΌπ³*πΌ)?|π²π³*πΌ|π΅)") print(computed_regex.equivalent(handmade_regex))
This outputs:
True
So, there we go!
Alright then, shall we try something nastier?
This is rather more complicated because when state B consumes a π it simultaneously transitions to both states C and D.
In effect, what this does is introduce a new "compound" state where the state machine is in both C and D at once. Let's call this new state C+D. C+D's transitions are simply C and D's transitions combined... but as we follow them, we discover more and more compound states.
This kind of thing can get very complex; theoretically, for a finite-state machine with N states and multitransitions of this kind, there can be up to 2N - 1 possible "compound" states. In this case, 4 states could have meant as many as 15 compound states after that conversion.
We also have to keep in mind that if any substate of a compound state is final, then that compound state is itself final. For example, this finite-state machine accepts the two-character string "π€π", which leaves it in state C+D, which is hence final.
The hard part, then, is actually writing the transitions out correctly.
hearts = fsm.fsm( alphabet = {"π€", "π", "β€οΈ"}, states = {"A", "B", "C", "D", "C+D", "A+D", "A+C", "A+B"}, initial = "A", finals = {"B", "D", "C+D", "A+D", "A+B"}, map = { "A": {"π€": "B", "π": "C"}, "B": {"π": "C+D", "β€οΈ": "B"}, "C": {"π€": "A", "π": "D"}, "D": {"π": "A"}, "C+D": {"π€": "A", "π": "A+D"}, "A+D": {"π€": "B", "π": "A+C"}, "A+C": {"π€": "A+B", "π": "C+D"}, "A+B": {"π€": "B", "π": "C+D", "β€οΈ": "B"}, } ) computed_regex = lego.from_fsm(hearts) print(computed_regex)
This outputs:
(π(π{2}|π€)|π€(β€οΈ|π(π{2}π€?π)*π(ππ€[β€οΈπ€]|π€))*π(π{2}π€?π)*π€)*(π{2}|π€(β€οΈ|π(π{2}π€?π)*π(ππ€[β€οΈπ€]|π€))*(π(π{2}π€?π)*(π(ππ€)?)?)?)
Absolutely ghastly!
And what about by hand? Well, this one looks very nasty to approach by hand, but here is an equally ghastly proposed solution which may or may not have been calculated by hand. Guessing by visual inspection whether these are equivalent is likely fruitless. So let's go to the Python and see...
handmade_regex = lego.parse("(π(π€|ππ)|π€(β€οΈ|π(πππ€?π)*π(π€|ππ€(π€|β€οΈ)))*ππ€)*(ππ|π€(β€οΈ|π(πππ€?π)*π(π€|ππ€(π€|β€οΈ)))*(π(πππ€?π)*(π(ππ€)?)?)?)") print(computed_regex.equivalent(handmade_regex))
This outputs:
False
Which indicates that there is a disagreement. How to find out what the discrepancy could be? Well, first we can compute the symmetric difference of these two regular expressions as a new regular expression (this is how .equivalent
works internally, in fact), and then we can iterate over this new regular expression's strings:
symmetric_difference = computed_regex.symmetric_difference(handmade_regex) string = next(symmetric_difference.strings()) print(string) print(computed_regex.accepts(string)) print(handmade_regex.accepts(string))
This outputs:
π€πππππ€π€ True False
Yes, we can see here that the string "π€πππππ€π€" is accepted by our computed regular expression (and by the original state machine), but, if we look closely, not by the handmade one.
In conclusion, I find all of this quite delightful and entertaining, and I hope you do too.
Discussion (8)
2020-12-06 03:49:23 by Dystopianist:
2020-12-06 13:04:01 by Shieldfoss:
2020-12-06 15:37:39 by Emmett Brown:
2020-12-07 02:27:46 by Maria Szegedy:
2020-12-07 08:29:39 by chridd:
2021-02-13 06:28:39 by Bakkot:
2021-03-18 08:45:55 by Rawling:
2021-05-23 05:06:42 by kaaro: