A parser is a function which takes a string (a series of tokens) as input, and returns a set of matches as output. In the most basic form, each match is simply an index, an integer marking a position further on in the string, but it could also be something more complicated, like a full parse tree.
A combinator is a higher-order function (a "functional") which takes zero or more functions (each of the same type) as input and returns a new function of the same type as output.
A parser combinator is a higher-order function which takes parsers as input and returns a new parser as output.
Suppose we had an extremely simple parser called A, designed to recognise the single character a. See what happens when we provide A with various strings as input:
A(ε) = {}
A(a) = {1}
A(ab) = {1}
A(acegi) = {1}
A(aaaa) = {1}
A(xa) = {}
(We assume that the indices of characters in the string begin at 0. For example, the string xyz has 3 characters. The character at index 0 is x, the character at index 1 is y and the character at index 2 is z. Index 3 is the end of the string, but there is no character there.)
Notice:
Because a parser returns a set, that means it can return more than one index at once. Suppose we have another parser, A2, which is able to recognise both a single a and a double aa. Now see what happens:
A2(ε) = {}
A2(a) = {1}
A2(aa) = {1, 2}
A2(aaa) = {1, 2}
A2(aaaa) = {1, 2}
Because A2 can recognise two things, it could return 1 or 2. And because it can recognise both of those things simultaneously, it returns 1 and 2 when that happens. The basic lesson here is that parsers can return multiple results, and sometimes the set of results returned can be very large, and sometimes only a few of those results are of any interest.
Okay, this is getting patronising, so let's up the pace.
After parsing that first a, there's a good chance that we want to parse a few more characters. It would be possible to take our existing string, such as acegi, and the result from A, which is {1}, and truncate everything in the string up to index 1, leaving cegi. Then we could pass cegi to a different parser. In fact we could also modify our parsers to return the truncated remainder of the string instead of just integer indices. (Some sources actually describe parsers and parser combinators in exactly these terms.)
But breaking up strings is time-consuming, and can be impractical when the strings are very long, and it's very useful to be able to remember where we were in the original string. So, a "proper" parser can be provided with both a string and an index in the string where parsing should begin. The parser then returns the indices where parsing ended, suitable for passing on to another parser. Let's let B be our parser for the letter b:
A(a, 0) = {1}
A(a, 1) = {}
A(ab, 0) = {1}
B(ab, 1) = {2}
A2(xab, 1) = {2}
A2(xaab, 1) = {2, 3}
B(xaab, 2) = {}
B(xaab, 3) = {4}
Notice a few more things.
The empty parser, let's call it E, is the one which is only capable of recognising the empty string, ε. This parser always succeeds and it always returns a set containing just one index, the index where it started:
E(abcdefghijklmno, 7) = {7}
E(a, 0) = {0}
E(a, 1) = {1}
E(ε, 0) = {0}
This is actually quite useful!
The null parser, which we can call N, is one which always returns an empty set of results, no matter what the input string is:
N(abcdefghijklmno, 7) = {}
N(a, 0) = {}
N(a, 1) = {}
N(ε, 0) = {}
This is less useful.
A parser combinator is an operator which takes parsers as input and returns a new parser as output.
The alternation of two parsers combines their results. For example, X|Y takes the results from X and from Y and combines them into a single set for returning. Suppose X can recognise the word boo and Y recognises the word boondoggle:
X(boo, 0) = {3}
Y(boo, 0) = {}
X|Y(boo, 0) = {3}
X(boondoggle, 0) = {3}
Y(boondoggle, 0) = {10}
X|Y(boondoggle, 0) = {3, 10}
We can also take the concatenation of two parsers. This feeds the results from the first parser into the input of the second, and collects all the results at the end. Remember A recognises a and B recognises b, so the concatenation AB should be able to recognise ab:
A(ab, 0) = {1}
B(ab, 1) = {2}
AB(ab, 0) = {2}
A2A2(aa, 0) = {2}
A2A2(aaa, 0) = {2, 3}
A2A2(aaaa, 0) = {2, 3, 4}
Concatenating E into other parsers is obviously pointless:
EA(a, 0) = AE(a, 0) = A(a, 0) = {1}
Notice how A2 is exactly A|AA using these rules.
Kleene star closure applies the same combinator zero or more times in a row, up to an unlimited upper bound. Let's define a combinator K as follows:
K
= E|A|AA|AAA|...
= A*
And watch carefully:
K(ε, 0) = {0}
K(a, 0) = {0, 1}
K(aa, 0) = {0, 1, 2}
K(aaa, 0) = {0, 1, 2, 3}
K(aaabcdefg, 0) = {0, 1, 2, 3}
Notice how the result from A matching an a at each index are passed back into K at each step, creating more results.
By combining parsers together using combinators, we can form quite large and complex and very advanced parsers. Suppose that Z can recognise any letter of the alphabet, and S can recognise spaces. Then ZZ* can recognise a word of one or more letters, and ZZ*SZZ* can recognise two words separated by a space, and ZZ*(SZZ*)* can recognise an unlimited sequence of words separated by spaces...
Here are the magic words. These three elementary parser combinators can be used to parse regular languages.
So far we have only seen parsers which return sets of indices. Sets of indices are useful. For example, if the string we put in has 65 characters, and the index 65 is in our result set, then our parser successfully parsed the whole string, and that's just dandy. That kind of parser is called a recogniser because it can recognise a "valid" string. And that is all it can do.
But we might also want more information, like exactly how the string was parsed. Which parsers in the combination recognised which parts of the string? Can we get a parse tree of some kind?
Easily. A parser can return the substring that it matched in addition to the index at which the match was completed. The simplest approach is to return the substring unchanged:
A(ε) = {}
A(ab, 0) = {(1, a)}
B(ab, 1) = {(2, b)}
AB(ab, 0) = {(2, ab)}
E(abcdefghijklmno, 7) = {(7, ε)}
X|Y(boondoggle, 0) = {(3, boo), (10, boondoggle)}
A2A2(aa, 0) = {(2, aa)}
A2A2(aaa, 0) = {(2, aa), (3, aaa)}
A2A2(aaaa, 0) = {(2, aa), (3, aaa), (4, aaaa)}
But this doesn't tell us anything we don't already know. We must be cleverer. This is achieved by modifying the parser combinator so that it returns a cleverer parser as its output, so that the parser itself returns a cleverer result as its output. For example, the breakdown of which parts of the string were parsed by which parser:
A(ε) = {}
A(ab, 0) = {(1, (a))}
B(ab, 1) = {(2, (b))}
AB(ab, 0) = {(2, (a, b))}
E(abcdefghijklmno, 7) = {(7, (ε))}
A2A2(aa, 0) = {(2, (a, a))}
A2A2(aaa, 0) = {(2, (a, a)), (3, (a, aa)), (3, (aa, a))}
A2A2(aaaa, 0) = {(2, (a, a)), (3, (a, aa)), (3, (aa, a)), (4, (aa, aa))}
Wait. Did you see what happened in those last two examples? See how the result sets contain both (3, (a, aa)) and (3, (aa, a))? That's because there's two possible ways for the parser A2A2 to parse the string aaa! Until we started listing the breakdowns in our results, there was no way to see this. The index 3 can't appear twice in the result set, because a set cannot contain duplicates. But (3, (a, aa)) and (3, (aa, a)) are clearly different breakdowns, so they appear separately.
There's no way to get the parse tree we want without listing these breakdowns explicitly. And there's no way to list these breakdowns without risking this duplication. Multiple possible parse trees are a natural artifact of some regular grammars. This is called ambiguity. Some regular grammars are ambiguous, and some are unambiguous. And ambiguous regular grammars can result in very, very large numbers of alternative parse trees:
A*A*A*(aaa) = { (0, (ε, ε, ε)), (1, (ε, ε, a)), (1, (ε, a, ε)), (1, (a, ε, ε)), (2, (ε, ε, aa)), (2, (ε, a, a)), (2, (a, ε, a)), (2, (ε, aa, ε)), (2, (a, a, ε)), (2, (aa, ε, ε)), (3, (ε, ε, aaa)), (3, (ε, a, aa)), (3, (a, ε, aa)), (3, (ε, aa, a)), (3, (a, a, a)), (3, (aa, ε, a)), (3, (ε, aaa, ε)), (3, (a, aa, ε)), (3, (aa, a, ε)), (3, (aaa, ε, ε)) }
That's ten parse trees for aaa, and that's not counting the intermediate parse trees from which the final ten arose, which also get returned. It can be a problem!
There are some workarounds. One is to be lazy. Instead of evaluating every possible result from every possible combinator, we can program our parser combinators to only compute and return one result (if there are any) at a time. If it turns out that the remaining results are not needed, that constitutes saved time and parse trees which never get generated. If the result is bad, we can backtrack and try other results (if there are any) in turn until all possibilities are exhausted. But I'm getting into the technicalities of recursive descent parser implementation when really I just wanted to cover the basics of combinators themselves, which I've done now.
Discussion (12)
2011-06-29 08:12:25 by Turgid:
2011-06-29 09:29:44 by qntm:
2011-06-30 01:05:02 by rkaup:
2011-07-27 23:09:59 by OvermindDL:
2011-07-27 23:12:30 by qntm:
2011-09-25 00:50:37 by qntm:
2013-05-31 15:12:18 by mindplay:
2013-06-01 10:55:09 by qntm:
2013-08-19 07:48:27 by kbackingfield:
2013-09-20 07:03:23 by fantispug:
2015-10-15 21:28:17 by Daniel H:
2018-12-26 21:18:34 by Steve Armstrong: