2011-06-28 by
qntm

A *parser* is a function which takes a *string* (a series of *tokens*) as input, and returns a set of *matches* as output. In the most basic form, each match is simply an index, an integer marking a position further on in the string, but it could also be something more complicated, like a full *parse tree*.

A *combinator* is a higher-order function (a "functional") which takes zero or more functions (each of the same type) as input and returns a new function of the same type as output.

A *parser combinator* is a higher-order function which takes parsers as input and returns a new parser as output.

Suppose we had an extremely simple parser called `A`, designed to recognise the single character `a`. See what happens when we provide `A` with various strings as input:

A(ε) = {}

A(a) = {1}

A(ab) = {1}

A(acegi) = {1}

A(aaaa) = {1}

A(xa) = {}

(We assume that the indices of characters in the string begin at 0. For example, the string `xyz` has 3 characters. The character at index 0 is `x`, the character at index 1 is `y` and the character at index 2 is `z`. Index 3 is the end of the string, but there is no character there.)

Notice:

`A`always returns a set of some kind, no matter what string is passed to it.- If
`A`can't find an`a`at the beginning of the string passed to it, then it returns an empty set. - If
`A`*can*can find an`a`at the beginning of the string passed to it, then it "consumes" that`a`and includes the index 1 in the set returned. This indicates that if we want to parse further characters, we should start at index 1. `A`doesn't care if there's an`a`elsewhere in the string. It only cares about the first character.`A`doesn't look for more`a`s after finding the first one. It only cares about the first character.

Because a parser returns a set, that means it can return more than one index at once. Suppose we have another parser, `A`_{2}, which is able to recognise both a single `a` *and* a double `aa`. Now see what happens:

A_{2}(ε) = {}

A_{2}(a) = {1}

A_{2}(aa) = {1, 2}

A_{2}(aaa) = {1, 2}

A_{2}(aaaa) = {1, 2}

Because `A`_{2} can recognise two things, it could return 1 or 2. And because it can recognise both of those things simultaneously, it returns 1 *and* 2 when that happens. The basic lesson here is that parsers can return multiple results, and sometimes the set of results returned can be very large, and sometimes only a few of those results are of any interest.

Okay, this is getting patronising, so let's up the pace.

After parsing that first `a`, there's a good chance that we want to parse a few more characters. It would be possible to take our existing string, such as `acegi`, and the result from `A`, which is {1}, and truncate everything in the string up to index 1, leaving `cegi`. Then we could pass `cegi` to a different parser. In fact we could also modify our parsers to return the truncated remainder of the string instead of just integer indices. (Some sources actually describe parsers and parser combinators in exactly these terms.)

But breaking up strings is time-consuming, and can be impractical when the strings are very long, and it's very useful to be able to remember where we were in the original string. So, a "proper" parser can be provided with both a string and an index in the string where parsing should begin. The parser then returns the indices where parsing ended, suitable for passing on to another parser. Let's let `B` be our parser for the letter `b`:

A(a, 0) = {1}

A(a, 1) = {}

A(ab, 0) = {1}

B(ab, 1) = {2}

A_{2}(xab, 1) = {2}

A_{2}(xaab, 1) = {2, 3}

B(xaab, 2) = {}

B(xaab, 3) = {4}

Notice a few more things.

- Though a parser returns a set of indices as output, it only accepts a single index as input. When taking the results from
`A`and handing them to`B`, it is necessary to pass one index at a time! - Parsers don't care which characters (
`x`in this case) were skipped over to get to the current index. - Parsers don't care if there are more characters left at the end of the string. That's for us to take care of.

The *empty parser*, let's call it `E`, is the one which is only capable of recognising the empty string, ε. This parser always succeeds and it always returns a set containing just one index, the index where it started:

E(abcdefghijklmno, 7) = {7}

E(a, 0) = {0}

E(a, 1) = {1}

E(ε, 0) = {0}

This is actually quite useful!

The *null parser*, which we can call `N`, is one which always returns an empty set of results, no matter what the input string is:

N(abcdefghijklmno, 7) = {}

N(a, 0) = {}

N(a, 1) = {}

N(ε, 0) = {}

This is less useful.

A parser combinator is an operator which takes parsers as input and returns a new parser as output.

The *alternation* of two parsers combines their results. For example, `X`|`Y` takes the results from `X` and from `Y` and combines them into a single set for returning. Suppose `X` can recognise the word `boo` and `Y` recognises the word `boondoggle`:

X(boo, 0) = {3}

Y(boo, 0) = {}

X|Y(boo, 0) = {3}

X(boondoggle, 0) = {3}

Y(boondoggle, 0) = {10}

X|Y(boondoggle, 0) = {3, 10}

We can also take the *concatenation* of two parsers. This feeds the results from the first parser into the input of the second, and collects all the results at the end. Remember `A` recognises `a` and `B` recognises `b`, so the concatenation `A``B` should be able to recognise `ab`:

A(ab, 0) = {1}

B(ab, 1) = {2}

AB(ab, 0) = {2}

A_{2}A_{2}(aa, 0) = {2}

A_{2}A_{2}(aaa, 0) = {2, 3}

A_{2}A_{2}(aaaa, 0) = {2, 3, 4}

Concatenating `E` into other parsers is obviously pointless:

EA(a, 0) =AE(a, 0) =A(a, 0) = {1}

Notice how `A`_{2} is exactly `A`|`A``A` using these rules.

Kleene star closure applies the same combinator zero or more times in a row, up to an unlimited upper bound. Let's define a combinator `K` as follows:

K

=E|A|AA|AAA|...

=A*

And watch carefully:

K(ε, 0) = {0}

K(a, 0) = {0, 1}

K(aa, 0) = {0, 1, 2}

K(aaa, 0) = {0, 1, 2, 3}

K(aaabcdefg, 0) = {0, 1, 2, 3}

Notice how the result from `A` matching an `a` at each index are passed back into `K` at each step, creating more results.

By combining parsers together using combinators, we can form quite large and complex and very advanced parsers. Suppose that `Z` can recognise any letter of the alphabet, and `S` can recognise spaces. Then `Z``Z`* can recognise a word of one or more letters, and `Z``Z`*`S``Z``Z`* can recognise two words separated by a space, and `Z``Z`*(`S``Z``Z`*)* can recognise an unlimited sequence of words separated by spaces...

Here are the magic words. **These three elementary parser combinators can be used to parse regular languages**.

- Every parser implicitly defines a set of strings,
*the strings which that parser can parse completely*. - For the empty parser,
`E`, that set is a singleton containing only the empty string, {ε}. - For static, single-character parsers like
`A`and`B`, the set is a singleton, {`a`} and {`b`} respectively. - Those sets also happen to be regular languages (because every finite set of strings is a regular language).
- Alternating parsers unify these sets. E.g. {
`a`} and {`b`} alternated make {`a`,`b`}. - Concatenating parsers "append" the sets to one another. E.g. {
`a`,`b`,`c`} and {`d`,`e`} concatenated make {`ad`,`ae`,`bd`,`be`,`cd`,`de`}. - The "starifying" operation ("Kleene star closure") is okay too. E.g. closure of {
`a`} is {ε,`a`,`aa`,`aaa`, ...} **All of these rules correspond with the rules for constructing regular grammars.**If`X`and`Y`define regular grammars, then so do`X`|`Y`,`X``Y`and`X`* and`Y`*.

So far we have only seen parsers which return sets of indices. Sets of indices are useful. For example, if the string we put in has 65 characters, and the index 65 is in our result set, then our parser successfully parsed the whole string, and that's just dandy. That kind of parser is called a *recogniser* because it can recognise a "valid" string. And that is all it can do.

But we might also want more information, like exactly *how* the string was parsed. Which parsers in the combination recognised which parts of the string? Can we get a parse tree of some kind?

Easily. A parser can return the substring that it matched in addition to the index at which the match was completed. The simplest approach is to return the substring unchanged:

A(ε) = {}

A(ab, 0) = {(1,a)}

B(ab, 1) = {(2,b)}

AB(ab, 0) = {(2,ab)}

E(abcdefghijklmno, 7) = {(7, ε)}

X|Y(boondoggle, 0) = {(3,boo), (10,boondoggle)}

A_{2}A_{2}(aa, 0) = {(2,aa)}

A_{2}A_{2}(aaa, 0) = {(2,aa), (3,aaa)}

A_{2}A_{2}(aaaa, 0) = {(2,aa), (3,aaa), (4,aaaa)}

But this doesn't tell us anything we don't already know. We must be cleverer. This is achieved by modifying the parser combinator so that it returns a cleverer parser as its output, so that the parser itself returns a cleverer result as *its* output. For example, the breakdown of which parts of the string were parsed by which parser:

A(ε) = {}

A(ab, 0) = {(1, (a))}

B(ab, 1) = {(2, (b))}

AB(ab, 0) = {(2, (a,b))}

E(abcdefghijklmno, 7) = {(7, (ε))}

A_{2}A_{2}(aa, 0) = {(2, (a,a))}

A_{2}A_{2}(aaa, 0) = {(2, (a,a)), (3, (a,aa)), (3, (aa,a))}

A_{2}A_{2}(aaaa, 0) = {(2, (a,a)), (3, (a,aa)), (3, (aa,a)), (4, (aa,aa))}

Wait. Did you see what happened in those last two examples? See how the result sets contain both (3, (`a`, `aa`)) and (3, (`aa`, `a`))? That's because there's two possible ways for the parser `A`_{2}`A`_{2} to parse the string `aaa`! Until we started listing the breakdowns in our results, there was no way to see this. The index 3 can't appear twice in the result set, because a set cannot contain duplicates. But (3, (`a`, `aa`)) and (3, (`aa`, `a`)) are clearly different breakdowns, so they appear separately.

There's no way to get the parse tree we want without listing these breakdowns explicitly. And there's no way to list these breakdowns without risking this duplication. Multiple possible parse trees are a natural artifact of some regular grammars. This is called *ambiguity*. Some regular grammars are ambiguous, and some are unambiguous. And ambiguous regular grammars can result in very, very large numbers of alternative parse trees:

A*A*A*(aaa) = { (0, (ε, ε, ε)), (1, (ε, ε,a)), (1, (ε,a, ε)), (1, (a, ε, ε)), (2, (ε, ε,aa)), (2, (ε,a,a)), (2, (a, ε,a)), (2, (ε,aa, ε)), (2, (a,a, ε)), (2, (aa, ε, ε)), (3, (ε, ε,aaa)), (3, (ε,a,aa)), (3, (a, ε,aa)), (3, (ε,aa,a)), (3, (a,a,a)), (3, (aa, ε,a)), (3, (ε,aaa, ε)), (3, (a,aa, ε)), (3, (aa,a, ε)), (3, (aaa, ε, ε)) }

That's ten parse trees for `aaa`, and that's not counting the intermediate parse trees from which the final ten arose, which also get returned. It can be a problem!

There are some workarounds. One is to be *lazy*. Instead of evaluating every possible result from every possible combinator, we can program our parser combinators to only compute and return one result (if there are any) at a time. If it turns out that the remaining results are not needed, that constitutes saved time and parse trees which never get generated. If the result is bad, we can backtrack and try other results (if there are any) in turn until all possibilities are exhausted. But I'm getting into the technicalities of recursive descent parser implementation when really I just wanted to cover the basics of combinators themselves, which I've done now.

## Discussion (12)

## 2011-06-29 08:12:25 by Turgid:

## 2011-06-29 09:29:44 by qntm:

## 2011-06-30 01:05:02 by rkaup:

## 2011-07-27 23:09:59 by OvermindDL:

## 2011-07-27 23:12:30 by qntm:

## 2011-09-25 00:50:37 by qntm:

## 2013-05-31 15:12:18 by mindplay:

## 2013-06-01 10:55:09 by qntm:

## 2013-08-19 07:48:27 by kbackingfield:

## 2013-09-20 07:03:23 by fantispug:

## 2015-10-15 21:28:17 by Daniel H:

## 2018-12-26 21:18:34 by Steve Armstrong: