Parser combinators explained

A parser is a function which takes a string (a series of tokens) as input, and returns a set of matches as output. In the most basic form, each match is simply an index, an integer marking a position further on in the string, but it could also be something more complicated, like a full parse tree.

A combinator is a higher-order function (a "functional") which takes zero or more functions (each of the same type) as input and returns a new function of the same type as output.

A parser combinator is a higher-order function which takes parsers as input and returns a new parser as output.

Simple example parsers

Suppose we had an extremely simple parser called A, designed to recognise the single character a. See what happens when we provide A with various strings as input:

A(ε) = {}
A(a) = {1}
A(ab) = {1}
A(acegi) = {1}
A(aaaa) = {1}
A(xa) = {}

(We assume that the indices of characters in the string begin at 0. For example, the string xyz has 3 characters. The character at index 0 is x, the character at index 1 is y and the character at index 2 is z. Index 3 is the end of the string, but there is no character there.)

Notice:

  • A always returns a set of some kind, no matter what string is passed to it.
  • If A can't find an a at the beginning of the string passed to it, then it returns an empty set.
  • If A can can find an a at the beginning of the string passed to it, then it "consumes" that a and includes the index 1 in the set returned. This indicates that if we want to parse further characters, we should start at index 1.
  • A doesn't care if there's an a elsewhere in the string. It only cares about the first character.
  • A doesn't look for more as after finding the first one. It only cares about the first character.

Multiple results

Because a parser returns a set, that means it can return more than one index at once. Suppose we have another parser, A2, which is able to recognise both a single a and a double aa. Now see what happens:

A2(ε) = {}
A2(a) = {1}
A2(aa) = {1, 2}
A2(aaa) = {1, 2}
A2(aaaa) = {1, 2}

Because A2 can recognise two things, it could return 1 or 2. And because it can recognise both of those things simultaneously, it returns 1 and 2 when that happens. The basic lesson here is that parsers can return multiple results, and sometimes the set of results returned can be very large, and sometimes only a few of those results are of any interest.

Okay, this is getting patronising, so let's up the pace.

Offsets

After parsing that first a, there's a good chance that we want to parse a few more characters. It would be possible to take our existing string, such as acegi, and the result from A, which is {1}, and truncate everything in the string up to index 1, leaving cegi. Then we could pass cegi to a different parser. In fact we could also modify our parsers to return the truncated remainder of the string instead of just integer indices. (Some sources actually describe parsers and parser combinators in exactly these terms.)

But breaking up strings is time-consuming, and can be impractical when the strings are very long, and it's very useful to be able to remember where we were in the original string. So, a "proper" parser can be provided with both a string and an index in the string where parsing should begin. The parser then returns the indices where parsing ended, suitable for passing on to another parser. Let's let B be our parser for the letter b:

A(a, 0) = {1}
A(a, 1) = {}
A(ab, 0) = {1}
B(ab, 1) = {2}
A2(xab, 1) = {2}
A2(xaab, 1) = {2, 3}
B(xaab, 2) = {}
B(xaab, 3) = {4}

Notice a few more things.

  • Though a parser returns a set of indices as output, it only accepts a single index as input. When taking the results from A and handing them to B, it is necessary to pass one index at a time!
  • Parsers don't care which characters (x in this case) were skipped over to get to the current index.
  • Parsers don't care if there are more characters left at the end of the string. That's for us to take care of.

The empty parser, let's call it E, is the one which is only capable of recognising the empty string, ε. This parser always succeeds and it always returns a set containing just one index, the index where it started:

E(abcdefghijklmno, 7) = {7}
E(a, 0) = {0}
E(a, 1) = {1}
E(ε, 0) = {0}

This is actually quite useful!

The null parser, which we can call N, is one which always returns an empty set of results, no matter what the input string is:

N(abcdefghijklmno, 7) = {}
N(a, 0) = {}
N(a, 1) = {}
N(ε, 0) = {}

This is less useful.

Now some actual parser combinators

A parser combinator is an operator which takes parsers as input and returns a new parser as output.

Alternation

The alternation of two parsers combines their results. For example, X|Y takes the results from X and from Y and combines them into a single set for returning. Suppose X can recognise the word boo and Y recognises the word boondoggle:

X(boo, 0) = {3}
Y(boo, 0) = {}
X|Y(boo, 0) = {3}
X(boondoggle, 0) = {3}
Y(boondoggle, 0) = {10}
X|Y(boondoggle, 0) = {3, 10}

Concatenation

We can also take the concatenation of two parsers. This feeds the results from the first parser into the input of the second, and collects all the results at the end. Remember A recognises a and B recognises b, so the concatenation AB should be able to recognise ab:

A(ab, 0) = {1}
B(ab, 1) = {2}
AB(ab, 0) = {2}
A2A2(aa, 0) = {2}
A2A2(aaa, 0) = {2, 3}
A2A2(aaaa, 0) = {2, 3, 4}

Concatenating E into other parsers is obviously pointless:

EA(a, 0) = AE(a, 0) = A(a, 0) = {1}

Notice how A2 is exactly A|AA using these rules.

Kleene star closure

Kleene star closure applies the same combinator zero or more times in a row, up to an unlimited upper bound. Let's define a combinator K as follows:

K
= E|A|AA|AAA|...
= A*

And watch carefully:

K(ε, 0) = {0}
K(a, 0) = {0, 1}
K(aa, 0) = {0, 1, 2}
K(aaa, 0) = {0, 1, 2, 3}
K(aaabcdefg, 0) = {0, 1, 2, 3}

Notice how the result from A matching an a at each index are passed back into K at each step, creating more results.

By combining parsers together using combinators, we can form quite large and complex and very advanced parsers. Suppose that Z can recognise any letter of the alphabet, and S can recognise spaces. Then ZZ* can recognise a word of one or more letters, and ZZ*SZZ* can recognise two words separated by a space, and ZZ*(SZZ*)* can recognise an unlimited sequence of words separated by spaces...

So what is the use?

Here are the magic words. These three elementary parser combinators can be used to parse regular languages.

  • Every parser implicitly defines a set of strings, the strings which that parser can parse completely.
  • For the empty parser, E, that set is a singleton containing only the empty string, {ε}.
  • For static, single-character parsers like A and B, the set is a singleton, {a} and {b} respectively.
  • Those sets also happen to be regular languages (because every finite set of strings is a regular language).
  • Alternating parsers unify these sets. E.g. {a} and {b} alternated make {a, b}.
  • Concatenating parsers "append" the sets to one another. E.g. {a, b, c} and {d, e} concatenated make {ad, ae, bd, be, cd, de}.
  • The "starifying" operation ("Kleene star closure") is okay too. E.g. closure of {a} is {ε, a, aa, aaa, ...}
  • All of these rules correspond with the rules for constructing regular grammars. If X and Y define regular grammars, then so do X|Y, XY and X* and Y*.

Returning more than just indices

So far we have only seen parsers which return sets of indices. Sets of indices are useful. For example, if the string we put in has 65 characters, and the index 65 is in our result set, then our parser successfully parsed the whole string, and that's just dandy. That kind of parser is called a recogniser because it can recognise a "valid" string. And that is all it can do.

But we might also want more information, like exactly how the string was parsed. Which parsers in the combination recognised which parts of the string? Can we get a parse tree of some kind?

Easily. A parser can return the substring that it matched in addition to the index at which the match was completed. The simplest approach is to return the substring unchanged:

A(ε) = {}
A(ab, 0) = {(1, a)}
B(ab, 1) = {(2, b)}
AB(ab, 0) = {(2, ab)}
E(abcdefghijklmno, 7) = {(7, ε)}
X|Y(boondoggle, 0) = {(3, boo), (10, boondoggle)}
A2A2(aa, 0) = {(2, aa)}
A2A2(aaa, 0) = {(2, aa), (3, aaa)}
A2A2(aaaa, 0) = {(2, aa), (3, aaa), (4, aaaa)}

But this doesn't tell us anything we don't already know. We must be cleverer. This is achieved by modifying the parser combinator so that it returns a cleverer parser as its output, so that the parser itself returns a cleverer result as its output. For example, the breakdown of which parts of the string were parsed by which parser:

A(ε) = {}
A(ab, 0) = {(1, (a))}
B(ab, 1) = {(2, (b))}
AB(ab, 0) = {(2, (a, b))}
E(abcdefghijklmno, 7) = {(7, (ε))}
A2A2(aa, 0) = {(2, (a, a))}
A2A2(aaa, 0) = {(2, (a, a)), (3, (a, aa)), (3, (aa, a))}
A2A2(aaaa, 0) = {(2, (a, a)), (3, (a, aa)), (3, (aa, a)), (4, (aa, aa))}

Wait. Did you see what happened in those last two examples? See how the result sets contain both (3, (a, aa)) and (3, (aa, a))? That's because there's two possible ways for the parser A2A2 to parse the string aaa! Until we started listing the breakdowns in our results, there was no way to see this. The index 3 can't appear twice in the result set, because a set cannot contain duplicates. But (3, (a, aa)) and (3, (aa, a)) are clearly different breakdowns, so they appear separately.

There's no way to get the parse tree we want without listing these breakdowns explicitly. And there's no way to list these breakdowns without risking this duplication. Multiple possible parse trees are a natural artifact of some regular grammars. This is called ambiguity. Some regular grammars are ambiguous, and some are unambiguous. And ambiguous regular grammars can result in very, very large numbers of alternative parse trees:

A*A*A*(aaa) = { (0, (ε, ε, ε)), (1, (ε, ε, a)), (1, (ε, a, ε)), (1, (a, ε, ε)), (2, (ε, ε, aa)), (2, (ε, a, a)), (2, (a, ε, a)), (2, (ε, aa, ε)), (2, (a, a, ε)), (2, (aa, ε, ε)), (3, (ε, ε, aaa)), (3, (ε, a, aa)), (3, (a, ε, aa)), (3, (ε, aa, a)), (3, (a, a, a)), (3, (aa, ε, a)), (3, (ε, aaa, ε)), (3, (a, aa, ε)), (3, (aa, a, ε)), (3, (aaa, ε, ε)) }

That's ten parse trees for aaa, and that's not counting the intermediate parse trees from which the final ten arose, which also get returned. It can be a problem!

There are some workarounds. One is to be lazy. Instead of evaluating every possible result from every possible combinator, we can program our parser combinators to only compute and return one result (if there are any) at a time. If it turns out that the remaining results are not needed, that constitutes saved time and parse trees which never get generated. If the result is bad, we can backtrack and try other results (if there are any) in turn until all possibilities are exhausted. But I'm getting into the technicalities of recursive descent parser implementation when really I just wanted to cover the basics of combinators themselves, which I've done now.

Discussion (12)

2011-06-29 08:12:25 by Turgid:

I'm not much of a programmer or mathematician, but I read these and glean what I can. Do you write these pages as an aid to your own learning, or is there an audience out there you're trying to reach? And where do you learn about parsing anyway? By the way, I think you're missing a word here: "Wait. Did you what happened..."

2011-06-29 09:29:44 by qntm:

I'm writing these because Wikipedia's article on the subject is unsatisfactory and because I've been doing some work on parser combinators which I'm gradually leading into.

2011-06-30 01:05:02 by rkaup:

I'm loving the articles on parsers. Do you have much experience with Haskell? I was looking at the Parsec library in Haskell the other day and this looks very similar.

2011-07-27 23:09:59 by OvermindDL:

Have you looked at PEGs? Parsing Expression Grammar? There is a wikipedia article, and some currently existing libraries such as for C++ (Boost.Spirit), Erlang (Uh, something..), etc... PEGs are unambiguous and very powerful.

2011-07-27 23:12:30 by qntm:

I have, but I'm currently at a "oh please not more parsing" stage in my life.

2011-09-25 00:50:37 by qntm:

I've discovered that "A parser combinator accepts a string as input and returns a set of matches as output", currently the leading sentence of this article, is a complete lie. Will fix sometime

2013-05-31 15:12:18 by mindplay:

I struggled to understand your examples, until I found the following definition in the Wikipedia article about Parser Combinators: "a parser (combinator) is a function accepting strings as input and returning some structure as output, typically [...] a set of indices representing locations in the string where parsing stopped successfully" - figured I would post this for others, as it was crucial to my own understanding of these examples.

2013-06-01 10:55:09 by qntm:

Actually, you added the word "(combinators)" into that sentence, which turns it false. The first three sentences of this article explain the difference between a parser and a combinator and a parser combinator very clearly.

2013-08-19 07:48:27 by kbackingfield:

nice article. please extend with error messages / error handling

2013-09-20 07:03:23 by fantispug:

What is described in this article isn't how to parse Context Free Grammars, but Regular Expressions; the operations given construct precisely the Regular Expressions [the mathematical form, not the sort used in Perl and other programming languages] which is a proper subset of Context Free Grammars (for instance there is no way of expressing L={a^n b^n | n a non-negative integer} as a regular language, but it corresponds to the CFG L <- a L b | empty-string). What is done in this article is closer to "lexing" (in compiler parlance) than "parsing" (though it is a form of parsing).

2015-10-15 21:28:17 by Daniel H:

This does work for context-free languages as long as you allow recursion. Then you could define L = ALC. Without recursion (or various similar tricks), fantispug is correct that the given combinators only allow the parsing of regular languages.

2018-12-26 21:18:34 by Steve Armstrong:

@Daniel H You have to be careful adding recursion or your language quickly becomes RE.

New comment by :

Plain text only. Line breaks become <br/>
The square root of minus one: