Loco: parser combinators for PHP

Loco is a library of parser combinators for PHP.

Loco uses single-valued parsers called MonoParsers. A conventional, "enthusiastic" parser returns a set of possible results (which is empty if parsing is not possible). A "lazy" parser returns one possible result on the first call, and then returns further results with each subsequent call until no more are possible. MonoParsers simply return a single result or failure. By explicitly disallowing backtracking, this keeps parsing time linear (unless you use advanced, non-regular "regular expressions", but that's crazy talk).

Examples

More examples

For examples of using Loco to recognise grammar syntax notations such as Backus-Naur Form, see More Loco examples.

Great things about Loco

  • Parse in linear time
  • Simple, logical, coherent syntax
  • Compact syntax
  • No explicit lexing step, because parser combinators can do everything really
  • Detects infinite loops (e.g. (|a)*) at grammar creation time
  • Detects left-recursion (e.g. A -> Aa) at grammar creation time

How to use Loco

One of the problems of using parser combinators to build top-down, recursive descent parsers and using those parsers to parse stuff is ambiguity. Naïve combinatory parsing requires exponential time and space when parsing an ambiguous context-free grammar. This is because of the backtracking which is required when multiple forward paths are possible. You can spend an unlimited amount of time investigating what turns out to be a fruitless branch of the space of all possible parse trees, and an unlimited amount of time backtracking and trying more possibilities until (1) you reach the full tree that you need or (2) all possibilities have been exhausted and failure has occurred. This happens even in a "lazy parsing" situation where we don't actually generate all possible parse trees simultaneously, only one at a time as needed.

The simple and obvious solution to this problem is to use unambiguous context-free grammars. Now, there's no algorithm by which it can be determined whether an arbitrary context-free grammar is unambiguous. However, there are imperfect algorithms which at least guarantee no false positives. That is, if they report "unambiguous", then the context-free grammar is definitely unambiguous. (There will be false negatives; the algorithm will sometimes report "ambiguous" for context-free grammars which are unambiguous. But this is unavoidable, as we just noted.) And there are procedures for building context-free grammars which are guaranteed to be unambiguous.

An unambiguous parser is one which returns at most 1 result. Combining unambiguous parsers together in unambiguous ways, we can create full parsers which, again, return at most 1 result, or failure. Because we can guarantee unambiguity we have no need to backtrack. And because we have no need to backtrack, we can get our results linear time.

That's why Loco solely uses unambiguous parsers. Here's a full list of what's available:

Parsers in Loco

MonoParser

Abstract base class from which all parsers inherit. Can't be instantiated. "Mono" means the parser returns one result, or fails.

Parser has one important method, match($string, $i = 0), which either returns the successful match in the form of an array("j" => 9, "value" => "something"), or throws a ParseFailureException.

There is also the more useful method parse($string), which either returns the parsed value "something" or throws a ParseFailureException if the match fails or doesn't occupy the entire length of the supplied string.

EmptyParser

Matches the empty string. Works in constant time.

Callback is passed no arguments. Default callback returns null.

  new EmptyParser();
  // returns null

  new EmptyParser(
    function() { return array(); }
  );
  // return an empty array instead

StringParser

Matches a static string. This works in linear time, O(n) where n is the length of the string.

Callback is passed one argument, the string that was matched. Yes, that's effectively the same function call each time. Default callback returns the first argument i.e. the string.

  new StringParser("name");
  // returns "name"

  new StringParser(
    "name",
    function($string) { return strrev($string); }
  );
  // returns "eman"

RegexParser

Matches a regular expression. Technically, provided the expression in question is truly regular, this will operate in O(n). Of course, people are able to create ridiculous things with regexes nowadays, with ridiculous theoretical running time, and who am I to argue against that?

The regular expression must be anchored at the beginning of the substring supplied to match, using ^. Otherwise, there's no way to stop PHP from matching elsewhere entirely in the expression, which is very bad. Caution: formations like /^a|b/ only anchor the a at the start of the string; a b might be matched anywhere! You should use /^(a|b)/ or /^a|^b/.

Callback is passed one argument for each sub-match. For example, if the regex is /^ab(cd(ef)gh)ij/ then the first argument is the whole match, abcdefghij, the second argument is cdefgh and the third argument is ef. The default callback returns only the first argument, the whole match.

  new RegexParser("/^'([a-zA-Z_][a-zA-Z_0-9]*)'/");
  // returns the full match including the single quotes
  
  new RegexParser(
    "/^'([a-zA-Z_][a-zA-Z_0-9]*)'/",
    function($match0, $match1) { return $match1; }
  );
  // discard the single quotes and returns only the inner string

Utf8Parser

Matches a single UTF-8 character. You can supply a blacklist of characters which will not be matched.

  new Utf8Parser(array("<", ">", "&"));
  // any UTF-8 character except the three listed

Callback is passed one argument, the string that was matched. The default callback returns the first argument i.e. the string.

For best results, alternate (see LazyAltParser below) with StringParsers for e.g. &lt;, &gt;, &amp; and other HTML character entities.

Works in O(n) time.

LazyAltParser

This encapsulates the "alternation" parser combinator by alternating between several internal parsers. The key word here is "lazy". As soon as one of them matches, that result is returned, and that's the end of the story. There is no capability to merge the results from several of the internal parsers, and there is no capability for returning (backtracking) to this parser and trying to retrieve other results if the first one turns out to be bogus.

Since there are finitely many internal parsers and each internal parser operates in at worst O(n) time, LazyAltParser works in at worst O(n) time.

Callback is passed one argument, the sole successful internal match. The default callback returns the first argument directly.

  new LazyAltParser(
    array(
      new StringParser("foo"),
      new StringParser("bar")
    )
  );
  // returns either "foo" or "bar"

ConcParser

This encapsulates the "concatenation" parser combinator by concatenating a finite sequence of internal parsers. If the sequence is empty, this is equivalent to EmptyParser, above. Since there are finitely many internal parsers and each internal parser operates in at worst O(n) time, ConcParser works in at worst O(n) time.

Callback is passed one argument for every internal parser, each argument containing the result from that parser. For example, new ConcParser(array($a, $b, $c), $callback) will pass three arguments to its callback. The first contains the result from parser $a, the second the result from parser $b and the third the result from parser $c. The default callback returns the arguments in the form of an array: return func_get_args();.

  new ConcParser(
    array(
      new RegexParser("/^<([a-zA-Z_][a-zA-Z_0-9]*)>/", function($match0, $match1) { return $match1; }),
      new StringParser(", "),
      new RegexParser("/^<(\d\d\d\d-\d\d-\d\d)>/",     function($match0, $match1) { return $match1; }),
      new StringParser(", "),
      new RegexParser("/^<([A-Z]{2}[0-9]{7})>/",       function($match0, $match1) { return $match1; }),
    ),
    function($name, $comma1, $opendate, $comma2, $ref) { return new Account($accountname, $opendate, $ref); }
  );
  // match something like "<Williams>, <2011-06-30>, <GH7784939>"
  // return new Account("Williams", "2011-06-30", "GH7784939")

GreedyMultiParser

This encapsulates the Kleene star "closure" parser combinator to match single internal parser multiple (finitely or infinitely many) times. With a finite upper bound, this is more or less equivalent to ConcParser, above. With an infinite upper bound, this gets more interesting. GreedyMultiParser, as the name hints, will match as many times as it can before returning. There is no option for returning multiple matches simultaneously; only the largest match is returned. And there is no option for backtracking and trying to consume more or fewer instances.

Callback is passed one argument for every match. For example, new GreedyMultiParser($a, 2, 4, $callback) could pass 2, 3 or 4 arguments to its callback. new GreedyMultiParser($a, 0, null, $callback) has an unlimited upper bound and could pass an unlimited number of arguments to its callback. (PHP seems to have no problem with this.) The default callback returns all of the arguments in the form of an array: return func_get_args();.

Remember that a PHP function can be defined as function(){...} and still accept an arbitrary number of arguments.

  new GreedyMultiParser(
    new LazyAltParser(
      array(
        new Utf8Parser(array("<", ">", "&")),                         // match any UTF-8 character except <, > or &
        new StringParser("&lt;",  function($string) { return "<"; }), // ...or an escaped < (unescape it)
        new StringParser("&gt;",  function($string) { return ">"; }), // ...or an escaped > (unescape it)
        new StringParser("&amp;", function($string) { return "&"; })  // ...or an escaped & (unescape it)
      )
    ),
    0,                                                  // at least 0 times
    null,                                               // at most infinitely many times
    function() { return implode("", func_get_args()); } // concatenate all of the matched characters together
  );
  // matches a continuous string of valid, UTF-8 encoded HTML text
  // returns the unescaped string

Even though GreedyMultiParser performs an unlimited number of matches and its internal parser could match in linear time, GreedyMultiParser is itself not O(n2) but O(n)! This was quite a surprising result when I discovered it, and it's a little difficult to prove rigorously. But the basic explanation is: the time taken for any parser listed in this collection to perform a match is linear not in the length of the whole string but in the length of the match made. GreedyMultiParser performs multiple internal matches, and the time taken for each internal match is proportional to the length of that internal match. So, the time taken for the overall match is proportional to the sum of the lengths of the internal matches, which is the length of the string, which is O(n).

Grammars

All of the above is well and good, but it doesn't complete the picture. Firstly, it makes our parsers quite large and confusing to read when they nest too much. Secondly, it makes recursion very difficult; a parser cannot easily be placed inside itself, for example. Without recursion, all we can parse is regular languages, not context-free languages.

The Grammar class makes this very easy. At its heart, Grammar is just another MonoParser. But Grammar accepts an associative array of parsers as input -- meaning each one comes attached to a name. The parsers inside it, meanwhile, can refer to other parsers by name instead of containing them directly. Grammar resolves these references at instantiation time, as well as detecting anomalies like left-recursion, names which refer to parsers which don't exist, dangerous formations such as new GreedyMultiParser(new EmptyParser(), 0, null), and so on.

Here's a simple Grammar which can recognise (some) valid HTML paragraphs and return the content of those paragraphs:

  $p = new Grammar(
    "paragraph",
    array(
      "paragraph" => new ConcParser(
        array(
          "OPEN_P",
          "CONTENT",
          "CLOSE_P"
        ),
        function($open_p, $content, $close_p) {
          return $content;
        }
      ),

      "OPEN_P" => new StringParser("<p>"),

      "CONTENT" => new GreedyMultiParser(
        "UTF-8 CHAR",
        0,
        null,
        function() { return implode("", func_get_args()); }
      ),

      "CLOSE_P" => new StringParser("</p>"),

      "UTF-8 CHAR" => new LazyAltParser(
        array(
          new Utf8Parser(array("<", ">", "&")),                         // match any UTF-8 character except <, > or &
          new StringParser("&lt;",  function($string) { return "<"; }), // ...or an escaped < (unescape it)
          new StringParser("&gt;",  function($string) { return ">"; }), // ...or an escaped > (unescape it)
          new StringParser("&amp;", function($string) { return "&"; })  // ...or an escaped & (unescape it)
        )
      ),
    )
  );
  
  $p->parse("<p>Your text here &amp; here &amp; &lt;here&gt;</p>");
  // returns "Your text here & here & <here>"

To do

  • This is still just a parsing library, not a parser generator in the strictest sense. Grammar instantiation is too time-consuming to do every time it's needed. Make Loco generate code. - This might be a tall order because PHP doesn't make it easy to retrieve the text content of a callback, but it might be possible.
  • Loco should parse Backus-Naur Form and/or other grammar specification languages. - For this, all I really need is a single good Grammar instance which happens to return another Grammar instance as the output from parse(). This much, I actually already have, and it works (if I say so myself) really brilliantly! The problem is that BNF and other languages don't provide any clear mechanism to insert callbacks at each line, and I don't know what mechanism to provide myself, if any. Should the BNF parser take S ::= "a" "b" and look for a class S and attempt to create a call to new S("a", "b") or new S(array("a", "b")), or should it look for a function S and attempt to create a call to S("a", "b") or S(array("a", "b")), or should it do none of things to avoid confusion and just return array("a", "b")? Plenty of work has been done on grammar specification languages, but adding interpretation for these grammars seems to be up in the air. Any advice would be appreciated. Another question is which BNF I should attempt to support - there are several, and they all have issues... Done!
  • Other suggestions?

Back to Code
Back to Things Of Interest

Facebook Twitter Reddit Email Hacker News StumbleUpon

Discussion (21)

2011-07-02 11:13:49 by PhantomHoover:

Seeing you writing parsers in PHP... it breaks my heart, it really does.

2011-07-06 15:26:41 by Sam:

They've been written in basically every other language, as far as I can make out. I saw a gap.

2011-07-07 11:51:27 by anonymous:

Hi,

i have modified Loco.php, to work in PHP versions prior to 5.3 (they do not support anonymous functions,
nor inline function creation).
I have tested it and it works on PHP 5.2.9, but it _should_ work in all PHP 5.x versions.

http://pastebin.com/Vfj2ZzhS

Amazing work, btw ! :)

2011-07-09 16:06:05 by Sam:

Thanks for the suggested change, anonymous stranger. I didn't realise that anonymous in-line functions were so recently added to PHP nor that it was so easy to work around. Having a defaultCallback() abstract method for every MonoCombinator seems like a great idea to me in any case. A new version of Loco.php will be uploaded momentarily.

2011-07-12 03:07:37 by Jymbob:

Sam: Having uploaded it momentarily, could you leave it there in the longer term?
/grammarpolice

2011-07-12 08:08:21 by anonymous:

Hey, there is another thing that is modified in that Loco.php, that i forgot to mention.
It is the line that determines whether or not to run unit tests:
847:if(str_replace('\\', '/', __FILE__) !== $_SERVER['DOCUMENT_ROOT'] . substr($_SERVER["PHP_SELF"], 1)) {
Did that work for you? Because i tested Loco on a windows environment (Wamp (Apache + mod_php)), and the original
if(__FILE__ !== $_SERVER["PHP_SELF"])
didn't work, because of the backslash and also because $_SERVER["PHP_SELF"] does not contain the full path of the file on the file system, but just from the http server root, whereas __FILE__ does contain the full path.

However it probably does contain the full path to the running script in your environment, so if it won't work, perhaps try
if (str_replace('\\', '/', __FILE__) !== $_SERVER["PHP_SELF"]
AND str_replace('\\', '/', __FILE__) !== $_SERVER['DOCUMENT_ROOT'] . substr($_SERVER["PHP_SELF"], 1))
or
if (str_replace('\\', '/', __FILE__) !== $_SERVER['SCRIPT_FILENAME']) {
which is shorter, but i'm not sure if SCRIPT_FILENAME is supported in all environments..

Btw, there is something wrong with the HTML parser of new comment posting, because i tried to put the code snippets in <em>..</em> containers, with HTML parser set to 'Basic' and it kept saying 'Can't parse text.'.

2011-07-12 10:31:14 by Sam:

That line worked for me but that's because I run the unit tests by going to a command line and running "php C:\...\Loco.php". I'll consider the other change though. Also, this seems to work:

if (str_replace('\\', '/', __FILE__) !== $_SERVER["PHP_SELF"] AND str_replace('\\', '/', __FILE__) !== $_SERVER['DOCUMENT_ROOT'] . substr($_SERVER["PHP_SELF"], 1))

2011-07-12 10:32:03 by Sam:

Maybe your <em></em> wasn't wrapped inside a <p></p>?

2011-07-13 12:21:32 by anonymous:

Yeah, i didn't know you need to use <em> only inside <p>.

Actually i would allow <pre> tags if i were you (for code samples).

2011-07-13 23:15:53 by Sam:

Once I start using Loco internally for comment validation, that change ought to be a one-liner ;)

2011-07-27 23:15:44 by OvermindDL:

As stated in the prior article, it sounds like you want a PEG parser. Perhaps this can help locate one for PHP? http://stackoverflow.com/questions/79584/are-there-any-parsing-expression-grammar-peg-libraries-for-javascript-or-php

2011-07-29 14:29:48 by Sam:

If I wanted a PEG parser I'd have got a PEG parser, clearly. Or written one myself, more likely.

2011-10-25 19:35:29 by anonymous:

Hi,

i had a problem with Loco, and i'm not sure if it is a bug or if i don't completely understand how it works.
I had Loco do an infinite loop, while parsing some grammar (not that the grammar was in fact invalid), and as far as i understand it - it wasn't identifying left-recursion in the grammar.
Here is a stripped-down version of that grammar:
$parser = new Parser(
"<A>",
array(
"<A>" => new ConcCombinator(
Array("<B>")
),

"<B>" => new LazyAltCombinator(
Array(
"<C>", "<D>"
)
),

"<C>" => new ConcCombinator(
Array(
new StringCombinator("C"),
)
),

"<D>" => new LazyAltCombinator(
Array(
"<C>", "<A>",
)
),
)
);

This parser is left-recursive, because of the recursion:
A -> B -> D -> A

As for Loco, i located the source of the problem to be the left-recursion check loop (foreach($this->internals as $internal)) in the Parser constructor,
more precisely this portion of it:
for($j = 0; $j < count($firstSet); $j++) {
if($next === $firstSet[$j]) {
break(2);
}

I changed break(2) to break; here and then it correctly threw an error, identifying the parser as left-recursive.

But i have trouble understanding why there was a break(2) there, i'm guessing there is a reason for it ?
Do you remember why you put break(2) there?

Anyway, here is a modified version of Loco, test for this problem is added, all original tests pass too:
http://pastebin.com/CR3KLXzh

2011-10-25 19:38:18 by anonymous:

(notE that the grammar was in fact invalid)*

2011-10-25 20:28:55 by Sam:

Excellent catch, thanks for that.

The purpose of that chunk of code is to find the extended first-set of a combinator. Since it's a set, we must prevent duplication. To prevent duplication, each candidate new member of the extended first-set must be compared with all the existing members of the extended first-set. And, if the candidate member is already a member, we should ignore it and move onto the next candidate.

In other words, "break(2)" should have read "continue(2)". I'll have the modified file uploaded presently.

2011-10-30 16:42:08 by Sam:

Just a small update: I belatedly realise that the term "combinator" doesn't mean "a small parser suitable for combining with other parsers", but "an operator which combines one or more parsers to make another parser". As such, technically this library doesn't contain any parser combinators per se, unless you count the constructors of the LazyAltCombinator, ConcCombinator and GreedyMultiCombinator. So, I've renamed all the *Combinators to *Parsers, and renamed the Parser object to Grammar. This was a simple find/replace operation in my code and it should be equally simple in yours. Nothing else has changed. Apologies for the minor inconvenience.

2011-12-19 10:21:08 by hakre:

Did you consider a license / usage terms for your code? And I think it's a great library, why not put it in github so it's easier to play around with modifications?

2011-12-21 13:37:21 by Sam:

I'm going to look into both of those things. I'm sure I had a license for this code at one point, I can't think what happened to it (the code examples other than Loco.php are in the public domain). I also need to figure out exactly what Github is good for.

2012-02-14 03:51:59 by mindplay:

This has BY FAR the cleanest, most comprehensible format for parsers I've seen in PHP thus far. Nice work!

As pointed out by somebody else, this desperately needs to be on GitHub, so that a community can build around this brilliant work of code...

2012-04-19 20:58:18 by StefanFroelich:

I've been looking for anything less than this for a looong time and just couldn't find it. I'm shocked this has been up since last year, and btw, this is waaaay better than I ever dreamed of.

Ill let you know when I have a release of the language I want to build on php. All I needed was a parser. Thanks and great work.

Ps. The comment form threw an error on my name when I left a space. "Names can consist of only letters"

2013-06-07 02:23:36 by RasmusSchultz:

I know you're not actively working on this for some time, but I came across this:

https://vll.java.net/

It's a really incredible graphical language workbench, and it happens to be built for parser combinators - it writes out the BNF in a nice, simple XML format, which could potentially be parsed and loaded into Loco directly.

Either way, this is definitely very relevant and an amazing tool - better than even the commercial ones I've seen.