RailroadParser: a (finally finished) PHP parser generator

Alert! Don't use this! It is hideously, unnecessarily, unmaintainably complicated under the covers and I no longer support it in any meaningful sense. Use Loco instead! Much more sensible.

Here's the PHP script containing the Lexer and Parser classes that you'll need to create your own PHP lexers and parsers.

And here are some pre-built samples.

How do I use RailroadParser?

Here's a simple worked example with a million comments: bracketMatch.php.

For more advanced demonstrations, I recommend inspecting and fooling around with the other examples above, such as the JSON example. Also take a look at the unit tests to see some usage examples and failure cases.

How does RailroadParser work?

RailroadParser is a top-down parser similar to an LL(1) parser.

As the name implies, RailroadParser is based on the concept of the railroad diagram. As parsing progresses, the parser maintains two pieces of information. The first is the incomplete parse tree. The second is the physical position on the railroad, which includes a stack of nested rules that we are currently inside. The position is singular (i.e. the parser is deterministic and can only be in one state at a time) and fully describes the current location on the railroad, including which rules need to be "popped" off the stack in order to exit the railroad at the far end.

In order to figure out which instantaneous description to jump to next, RailroadParser has to look into the future. Because of this, RailroadParser cannot be used with left-recursive grammars because a left-recursive grammar results in situations where the number of possible future states is infinite. However, all left-recursive grammars can be rewritten to be right-recursive instead, which are fine. Left-recursion is caught at Parser instantiation time.

Because RailroadParser is deterministic, it cannot handle non-deterministic context-free grammars, only deterministic ones, also known as the LL(k) grammars. RailroadParser can only accept grammars specified in an LL(1) form.

Back to Code
Back to Things Of Interest
StumbleUpon Twitter Hacker News Facebook Reddit Digg del.icio.us Email

Discussion (14)

2011-04-17 11:00:22 by skztr:

I don't see any copyright notice or license. Do you intend others to be able to use this, or are you just looking for comment?

2011-04-17 11:02:47 by Sam:

I knew I'd forgotten something. That's a pretty big question. I'll think about it and get back to you.

2011-04-17 20:09:48 by Sam:

I guess everybody's got to be a lawyer these days.

I did a little research and selected the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/

This is also noted in the code. Hope this helps!

2011-04-18 07:44:46 by skztr:

Have you seen
http://wiki.creativecommons.org/FAQ#Can_I_use_a_Creative_Commons_license_for_software.3F
?

Though I won't pretend to understand why it isn't recommended.

2011-04-18 10:06:10 by Sam:

No, I haven't. I hate software licensing. I will do some more research and get back to you.

2011-04-18 21:50:57 by tef:

I mentioned this elsewhere, but I think you've written a top down parser, not at a LR(1) parser :-)

2011-04-20 19:31:46 by Sam:

Having read the definition of a top-down parser it looks like you are correct. This is not an LR parser (which is a type of bottom-up parser) but a top-down parser. However although it is not an LR parser, it does generate a rightmost derivation.

2011-05-12 03:08:10 by RasmusSchultz:

You noted near the end of the source code that getting the body of an anonymous function (closure) is "near impossible" - I haven't attempted it yet, but according to the manual, ReflectionFunction supports closures:

http://www.php.net/manual/en/reflectionfunction.construct.php

Using the starting/ending line number accessors, you should be able to extract the portion of the source code from the file, as in the user-contributed note at the bottom of this page:

http://www.php.net/manual/en/reflection.extending.php

I realize that is far from the only challenge - just wanted to point out that this is probably not a show stopper :-)

Fantastic work, by the way!

2011-07-06 08:52:46 by Bushee:

Hi Sam,
great work with the parser :)
How about code license? Have you chosen it yet?

2011-07-06 10:15:40 by Sam:

MIT license, it's there in the code now. The examples are in the public domain, however. Also, see the new hotness: http://qntm.org/loco

2011-07-12 03:03:44 by intelChrisAKAChrisClark:

Your comment about all LL(k) grammars being rewritable is technically incorrect. There are trivial examples of languages which are LL(k+1) but not LL(k), and thus not LL(1), and thus you cannot write an LL(1) grammar for them. It is not a major point, as most pratical languages have LL(1) versions--if one allows the "if-then-else hack".

2011-07-12 16:01:38 by Bushee:

Hello again,
I'd like to suggest you 2 little changes:
1. In your code, in 2 places of main code (AltCombinator's constructor and Parser's constructor) as well as in few "unit tests", you use anonymous function (function($array) { return $array[0]; } as example). Could you please rewrite these uses with either:
- call to create_function (create_function('$array', 'return $array[0];'); in the same example), or
- call to previously created static method (e.g. array('Parser', 'returnArraysFirstElement'))?
It's a pity I use a bit older PHP version, which doesn't support anonymous functions, and upgrading PHP isn't a possibility for me. Ofcourse, I could replace these usages by myself (which I have already done so far ;)), but in case of any update of your code, I would have to remember about it each time...
Anyway, this is a bit problematic - I guess, not only to me :)

2. In my lexer implementation (which extends your Lexer class) I had to use extractFirstToken() method, which was private, so I had to change its visibility modifier to protected. For the same reason as above - could you please rethink which 'private' methods could become 'protected' (or maybe even all of them)?

2011-07-13 23:09:58 by Sam:

1. Done and done. As mentioned over in the Loco thread, I didn't realise that anonymous functions were so new to PHP. Supporting them is a source of irritation because of how inconvenient the create_function() syntax is, but eventually I came up with a regular expression to replace all of the offending anonymous functions for me quite quickly. I re-ran all the unit tests and I have uploaded new versions of everything including the examples. Anonymous functions will continue to work as before, however.

2. I hadn't anticipated that people would be subclassing Lexer and Parser in addition to just making direct use of them! This change has been made and I hope it helps.

3. You should use Loco instead! It's easier to use and I hope it's going to be easier for me to maintain as well. I have learned an important lesson about putting code in public today: people use it!

2011-07-15 11:47:50 by Bushee:

Big thanks, everything's perfect now :)
About Loco - thanks for notice, but I see it is more like lexer & parser blended together. And for my purposes (separate lexer + parser approach for better compatibility with other solutions of that kind), and due to my bigger experience with such approach, I will stick with RailroadParser for now :)
However, I would be really looking forward and kept my fingers crossed, if you've ever tried to write an bottom-up parser (SLR(1) or LALR(1) especially) - it would be just perfect :D
Maybe, if I find *some* time in the future and study mechanisms working in back-end of bottom-up parsers, I would try to fix one myself :)

add comment