More PHP parsing fun

Okay!

This is a much more useful collection of PHP scripts than the earlier ones because they are actually useful for the purposes of parsing (well, validation).

In order to make this work, put these three files in the same directory so that they can require_once() one another properly.

Here's a simple script which demonstrates how to use these libraries. Or modules or packages or whatever they're called in PHP.

Info

Lexers

There is no explicit line in the sand which says what a lexer must do and what the actual parser must do. In theory, parser combinators are of unlimited power and can handle actual context-free grammars and better, but for practical purposes this is not what parser combinators are best at. Likewise it a waste of processing power running pedestrian finite-state-machine nonsense on a full-blown CFG parser. So, basically, anything that a regex can do (i.e. concatenation, alternation, multiplication), should be done with parser combinators and anything that a regex can't do (i.e. recursion) should be left up to the parser.

So, there is now a Lexer object. For our purposes a lexer is a list of parser combinators. I created a bunch of useful combinators. A StringCombinator will just recognise a static string. A RegexCombinator recognises a regular expression. A BinaryCombinator uses a bitmask to recognise any of a variety of static strings (hopefully faster than a StringCombinator although I never found much use for it). (Remember that in PHP a string and a sequence of binary data are the same thing.) A Utf8Combinator recognises any valid UTF-8 character which falls inside the universally safe ranges for XML and which doesn't fall inside a blacklist which you can pass in as an attribute.

Combinators can also be combined. An AltCombinator performs alternation ("or"), by taking multiple combinators and returns the results from all of them. A LazyAltCombinator takes multiple combinators and returns only the first positive result. A ConcCombinator performs concatenation, by taking multiple combinators and stringing them together one after the other. A StarCombinator performs the same operation as the asterisk does in a regular expression, and returns every possible value of j. A GreedyStarCombinator does the same but only returns the maximum possible value of j (which could of course be zero characters).

Context-free grammars

CFGs work pretty much exactly how they did last time, but it has a very large amount of the nonsense and checks stripped out for efficiency. The syntax for creating CFGs is a little different now, see the example script.

CFG Rules now have a "starify" flag which makes some important optimisations to how the Railroad is generated. Previously, the inability to "starify" things meant that stacks would grow without limit and parsing time would grow substantially worse than linearly.

Railroads

A Railroad is kind of like a pushdown automaton only, again, it has a lot of nonsense stripped out, and no longer strictly adheres to the mathematical model. The syntax for creating a railroad is pretty straightforward but the best way to create it is to use the $cfg->getRailroad() method. If you want to see the syntax, just use the $rr->show() method.

The construction is based on railroad diagrams, obviously.

Why am I explaining all this? See demo.php for examples and useful information.

Conclusion

I call this success because from now on I will be using this stuff to validate comments here on qntm.org, and that means other people will be able to do the same.

$lexer = new Lexer(
	array(
		new RegexCombinator("|^[ \n\r\t]+|", "WHITESPACE"),

		new RegexCombinator("|^<p[ \n\r\t]*>|", "OPEN_P"),
		new RegexCombinator("|^</p[ \n\r\t]*>|", "CLOSE_P"),

		new RegexCombinator("|^<h5[ \n\r\t]*>|", "OPEN_H5"),
		new RegexCombinator("|^</h5[ \n\r\t]*>|", "CLOSE_H5"),

		new RegexCombinator("|^<em[ \n\r\t]*>|", "OPEN_EM"),
		new RegexCombinator("|^</em[ \n\r\t]*>|", "CLOSE_EM"),

		new RegexCombinator("|^<strong[ \n\r\t]*>|", "OPEN_STRONG"),
		new RegexCombinator("|^</strong[ \n\r\t]*>|", "CLOSE_STRONG"),

		new RegexCombinator("|^<br[ \n\r\t]*/>|", "FULL_BR"),

		new GreedyStarCombinator(
			new LazyAltCombinator(
				array(
					new Utf8Combinator(array("<", ">", "&"), null), // any UTF-8 character except <, > or &
					new StringCombinator("&lt;", null), // ... or an escaped <
					new StringCombinator("&gt;", null), // ... or an escaped >
					new StringCombinator("&amp;", null) // ... or an escaped &
				),
				null
			),
			"CDATA"
		)
	)
);

$cfg = new ContextFreeGrammar(
	"<comment>",
	array(
		"<comment>" => array(
			new Rule(array("<blockorwhitespace>"), true) // note the "starify" flag is set
		),
		"<blockorwhitespace>" => array(
			new Rule(array("<h5>")),
			new Rule(array("<p>")),
			new Rule(array("WHITESPACE"))
		),
		"<p>" => array(
			new Rule(array("OPEN_P", "<text>", "CLOSE_P"))
		),
		"<h5>" => array(
			new Rule(array("OPEN_H5", "<text>", "CLOSE_H5"))
		),
		"<strong>" => array(
			new Rule(array("OPEN_STRONG", "<text>", "CLOSE_STRONG"))
		),
		"<em>" => array(
			new Rule(array("OPEN_EM", "<text>", "CLOSE_EM"))
		),
		"<text>" => array(
			new Rule(array("<atom>"), true)
		),
		"<atom>" => array(
			new Rule(array("<strong>")),
			new Rule(array("<em>")),
			new Rule(array("FULL_BR")),
			new Rule(array("WHITESPACE")),
			new Rule(array("CDATA"))
		)
	)
);

$rr = $cfg->getRailroad();

$tokens = $this->lexer->tokenize($text); // this may throw an exception

if(!$this->rr->willAcceptTokens($tokens)) {
	throw new Exception("Can't parse text."); // todo: more verbose errors
}
Back to Code
Back to Things Of Interest
StumbleUpon Twitter Hacker News Facebook Reddit Digg del.icio.us Email

Discussion (6)

2011-03-06 21:00:00 by Sam:

Testing the basic parser!


Blaaaaaah...

2011-03-06 21:01:20 by Sam:

Testing the advanced parser!
Title
&

<

2011-03-06 21:38:10 by Jymbob:

Mmm. Regexy goodness in forms. I like it.
So does it automatically convert spare brackets that I've left lying around?
Strip them out? Leave them in??
I tried to put a few < and >s in but it seems it doesn't try to convert them. I suppose that's fair enough. It doesn't convert a single & either, which might be kind of useful. Also everything must be wrapped in a paragraph.

2011-03-06 22:34:38 by Sam:

And apologies for the regression in the quality of error messages when you get the HTML wrong, but frankly I'm sick of this whole sordid business and want it to be over and nobody uses the clever parser anyway.

There is no royal road to parsing.

2011-03-07 11:11:13 by Mike:

Oh, man, this is going to be useful...

2011-05-24 12:30:20 by asdf:

safd

add comment