Okay!
This is a much more useful collection of PHP scripts than the earlier ones because they are actually useful for the purposes of parsing (well, validation).
In order to make this work, put these three files in the same directory so that they can require_once() one another properly.
Here's a simple script which demonstrates how to use these libraries. Or modules or packages or whatever they're called in PHP.
Info
Lexers
There is no explicit line in the sand which says what a lexer must do and what the actual parser must do. In theory, parser combinators are of unlimited power and can handle actual context-free grammars and better, but for practical purposes this is not what parser combinators are best at. Likewise it a waste of processing power running pedestrian finite-state-machine nonsense on a full-blown CFG parser. So, basically, anything that a regex can do (i.e. concatenation, alternation, multiplication), should be done with parser combinators and anything that a regex can't do (i.e. recursion) should be left up to the parser.
So, there is now a Lexer object. For our purposes a lexer is a list of parser combinators. I created a bunch of useful combinators. A StringCombinator will just recognise a static string. A RegexCombinator recognises a regular expression. A BinaryCombinator uses a bitmask to recognise any of a variety of static strings (hopefully faster than a StringCombinator although I never found much use for it). (Remember that in PHP a string and a sequence of binary data are the same thing.) A Utf8Combinator recognises any valid UTF-8 character which falls inside the universally safe ranges for XML and which doesn't fall inside a blacklist which you can pass in as an attribute.
Combinators can also be combined. An AltCombinator performs alternation ("or"), by taking multiple combinators and returns the results from all of them. A LazyAltCombinator takes multiple combinators and returns only the first positive result. A ConcCombinator performs concatenation, by taking multiple combinators and stringing them together one after the other. A StarCombinator performs the same operation as the asterisk does in a regular expression, and returns every possible value of j. A GreedyStarCombinator does the same but only returns the maximum possible value of j (which could of course be zero characters).
Context-free grammars
CFGs work pretty much exactly how they did last time, but it has a very large amount of the nonsense and checks stripped out for efficiency. The syntax for creating CFGs is a little different now, see the example script.
CFG Rules now have a "starify" flag which makes some important optimisations to how the Railroad is generated. Previously, the inability to "starify" things meant that stacks would grow without limit and parsing time would grow substantially worse than linearly.
Railroads
A Railroad is kind of like a pushdown automaton only, again, it has a lot of nonsense stripped out, and no longer strictly adheres to the mathematical model. The syntax for creating a railroad is pretty straightforward but the best way to create it is to use the $cfg->getRailroad() method. If you want to see the syntax, just use the $rr->show() method.
The construction is based on railroad diagrams, obviously.
Why am I explaining all this? See demo.php for examples and useful information.
Conclusion
I call this success because from now on I will be using this stuff to validate comments here on qntm.org, and that means other people will be able to do the same.
$lexer = new Lexer(
array(
new RegexCombinator("|^[ \n\r\t]+|", "WHITESPACE"),
new RegexCombinator("|^<p[ \n\r\t]*>|", "OPEN_P"),
new RegexCombinator("|^</p[ \n\r\t]*>|", "CLOSE_P"),
new RegexCombinator("|^<h5[ \n\r\t]*>|", "OPEN_H5"),
new RegexCombinator("|^</h5[ \n\r\t]*>|", "CLOSE_H5"),
new RegexCombinator("|^<em[ \n\r\t]*>|", "OPEN_EM"),
new RegexCombinator("|^</em[ \n\r\t]*>|", "CLOSE_EM"),
new RegexCombinator("|^<strong[ \n\r\t]*>|", "OPEN_STRONG"),
new RegexCombinator("|^</strong[ \n\r\t]*>|", "CLOSE_STRONG"),
new RegexCombinator("|^<br[ \n\r\t]*/>|", "FULL_BR"),
new GreedyStarCombinator(
new LazyAltCombinator(
array(
new Utf8Combinator(array("<", ">", "&"), null), // any UTF-8 character except <, > or &
new StringCombinator("<", null), // ... or an escaped <
new StringCombinator(">", null), // ... or an escaped >
new StringCombinator("&", null) // ... or an escaped &
),
null
),
"CDATA"
)
)
);
$cfg = new ContextFreeGrammar(
"<comment>",
array(
"<comment>" => array(
new Rule(array("<blockorwhitespace>"), true) // note the "starify" flag is set
),
"<blockorwhitespace>" => array(
new Rule(array("<h5>")),
new Rule(array("<p>")),
new Rule(array("WHITESPACE"))
),
"<p>" => array(
new Rule(array("OPEN_P", "<text>", "CLOSE_P"))
),
"<h5>" => array(
new Rule(array("OPEN_H5", "<text>", "CLOSE_H5"))
),
"<strong>" => array(
new Rule(array("OPEN_STRONG", "<text>", "CLOSE_STRONG"))
),
"<em>" => array(
new Rule(array("OPEN_EM", "<text>", "CLOSE_EM"))
),
"<text>" => array(
new Rule(array("<atom>"), true)
),
"<atom>" => array(
new Rule(array("<strong>")),
new Rule(array("<em>")),
new Rule(array("FULL_BR")),
new Rule(array("WHITESPACE")),
new Rule(array("CDATA"))
)
)
);
$rr = $cfg->getRailroad();
$tokens = $this->lexer->tokenize($text); // this may throw an exception
if(!$this->rr->willAcceptTokens($tokens)) {
throw new Exception("Can't parse text."); // todo: more verbose errors
}
Discussion (6)
2011-03-06 21:00:00 by Sam:
2011-03-06 21:01:20 by Sam:
2011-03-06 21:38:10 by Jymbob:
2011-03-06 22:34:38 by Sam:
2011-03-07 11:11:13 by Mike:
2011-05-24 12:30:20 by asdf:
add comment