Backus-Naur Form is the standard means by which a context-free grammar may be represented in text. Backus-Naur Form can be used to represent any context-free rule from any context-free grammar.
However, it is sometimes more useful to render a context-free grammar into Chomsky Normal Form. In Chomsky Normal Form, only the following forms of context-free rule are allowed:
- S → ε, where S is the start symbol
- A → BC, where B and C are non-terminal symbols and B ≠ S and C ≠ S
- A → α, where A is a nonterminal symbol and α is a terminal symbol.
It can be proven that any context-free grammar, whether presented in Backus-Naur Form or otherwise, can be converted into Chomsky Normal Form. The following PHP script describes a Rule class and a ContextFreeGrammar class to which Rules can be added. The ContextFreeGrammar class also contains a function toCnf() which can be called to convert the grammar to Chomsky Normal Form.
Use
Use of this context-free grammar library and the toCnf() function is demonstrated in this simple script:
Sorry, Rule can't understand asterisk formations! To get rid of asterisks, simply turn e.g.:
- A → BC*D
into
- C' → ε|CC'
- A → BC'D
Algorithm
The algorithm used is the one described on the Wikipedia entry for Chomsky Normal Form, rearranged a little to make the output shorter. For example, breaking up long rules (over 2 entries) into smaller rules is done quite early on as this usually results in a shorter output.
Regardless of this, it is quite likely that the output from toCnf() will be much longer than the input. This is unavoidable. In effect, a conversion to Chomsky Normal Form eliminates horizontal complexity in a grammar's rules and replaces it with vertical complexity. (That is: more rules, but simpler.)
This algorithm is not the best known in terms of output grammar size. However, it is the simplest that was available to me and implementing it was complicated enough already! A function eliminateUselesses() does eliminate useless symbols and rules.
// Convert a CFG to Chomsky Normal Form
public function toCnf() {
$this->eliminateTerminals();
$this->eliminateMultiples();
$this->introduceS0(); // introduces a unit
$this->eliminateEpsilons(); // may introduce units
$this->eliminateUnits();
$this->eliminateUselesses();
}
PHP? Seriously?!
My current ongoing pet project is input validation for web forms. Cross-site scripting attacks are incredibly prevalent. There are many ways to avoid them. One of them is to escape literally everything input by a user and thereby permit no formatting whatsoever. Another is to permit users to use some hideous non-standard non-HTML markup, such as Wikipedia's syntax or Markdown. There are a million of these formats - seemingly one per wiki - and they are all very slightly different which makes learning them a massive irritation, particularly if, like me, you already know the one true markup language, namely HTML itself.
The third is to validate the user's input HTML and ensure that it conforms to a permissible "safe" subset of HTML. If you have ever commented on my website then you'll see that this option is open to you: you are allowed to use <h5>, <p>, <em>, <strong>, <br />, <, > and &, as long as everything is nested correctly and all the tags are closed and so on. Other tags and entities are simply rejected.
This is currently accomplished using an HTML validator which I hand-wrote. It has been tweaked since last time I spoke about it, but it is still much the same as it was. Writing the validator itself was great fun because it only needed to be done once, but constructing the grammar of input rules was pretty time-consuming. If I want to add new tags to my whitelist, I would need to rewrite it quite substantially and it would get a whole lot longer too.
Therefore, my current aim is to write a PHP script which will let me specify the grammar in Backus-Naur Form or similar, and just push a button and receive a new PHP validator. I'm building a validator generator in PHP. (This is one significant step below a parser generator but that's on the horizon.) If made sufficiently broad, it's my hope that I can use it to validate my own HTML when I create new pages like this, and maybe distribute it so other people can use it on their sites too.
So anyway, during this process I decided that it would be easier to handle an arbitrary context-free grammar if I first rendered it into Chomsky Normal Form. It turned out that I was mistaken, and I am now using an entirely different approach, which means toCnf() is useless to me. Still, it might be useful to you!
Discussion (8)
2011-05-23 10:25:18 by sk:
2011-05-23 11:26:37 by Sam:
2011-05-23 20:08:01 by sk:
2011-05-27 23:48:01 by Sam:
2011-05-30 03:42:51 by karthij:
2011-05-30 14:14:02 by Sam:
2011-05-30 14:21:25 by Sam:
2011-10-20 11:18:17 by Jsor:
add comment