Convert a context-free grammar to Chomsky Normal Form in PHP

Backus-Naur Form is the standard means by which a context-free grammar may be represented in text. Backus-Naur Form can be used to represent any context-free rule from any context-free grammar.

However, it is sometimes more useful to render a context-free grammar into Chomsky Normal Form. In Chomsky Normal Form, only the following forms of context-free rule are allowed:

It can be proven that any context-free grammar, whether presented in Backus-Naur Form or otherwise, can be converted into Chomsky Normal Form. The following PHP script describes a Rule class and a ContextFreeGrammar class to which Rules can be added. The ContextFreeGrammar class also contains a function toCnf() which can be called to convert the grammar to Chomsky Normal Form.

Use

Use of this context-free grammar library and the toCnf() function is demonstrated in this simple script:

Sorry, Rule can't understand asterisk formations! To get rid of asterisks, simply turn e.g.:

  • ABC*D

into

  • C' → ε|CC'
  • ABC'D

Algorithm

The algorithm used is the one described on the Wikipedia entry for Chomsky Normal Form, rearranged a little to make the output shorter. For example, breaking up long rules (over 2 entries) into smaller rules is done quite early on as this usually results in a shorter output.

Regardless of this, it is quite likely that the output from toCnf() will be much longer than the input. This is unavoidable. In effect, a conversion to Chomsky Normal Form eliminates horizontal complexity in a grammar's rules and replaces it with vertical complexity. (That is: more rules, but simpler.)

This algorithm is not the best known in terms of output grammar size. However, it is the simplest that was available to me and implementing it was complicated enough already! A function eliminateUselesses() does eliminate useless symbols and rules.

// Convert a CFG to Chomsky Normal Form
public function toCnf() {
	$this->eliminateTerminals();
	$this->eliminateMultiples();
	$this->introduceS0(); // introduces a unit
	$this->eliminateEpsilons(); // may introduce units
	$this->eliminateUnits();
	$this->eliminateUselesses();
}

PHP? Seriously?!

My current ongoing pet project is input validation for web forms. Cross-site scripting attacks are incredibly prevalent. There are many ways to avoid them. One of them is to escape literally everything input by a user and thereby permit no formatting whatsoever. Another is to permit users to use some hideous non-standard non-HTML markup, such as Wikipedia's syntax or Markdown. There are a million of these formats - seemingly one per wiki - and they are all very slightly different which makes learning them a massive irritation, particularly if, like me, you already know the one true markup language, namely HTML itself.

The third is to validate the user's input HTML and ensure that it conforms to a permissible "safe" subset of HTML. If you have ever commented on my website then you'll see that this option is open to you: you are allowed to use <h5>, <p>, <em>, <strong>, <br />, &lt;, &gt; and &amp;, as long as everything is nested correctly and all the tags are closed and so on. Other tags and entities are simply rejected.

This is currently accomplished using an HTML validator which I hand-wrote. It has been tweaked since last time I spoke about it, but it is still much the same as it was. Writing the validator itself was great fun because it only needed to be done once, but constructing the grammar of input rules was pretty time-consuming. If I want to add new tags to my whitelist, I would need to rewrite it quite substantially and it would get a whole lot longer too.

Therefore, my current aim is to write a PHP script which will let me specify the grammar in Backus-Naur Form or similar, and just push a button and receive a new PHP validator. I'm building a validator generator in PHP. (This is one significant step below a parser generator but that's on the horizon.) If made sufficiently broad, it's my hope that I can use it to validate my own HTML when I create new pages like this, and maybe distribute it so other people can use it on their sites too.

So anyway, during this process I decided that it would be easier to handle an arbitrary context-free grammar if I first rendered it into Chomsky Normal Form. It turned out that I was mistaken, and I am now using an entirely different approach, which means toCnf() is useless to me. Still, it might be useful to you!

Back to Code
Back to Things Of Interest
StumbleUpon Twitter Hacker News Facebook Reddit Digg del.icio.us Email

Discussion (8)

2011-05-23 10:25:18 by sk:

I tried to used your converter but I am getting this error.

Fatal error: Uncaught exception 'Exception' with message 'Can't add terminal '10', must have length 1.' in cfg.php:95

This restricts terminals to be, say between 0 and 9.?

Would it possible to it s.t. we can terminals like 10,11,12, ..

Also same for non-terminals --if they are restricted to size 1 as well (asa I can read from the code, they are not though)

Thanks!

2011-05-23 11:26:37 by Sam:

A terminal is basically just a single character. This allows the characters from '0' to '9' and also 'a' to 'z' and any other symbol or letter that you can think of. You are not restricted to just numbers. In demo.php you can see how this works.

If you want to allow multiple characters in a row, list them separately in your list of terminals and list them one after the other in your rules. E.g. instead of

new ContextFreeGrammar("<S>", array("10"), array(), array(new Rule("<S>", array("10"))))

put

new ContextFreeGrammar("<S>", array("1", "0"), array(), array(new Rule("<S>", array("1", "0"))))

I hope this makes sense.

I will look into modifying cfg.php to not have this restriction.

2011-05-23 20:08:01 by sk:

Yes, I meant characters such as 0-9 or a-z. It would be great if this is not forced by the grammar, since, writing grammars with strings is just so natural.

Say for instance; Sentence -> Subject Verb Object

If I understand you correctly, we can encode these terminals/non-terminals as arrays ("S", "e", "n", "t", "e", "n", "c", "e") and so on.

Btw, thank you heaps for sharing your code --I think, it is filling a great void, because all I could not come up with any cnf converter online, which is heartbreaking. I mean, formal languages have been studied for decades by now, and I would expect to find a decent converter upon simple search. But there is none, except yours and a low quality commercial tool (buying software to convert cfg's to chomsky normal form, no thanks!)

So in that regard, I am more than happy to help you regarding the code, if you could point me in the right direction (I never used php before though).

I believe I am sitting on a nice set of grammars, which are derived from programming languages like Pascal, C++, Erlang, SQL, and I need to convert them to Chomsky normal form. So, once ahead of such syntactic issues, we can put toCNF() function into a nice test, and makes lives of others who would need a converter a whole a lot easier.

Thanks again, let's keep in touch.


2011-05-27 23:48:01 by Sam:

I'm still working on this but I think you should understand that "Sentence", "Subject", "Verb" and "Object" are non-terminals, not terminals. "Sentence", for example, is a wildcard, not a finished component of a finished sentence.

2011-05-30 03:42:51 by karthij:

Hi,
Basic information i need to know..plz tell me how to convert CFG like
S->aSb
S->ab
L(G)={aabb,aaabbb,....}

How to write code for this?

2011-05-30 14:14:02 by Sam:

You can now have multi-character terminals.

2011-05-30 14:21:25 by Sam:

$cfg = new ContextFreeGrammar(
"<S>",
array("a", "b"),
array(),
array(
new Rule("<S>", array("a", "<S>", "b")),
new Rule("<S>", array("a", "b"))
)
);
$cfg->toCnf();

2011-10-20 11:18:17 by Jsor:

@sk

The thing with grammars is that when you pass something into anything that deals with grammars (such as an LR parser), you've usually tokenized it already. For instance, if you're doing something with the tool bison, you're usually passing in code from the scanner flex, this means that something like 10 (or the regular expression [0-9]* in general) might be defined in your scanner as the token NUMBER, then rather than the parser having to handle "10", it rather handles some vaguely defined token like NUMBER (which is usually just an integer alias, but could be shoved off to a char or something similar). Of course, this is for parsing input and running it through a grammar checker, not converting to CNF, but the same idea generally applies: grammars are usually defined formally in terms of single tokens (S -> a S | a), anything else is really just syntactic sugar of a given parser implementation to allow you to write clearer code, a way to assign aliases if you want to call it that.

add comment