I have made some minor site tweaks today.

  • There are now some social bookmarking links at the bottom of each page.
  • RSS feeds only appear on pages where they're likely to be of any use to anybody, namely the front page, blog, fiction and Fine Structure. RSS feeds can still be theoretically generated for any page, though, if you manually alter the URL, and are dumb.
  • Empty discussion threads do not appear on pages where discussion is turned off.
  • Closed discussion threads are marked as such instead of just mysteriously lacking an "add comment" button.
  • Comments have a swanky new HTML parser.

The last point was the one which required the most work. I realise that there are parsers out there which do the job, but I wanted something that was restrictive to the point of being puritanical so that my precious valid XHTML can't be disrupted by commenters who don't know how to mark stuff up. The new parser can handle any legitimate combination of the <p>, <em> and <strong> tags, as well as the HTML character entities for <, > and &, namely &lt;, &gt; and &amp; respectively - nothing else whatsoever is permissible. By only allowing a whitelist of tags and entities rather than blacklisting exploits as and when I find them, I find there's much less logical room to manoeuvre for inserting malicious markup and code. In particular, tag attributes are not permitted (you can't even add extra whitespace inside your tags, which is legal in XHTML - sorry, but this cuts out malicious CSS like style="font-size: 3000%").
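As a rough illustration of the whitelist idea, here is a minimal sketch in Python (the site itself runs on PHP, and this is not its actual code). The names TOKEN and tokenize are made up for the example. It only splits a comment into permitted tokens and rejects everything else; checking that the tags are correctly nested would be a separate step.

```python
import re

# Whitelist: the three tags (no attributes, no extra whitespace),
# the three entities, and runs of text containing no <, > or &.
TOKEN = re.compile(r"</?(?:p|em|strong)>|&(?:lt|gt|amp);|[^<>&]+")

def tokenize(comment):
    """Split a comment into whitelisted tokens, or raise ValueError."""
    tokens = []
    position = 0
    while position < len(comment):
        match = TOKEN.match(comment, position)
        if match is None:
            # Nothing on the whitelist starts here, so refuse the comment.
            raise ValueError(
                f"character {position}: {comment[position]!r} is not permitted here"
            )
        tokens.append(match.group())
        position = match.end()
    return tokens
```

Because the pattern lists what is allowed rather than what is forbidden, inputs such as `<p style="font-size: 3000%">`, `<p >` or `&nbsp;` fail without needing a rule of their own.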

I make no apology for how punishingly restrictive the new parser is. You can still, naturally, use the simple old one, which just turns your line breaks into <br />s, but doesn't allow bold or italicised text. Meanwhile, any of you who fancy trying to crack this thing open are welcome to try to poke holes in the new code, using the discussion below. Hope it holds up!

Update 2010-01-16 17:19:40

Okay, I have fixed a bug which was double-escaping special characters. Amazingly, it turned out that characters were being put through htmlspecialchars() somewhere between being POSTed as form data and being available to PHP's $_POST global variable. Being completely unable to find any kind of documentation or bug explaining this behaviour, I eventually magically solved the problem by putting header('Content-type: text/html; charset=UTF-8'); at the beginning of each page load. Why this worked, I don't really know.

It is also now impossible to submit certain special characters which are not permitted in XHTML documents, those between hex codes 0 and 1F. Unfortunately, I have discovered that PHP's Unicode support is extremely poor, varying between experimental and non-existent, so the treatment of special characters is not currently to my liking. There's no function to, for example, turn a single character into its hex sequence and back again, or to check that a specific hex sequence is defined in Unicode or not.
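For comparison, the operations described above look roughly like this in Python. This is a sketch, not the site's PHP code, and every function name here is invented for the example. One assumption is labelled in the code: XML 1.0 does permit tab, line feed and carriage return within the 0 to 1F range, so the sketch exempts those three.

```python
import unicodedata

# Assumption: XML 1.0 allows these three control characters (tab, LF, CR).
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}

def forbidden_positions(text):
    """Return the indexes of control characters that XML 1.0 does not permit."""
    return [
        index
        for index, char in enumerate(text)
        if ord(char) <= 0x1F and ord(char) not in ALLOWED_CONTROLS
    ]

def to_hex(char):
    """Turn a single character into its code point as a hex string."""
    return format(ord(char), "04X")

def from_hex(hex_string):
    """Turn a hex code point string back into a single character."""
    return chr(int(hex_string, 16))

def is_assigned(char):
    """True if the code point is assigned in Python's copy of the Unicode database."""
    return unicodedata.category(char) != "Cn"
```

Note that is_assigned depends on the Unicode version bundled with the Python interpreter, so its answers can change between releases.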

Update 2010-02-05 09:55:12

The HTML parser I had written was a little unwieldy and it looked like it was going to be surprisingly difficult to maintain. I did some research into formal grammars, formal languages, context-free grammars and languages (which HTML is), regular grammars and languages (which HTML is not), formal language parsing and parsers. I formulated the small permissible subset of HTML which makes up comments on this site as a context-free grammar, and then rendered it into Chomsky Normal Form which allowed me to implement the CYK algorithm. Unfortunately, this algorithm has two major drawbacks: it operates in O(n³) time (a comment twice as long takes eight times as long to parse) and it has absolutely no capability to tell the user how, exactly, parsing has failed.
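To make the two drawbacks concrete, here is a minimal CYK recogniser in Python. It is a sketch under stated assumptions, not the parser that was written for this site: the function name cyk, the toy grammar (S → O R, R → S C, S → "text", O → "<em>", C → "</em>") and the pre-split token list are all invented for the example. The three nested loops over length, start position and split point are where the cubic cost comes from, since (2n)³ = 8n³, and the only thing the function can report is yes or no.

```python
def cyk(tokens, terminals, rules, start):
    """Return True if the tokens can be derived from the start symbol.

    terminals maps a token to the set of non-terminals producing it.
    rules maps a pair (B, C) to the set of non-terminals A with A -> B C.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[length - 1][i] holds every non-terminal deriving tokens[i:i + length].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[0][i] = set(terminals.get(token, ()))
    for length in range(2, n + 1):              # span length
        for i in range(n - length + 1):         # span start
            for split in range(1, length):      # size of the left part
                for b in table[split - 1][i]:
                    for c in table[length - split - 1][i + split]:
                        table[length - 1][i] |= rules.get((b, c), set())
    return start in table[n - 1][0]


# Hypothetical toy grammar: nested <em> tags around one run of text.
TOY_TERMINALS = {"<em>": {"O"}, "</em>": {"C"}, "text": {"S"}}
TOY_RULES = {("O", "R"): {"S"}, ("S", "C"): {"R"}}
```

For instance, `cyk(["<em>", "text", "</em>"], TOY_TERMINALS, TOY_RULES, "S")` accepts, while dropping the closing tag makes it return False with no indication of where the input went wrong.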

I didn't release this new parser "into the wild" while I mulled over this problem. I quickly realised that the interesting thing about HTML is that there is only a single precise parse tree - while it is possible for the same HTML string to be correctly produced in many different ways, the parse tree arising from this production is always identical. I realised that the reason for this is that all the possible non-terminal characters in this CFG are either left-handed / producers or right-handed / productions - in rules of the form A → BC, any given non-terminal character either always appears as B or always appears as C. This sets much greater constraints on how a string can be formed. For example, if the string begins with a right-handed character C, then you know that the string cannot be formed because there is no left-handed B to produce at the same time.

In addition, if the rules are formulated so that BC is a unique key in production rules, then each two-character combination can only be formed in one way, from A (or not at all). So, when parsing, if you encounter BC, you KNOW you can INSTANTLY reduce that to A and continue to parse, no questions asked, no problems raised. Or, if you encounter a combination DE, where the handednesses match but there is no production rule, you KNOW that that combination can NEVER be produced, so you can instantly throw an error. As a result, you can now parse the string in linear time.
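One way to read the reduction described above is as a stack-based loop. The following Python sketch is only an interpretation of that description, not the code running on this site: the names LEFT, RIGHT, RULES, TERMINALS, START and parse are invented, the grammar is a hypothetical toy (nested <em> tags around one run of text), and the input is assumed to be already split into tokens.

```python
# Hypothetical toy grammar in Chomsky Normal Form:
#   S -> O R      S -> "text"
#   R -> S C
#   O -> "<em>"   C -> "</em>"
LEFT = {"O", "S"}     # only ever appear as B in a rule A -> B C
RIGHT = {"R", "C"}    # only ever appear as C in a rule A -> B C
RULES = {("O", "R"): "S", ("S", "C"): "R"}   # each pair (B, C) has one producer A
TERMINALS = {"<em>": "O", "</em>": "C", "text": "S"}
START = "S"

def parse(tokens):
    """Accept or reject a token list, with a specific error on failure."""
    stack = []
    for position, token in enumerate(tokens):
        if token not in TERMINALS:
            raise ValueError(f"token {position}: {token!r} is not allowed")
        stack.append(TERMINALS[token])
        # A right-handed symbol must combine with its left neighbour at once.
        while stack[-1] in RIGHT:
            if len(stack) < 2:
                raise ValueError(f"token {position}: {token!r} has nothing to its left")
            pair = (stack[-2], stack[-1])
            if pair not in RULES:
                raise ValueError(
                    f"token {position}: {pair[0]} cannot be followed by {pair[1]}"
                )
            stack[-2:] = [RULES[pair]]   # reduce B C to A
    if stack != [START]:
        raise ValueError("input ended too early; still open: " + " ".join(stack))
    return True
```

Each token is pushed once and every reduction shrinks the stack by one, so the total work grows linearly with the number of tokens. The error messages also carry a position, which is the information the CYK approach could not supply.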

I implemented this last night. I can pin this down in more specific terms if anybody cares, but I'm pretty sure I've simply independently discovered one of these (though at first glance it's not easy to tell which one).

You can now use <br /> line breaks, and adding more formatting rules is now trivial for me (if I ever decide to).

I've also added some cunning .htaccess and PHP redirection rules to harmonise and homogenise URLs. URLs of the form "" and "" will still work, but give permanent redirects to the canonical "" page. This little niggling extra character in URLs has been something I've wanted to excise for at least a year, but never got around to it.

The secret? Use .htaccess to redirect to a PHP script, then handle the request URI using nice easy PHP. Much easier than grappling with Apache directives.

Discussion (63)

2010-01-10 18:10:18 by Ejl:

Could you add a ma.gnolia bookmarks thingummy? It's relaunched at works kind of ok atm. also, <PLAINTEXT>

2010-01-10 19:09:33 by Eskivole:

"blacklisting exploits as and when I find them" So Sam is the Imprisoning God?!

2010-01-10 19:11:43 by qntm:

No, as previously intimated, I operate on a whitelist policy :)

2010-01-11 00:17:37 by Raphfrk:

Can you add a "preview" option (or a way to edit comments). Getting html code (or even normal comments) right on the first attempt isn't easy.

2010-01-11 01:32:28 by Thrack:

So no italics or bold? Is having all the text starting on the very edge of the screen also one of the changes you made? I think the text needs to start a little further from the edge as it makes the site look less clean. Also, the Google ads have been pushed to the bottom of the screen, I'm guessing that's a bug? I would also like a preview button, I usually use the preview to reread what I've typed so I can see the message in the format it will be seen by others. I could reread it without using the preview option (and I do so at qntm) but I prefer using a preview. And of course, I use the preview to spot mistakes I made if I used any HTML tags. But whether you decide to add that feature or not, thanks for the HTML parser and various other little tweaks. I'm sure they will make the site look a little nicer.

2010-01-11 01:38:29 by Lucas:

Eskivole, you genius.

2010-01-11 02:54:51 by dankuck:

I don't see the issues Thrack reports. Aside: Do you even have Google ads? I don't see those. I'll toy with this. =)

2010-01-11 03:03:07 by dankuck:


2010-01-11 03:33:34 by dankuck:

toy boat

2010-01-11 03:35:56 by dankuck:

toy goat

2010-01-11 03:46:58 by Sgeo:

Thrack, what do you mean, no **bold?**. Mind you, I'm scared that this might not work, since I see no one else using it...

2010-01-11 03:47:00 by dankuck:

So far, I've managed to break the strict XHTMLness by sending a null character. I connected to port 80 using telnet and sent the following: POST /action_edit_comments.php HTTP/1.1 Host: Content-Type: application/x-www-form-urlencoded User-Agent: Mozilla/4.0 (blah) Cookie: PHPSESSID=my_cookie Content-Length: 75 slug=parser&name=dankuck&parser=2&sqrt=i&text=%3Cp%3Etoy goat%00%3C%2Fp%3E I was hoping to put the null char in a p tag so ($tag_name == "p") would return true inaccurately and maybe I could fit some attributes behind the null char, but I guess PHP is hip to that trick.

2010-01-11 03:48:09 by Sgeo:

Although, of course, <strong> doesn't imply bold if something wants to use something else to indicate <strong>... but why would that be an issue?

2010-01-11 03:57:41 by Thrack:

dankuck, you don't see those problems? That's odd. I'll try another browser. I just tried coming to with Google Chrome and it looks fine so it must be a browser dependent problem. If Sam feels like looking into it, my browser is Firefox 3.5.6

2010-01-11 04:03:09 by Thrack:

Oh, so <b>bold</b> and <i>italics</i> are allowed. <u>Underline.</u> <strike>Strike.</strike> I figured they weren't since Sam didn't specifically mention them. Since he's made a whitelist of allowed tags maybe he should list them all somewhere?

2010-01-11 04:05:03 by dankuck:

My bad, "hep to that trick". Regarding Thrack's results: BTW, I'm using Opera 10.10. @Sgeo, I think <b> is no longer allowed in XHTML. Though I think folks should be able to bold without implying that the content is strong, they say that's what CSS is for.

2010-01-11 04:06:04 by dankuck:


2010-01-11 04:10:56 by Thrack:

Sgeo, I have no idea how you got that bold command to work. I'm getting an error saying <b> and <i> and <u> and <strike> isn't allowed inside or outside the <p> </p> tags. But then again it's been a long time since I used HTML so I probably just forgot something.

2010-01-11 04:16:40 by Thrack:

Oh! The site's formatting is spontaneously fixed! Yay! Must be Sam at work. Dankuck, you say the <strong> tag has to be used to use the <b> tag and others? Maybe I shouldn't bother with the HTML parser for now, all I would have used are <i> and <b> anyway. Maybe even <strike> once in a while.

2010-01-11 09:20:06 by qntm:

Man, I had no idea that null characters were flat-out illegal in XHTML documents. I'm going to have to study this before I can fix it. Cheers.

2010-01-11 10:15:58 by qntm:

Testing illegal characters: ™

2010-01-11 14:57:03 by Warrigal:


2010-01-12 10:22:44 by David:

If you are going to test fancy characters... &#8238;Then you should test the Really fancy ones! (and if this works, it is also a test for leakyness)

2010-01-12 10:30:55 by David:

Well, that was blocked :-) The character in question being "U+202E RIGHT-TO-LEFT OVERRIDE - commonly abbreviated RLO" in case anyone wants to try it or similar via telnet.

2010-01-12 10:48:13 by qntm:

Yup, because that was not a Unicode character but an HTML character entity, which are subject to a whitelist. If you pasted the actual Unicode character in here, that would be perfectly acceptable, though. It is valid Unicode, after all.

2010-01-12 18:06:06 by ejl:

Actually &#8238; appears not to work; it is converted to the HTML character entity somehow *before* the filter looks at it (having pasted the character directly from gnome-character-map, there are errors that mention "&#8238;", not the character U+202E itself).

2010-01-12 18:41:11 by qntm:

I think you may be using the wrong parser. You need to select the basic parser rather than "None".

2010-01-14 01:01:39 by ejl:

&#8238;Oh i see.

2010-01-14 01:02:11 by ejl:

Hmmm not working with "none" or "basic".

2010-01-16 17:15:28 by qntm:

UnicodeThis post was originally twice as long, but a lot of it was disallowed.

2010-01-16 17:19:16 by qntm:


2010-01-20 12:19:35 by CapnBaht:

I can't use the baht symbol... &#3647; CapnBaht haz a sad.

2010-01-20 20:20:06 by Sgeo:

It's not possible for users to delete posts, is it? What happened to the rest of Warrigal's post?

2010-01-20 20:21:56 by qntm:

I can delete posts.

2010-01-21 14:51:53 by qntm:


2010-02-04 19:12:02 by qntm:


2010-02-04 19:12:56 by qntm:


2010-02-04 22:58:09 by Samm:

dsfgsd&f sdfgdsf ** f** sdg

2010-02-05 10:42:26 by qntm:

Test dfsdf

2010-02-05 14:04:18 by Robin:

Ooh, RSS feed has gone a little wonky though... >10 new pages popped up on Google Reader, which were actually old.

2010-02-05 14:20:56 by qntm:

That's because I've increased the size to 20 items.

2010-02-06 00:08:17 by frymaster:

re: your double-escaping issue, you can tell htmlspecialchars() and htmlentities() not to double-encode existing entities: When double_encode is turned off PHP will not encode existing html entities, the default is to convert everything not sure what practical difference those 2 functions make to correctness of your code also, the bloody parser doesn't allow underscores

2010-02-06 00:13:37 by qntm:

I know that. Unfortunately, the version of PHP installed on my web host's server is 5.2.0 and the double_encode parameter wasn't added to htmlspecialchars until PHP 5.2.3. Still, the problem is resolved now.

2010-02-06 00:15:16 by frymaster:

and in case you're wondering why my last comment looks odd, it's because, despite having the parser set to "none" (line breaks become br) it moaned about the CR of my CR/LF pair not being valid. w3 spec says a linebreak can be either CR, LF, _or_ CRLF

2010-02-06 02:43:26 by YarKramer:

Ah, a new way of parsing stuff which is ... um, pretty much the same as when I first came here to read "How to Destroy the Earth"? That explains why the icons in my RSS feed from qntm pages I'd already viewed went back to "blank page" ... ;)

2010-02-06 07:24:42 by Lar:

If one changes <option selected="selected" value="parser">Tweaks</option> to <option selected="selected" value="<SCRIPT SRC=></SCRIPT>">Tweaks</option> in the source, the error page will run the javascript.

2010-02-06 09:56:32 by Samm:

Testing line breaks

2010-02-08 04:10:20 by Lar:

Putting %.3C (except for the period) somewhere in the comment with either parser causes it to hiccup and go to which only contains the text array(2) { [0]=> string(1) "S" [1]=> string(5) "CDATA" }

2010-02-08 07:38:28 by Ken:

The link <em></em>, which was what I had bookmarked for this site, no longer works.

2010-02-08 09:12:01 by qntm:


2010-02-08 09:14:11 by qntm:

Both fixed, although the second one is going to require some work to fix "properly".

2010-02-08 23:01:18 by qntm:

The XSS injection attack has also been fixed. That was a subtle one, but I'm irritated that it slipped past me.

2010-02-12 10:02:50 by test:


2010-02-12 10:04:34 by test:

Looks like you can still trick the comment field into displaying nothing, though.

2010-02-12 10:42:39 by qntm:

That's an odd one. What exactly did you send in the comment body?

2010-02-17 23:47:42 by Lar:


2010-02-18 00:41:44 by Lar:

So, I got the javascript to render after going into about:config and changing intl.charsetmenu.browser.unicode. While not particularly serious as of now, this could become a security flaw if a future browser release fails to respect the original page's encoding type.

2010-02-18 09:19:08 by qntm:

I'd be interested to know exactly what alternate page encoding you used, because that's just a string of harmless ASCII right there. I would also like to know how to defend against that particular exploit - if indeed I have any responsibility to handle what is surely a monstrous hypothetical browser vulnerability.

2010-02-18 14:55:56 by tute:

2010-02-18 15:01:18 by qntm:

Who in their right mind wants to learn something other than HTML for marking up web pages? Let alone learning yet another markup schema after Wiki markup and markdown and all.

2010-02-18 15:11:35 by Lar:

Sorry--it's UTF-7. UTF-7 renders < and > as +ADw- and +AD4-. In order to be able to select it in the list of character encodings, you need to add it into the above about:config preference (probably because of this security problem with the encoding.) I used the basic parser, so you might just want to hard code out allowing +ADw- and +AD4-.

2010-02-18 18:07:50 by qntm:

I've managed to duplicate this issue. It looks like all you need to do is manually override the character encoding using View/Character Encoding/Other/Unicode/UTF-7. I'm of the opinion that this is not something which it's my responsibility to protect the user against. I can't validate user comments in every character encoding under the Sun. This page is presented with the correct character-set HTTP header and <meta> tag, and if the browser fails to respect that, then that is a massive security flaw in the browser, which is not my responsibility, or a deliberate piece of stupidity on the part of the user, which nothing can protect against. Still, a fun piece of information and worth knowing. Thanks!

2010-02-22 08:08:20 by test:

Oh, uh, the "blank comment" message I posted earlier (which you've now broken, thankfully) was done by just pressing alt+0173. It's the non-displaying hyphen character.
