I have made some minor site tweaks today.
The last point was the one which required the most work. I realise that there are parsers out there which do the job, but I wanted something that was restrictive to the point of being puritanical so that my precious valid XHTML can't be disrupted by commenters who don't know how to mark stuff up. The new parser can handle any legitimate combination of the <p>, <em> and <strong> tags, as well as the HTML character entities for <, > and &, namely &lt;, &gt; and &amp; respectively - nothing else whatsoever is permissible. By only allowing a whitelist of tags and entities rather than blacklisting exploits as and when I find them, there's much less logical room to manoeuvre for inserting malicious markup and code. In particular, tag attributes are not permitted (you can't even add extra whitespace inside your tags, which is legal in XHTML - sorry, but this cuts out malicious CSS like style="font-size: 3000%").
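As a rough illustration of the whitelist idea (not the site's actual code, and in Python rather than PHP), the sketch below accepts only the exact tag spellings and the three entities listed above, and rejects everything else. The names `ALLOWED` and `tokenize` are made up for this example, and it only checks individual tokens, not whether the tags are correctly nested.

```python
import re

# Hypothetical whitelist: exact tag spellings, three entities, or plain text
# containing none of the special characters <, > and &.
ALLOWED = re.compile(r"</?(?:p|em|strong)>|&(?:lt|gt|amp);|[^<>&]+")

def tokenize(comment):
    """Split a comment into whitelisted tokens, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(comment):
        match = ALLOWED.match(comment, pos)
        if match is None:
            # Anything not on the whitelist is refused outright.
            raise ValueError(f"illegal markup at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

Because a tag must match character for character, attributes and extra whitespace inside a tag fail at the first step, so something like `<p style="font-size: 3000%">` never gets as far as the nesting check.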
I make no apology for how punishingly restrictive the new parser is. You can still, naturally, use the simple old one, which just turns your line breaks into <br />s, but doesn't allow bold or italicised text. Meanwhile, any of you who fancy trying to crack this thing open are welcome to try to poke holes in the new code, using the discussion below. Hope it holds up!
Okay, I have fixed a bug which was double-escaping special characters. Amazingly, it turned out that characters were being put through htmlspecialchars() somewhere between being POSTed as form data and being available to PHP's $_POST global variable. Being completely unable to find any kind of documentation or bug explaining this behaviour, I eventually magically solved the problem by putting header('Content-type: text/html; charset=UTF-8'); at the beginning of each page load. Why this worked, I don't really know.
It is also now impossible to submit certain special characters which are not permitted in XHTML documents, those between hex codes 0 and 1F. Unfortunately, I have discovered that PHP's Unicode support is extremely poor, varying between experimental and non-existent, so the treatment of special characters is not currently to my liking. There's no function to, for example, turn a single character into its hex sequence and back again, or to check that a specific hex sequence is defined in Unicode or not.
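For illustration only, here is roughly what that check looks like in a language with friendlier character handling (Python here, not the PHP the site runs on). The function name `find_forbidden` is invented for this sketch, and exempting tab, line feed and carriage return is an assumption based on XML 1.0 permitting those three characters; the site's actual rule may differ.

```python
# Assumption: tab, line feed and carriage return stay legal, as in XML 1.0.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}

def find_forbidden(text):
    """Return (offset, hex code) pairs for disallowed control characters."""
    return [
        (offset, format(ord(ch), "02X"))   # character -> two-digit hex code
        for offset, ch in enumerate(text)
        if ord(ch) < 0x20 and ord(ch) not in ALLOWED_CONTROLS
    ]
```

An empty result means the text is safe to accept; anything else can be reported back to the commenter with its position.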
The HTML parser I had written was a little unwieldy and it looked like it was going to be surprisingly difficult to maintain. I did some research into formal grammars, formal languages, context-free grammars and languages (which HTML is), regular grammars and languages (which HTML is not), formal language parsing and parsers. I formulated the small permissible subset of HTML which makes up comments on this site as a context-free grammar, and then rendered it into Chomsky Normal Form which allowed me to implement the CYK algorithm. Unfortunately, this algorithm has two major drawbacks: it operates in O(n³) time (a comment twice as long takes eight times as long to parse) and it has absolutely no capability to tell the user how, exactly, parsing has failed.
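To make those two drawbacks concrete, here is a minimal CYK recogniser in Python. It is a sketch over a toy grammar invented for this example (text and nestable <em> elements only), not the grammar used on this site, and the names `UNARY`, `BINARY` and `cyk` are hypothetical. It expects the comment to have been split into tokens already.

```python
# Toy grammar in Chomsky Normal Form (an assumption for illustration):
#   S -> S S | O R | text      R -> S C      O -> <em>      C -> </em>
UNARY = {"text": {"S"}, "<em>": {"O"}, "</em>": {"C"}}
BINARY = {("S", "S"): {"S"}, ("O", "R"): {"S"}, ("S", "C"): {"R"}}

def cyk(tokens, unary, binary, start="S"):
    """Return True if the token list can be derived from the start symbol."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(unary.get(tok, ()))
    for length in range(2, n + 1):              # span length
        for i in range(n - length + 1):         # span start
            for split in range(1, length):      # size of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for b in left:
                    for c in right:
                        table[i][length - 1].update(binary.get((b, c), ()))
    return start in table[0][n - 1]
```

The three nested loops over length, start and split are where the cubic running time comes from, and the only thing that comes out at the end is a yes or a no, with nothing to say where an invalid comment went wrong.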
I didn't release this new parser "into the wild" while I mulled over this problem. I quickly realised that the interesting thing about HTML is that there is only a single precise parse tree - while it is possible for the same HTML string to be correctly produced in many different ways, the parse tree arising from this production is always identical. I realised that the reason for this is that all the possible non-terminal characters in this CFG are either left-handed / producers or right-handed / productions - in rules of the form A → BC, any given non-terminal character either always appears as B or always appears as C. This sets much greater constraints on how a string can be formed. For example, if the string begins with a right-handed character C, then you know that the string cannot be formed because there is no left-handed B to produce at the same time.
In addition, if the rules are formulated so that BC is a unique key in production rules, then each two-character combination can only be formed in one way, from A (or not at all). So, when parsing, if you encounter BC, you KNOW you can INSTANTLY reduce that to A and continue to parse, no questions asked, no problems raised. Or, if you encounter a combination DE, where the handednesses match but there is no production rule, you KNOW that that combination can NEVER be produced, so you can instantly throw an error. As a result, you can now parse the string in linear time.
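One way to read the scheme above is as a stack-based reducer; the Python sketch below is that reading, not the code running on this site. The grammar is a toy invented for illustration (a single <p> containing text and nestable <em> elements), and the symbol names and the `parse` function are hypothetical. Every non-terminal is strictly left-handed or right-handed, and each (B, C) pair appears in at most one rule.

```python
# Hypothetical toy grammar. Left-handed symbols: PO, PH, EO, EH.
TERMINALS = {"<p>": "PO", "</p>": "PC", "<em>": "EO", "</em>": "EC"}
RIGHT = {"T", "E", "P", "PC", "EC"}     # right-handed; P is the start symbol
RULES = {                               # (B, C) -> A, each pair a unique key
    ("PO", "T"): "PH", ("PO", "E"): "PH",   # <p> plus its first content
    ("PH", "T"): "PH", ("PH", "E"): "PH",   # further content
    ("PH", "PC"): "P",                      # closing the paragraph
    ("EO", "T"): "EH", ("EO", "E"): "EH",
    ("EH", "T"): "EH", ("EH", "E"): "EH",
    ("EH", "EC"): "E",
}

def parse(tokens, start="P"):
    """Return True for a valid token list; raise ValueError saying where it fails."""
    stack = []      # only ever holds left-handed symbols
    done = False
    for pos, tok in enumerate(tokens):
        if done:
            raise ValueError(f"token {pos}: content after the final closing tag")
        sym = TERMINALS.get(tok, "T")   # anything that is not a tag counts as text
        # A right-handed symbol must pair with the symbol to its left at once.
        while sym in RIGHT and stack:
            left = stack.pop()
            if (left, sym) not in RULES:
                raise ValueError(f"token {pos}: {sym} cannot follow {left}")
            sym = RULES[(left, sym)]
        if sym in RIGHT:                # the stack ran out
            if sym != start:
                raise ValueError(f"token {pos}: {tok!r} has nothing to its left")
            done = True
        else:
            stack.append(sym)           # left-handed: wait for a partner
    if not done:
        raise ValueError(f"unexpected end of input, unclosed: {stack}")
    return True
```

Each token pushes at most one symbol and each reduction pops one, so the total work is proportional to the length of the input. Unlike CYK, a failure comes with the position of the offending token, for example a stray `</em>` directly inside a `<p>`.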
I implemented this last night. I can pin this down in more specific terms if anybody cares, but I'm pretty sure I've simply independently discovered one of these (though at first glance it's not easy to tell which one).
You can now use <br /> line breaks, and adding more formatting rules is now trivial for me (if I ever decide to).
I've also added some cunning .htaccess and PHP redirection rules to harmonise and homogenise URLs. URLs of the form "http://qntm.org/index.php?destroy" and "http://qntm.org/?destroy" will still work, but give permanent redirects to the canonical "http://qntm.org/destroy" page. This little niggling extra character in URLs has been something I've wanted to excise for at least a year, but never got around to it.
The secret? Use .htaccess to redirect to a PHP script, then handle the request URI using nice easy PHP. Much easier than grappling with Apache directives.
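The decision the script has to make is small. The sketch below shows just that logic in Python for illustration (the site itself does this in PHP); the function name `canonical_redirect` is invented, and it assumes the page name is carried as the whole query string, as in the example URLs above.

```python
from urllib.parse import urlsplit

def canonical_redirect(request_uri):
    """Return the path to permanently redirect to, or None if already canonical."""
    parts = urlsplit(request_uri)
    # "/index.php?destroy" and "/?destroy" both become "/destroy".
    if parts.path in ("/", "/index.php") and parts.query:
        return "/" + parts.query
    return None
```

When the function returns a path, the script would answer with a 301 and that location; otherwise it carries on and serves the page as normal.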