Since UTF-8 is a complex beast to validate, I've made some slight further changes to the parser that's used to validate comments here on the site. One of the great things about HTML is how easy it is to parse and how many flexible approaches there are. I've managed to come up with a way which works in linear time, but which also involves quite a lot of raw data to specify exactly what is permitted and what isn't. This parser is clever, though, because it parses the text literally one character at a time, which means that I can seamlessly combine it with my new UTF-8 validating routines instead of having to rely on PHP's abysmal UTF-8 support, such as the mb_substr() function, which utterly fails to notice when invalid UTF-8 is passed to it. I deliberately set out to avoid the use of magic numbers and to make the routine as simple to parse as possible.
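To make the "one character at a time" idea concrete, here is a minimal sketch of a UTF-8 validator that consumes a single byte per call, so it could sit in the same loop as an HTML state machine. It is written in Python for brevity and is not the code used on the site; the class name `Utf8Validator` and the helper `is_valid_utf8` are made up for illustration. The byte ranges follow the well-formed sequence table in RFC 3629, which excludes overlong encodings, surrogates and anything above U+10FFFF.

```python
class Utf8Validator:
    """Validates UTF-8 one byte at a time (byte ranges per RFC 3629)."""

    def __init__(self):
        self.pending = 0                # continuation bytes still expected
        self.lo, self.hi = 0x80, 0xBF   # allowed range for the next continuation byte

    def feed(self, byte):
        """Consume one byte; return False as soon as the input is invalid."""
        if self.pending:
            if not (self.lo <= byte <= self.hi):
                return False
            self.pending -= 1
            self.lo, self.hi = 0x80, 0xBF   # later continuation bytes use the full range
            return True
        if byte <= 0x7F:                    # plain ASCII
            return True
        if 0xC2 <= byte <= 0xDF:            # 2-byte lead (0xC0 and 0xC1 would be overlong)
            self.pending = 1
        elif 0xE0 <= byte <= 0xEF:          # 3-byte lead
            self.pending = 2
            if byte == 0xE0:
                self.lo = 0xA0              # rules out overlong 3-byte forms
            elif byte == 0xED:
                self.hi = 0x9F              # rules out surrogates U+D800..U+DFFF
        elif 0xF0 <= byte <= 0xF4:          # 4-byte lead
            self.pending = 3
            if byte == 0xF0:
                self.lo = 0x90              # rules out overlong 4-byte forms
            elif byte == 0xF4:
                self.hi = 0x8F              # rules out code points above U+10FFFF
        else:
            return False                    # stray continuation byte or invalid lead
        return True

    def at_boundary(self):
        """True when no multi-byte sequence is left unfinished."""
        return self.pending == 0


def is_valid_utf8(data):
    """Convenience wrapper: feed every byte, then check nothing is left dangling."""
    validator = Utf8Validator()
    return all(validator.feed(b) for b in data) and validator.at_boundary()
```

Because `feed` reports failure on the exact byte that breaks the rules, a combined parser can reject bad input at the same position where it would reject bad markup.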
And then I also remembered that, other than my Perl Snake Cube solver, which I presented a few weeks ago, it's been years since I actually put any code of mine online. With this in mind, I decided that you folks might be interested in taking a look at my parser and seeing whether you think it's good code and whether it has any obvious defects.
Some things I will say right away:
This is not the precise code used in the parser; that has better integration. This is just so you can see how it works. Uncomment lines at the top to try out the various failure modes.
Sorry, there aren't any test cases for the HTML validation; figure those out yourself.
This parser allows only a small whitelist of HTML tags and character entities. Attributes on those tags are not permitted, although whitespace is acceptable. This small whitelist nevertheless gives rise to a lot of raw data making up the formal grammar rules at the top. While this is necessary, it should be possible to generate these rules automatically from a simpler set of constraints, rather than programming them all in manually. Otherwise, broadening the rules to allow tag attributes and new tags becomes very difficult.
The parser is "slow" to detect problems. For example, given the broken HTML string "<h5><p", it will let you fully finish the incomplete paragraph tag before realising that you are, in fact, trying to put a paragraph tag inside an <h5>, which is not allowed. A smarter parser should recognise that there is only one way to finish a "<p" and conclude that even starting off this way is not okay. This would require a little looking ahead, and might impair performance.
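The last two points can be sketched together. One possible approach, which is not what the site's parser does, is to write down only which child tags each context permits and expand that into a prefix table: for every context and every partial tag name, the set of characters that can still lead to an allowed tag. Walking that table one character at a time rejects "<p" inside an <h5> at the "p" itself, with no explicit lookahead. The whitelist, the nesting rules and the function names below are all hypothetical, and the sketch ignores closing tags, entities and whitespace inside tags.

```python
# Hypothetical whitelist: which child tags each context permits.
# None stands for the top level. Not the site's real rules.
ALLOWED_CHILDREN = {
    None: {"p", "h5", "blockquote"},
    "blockquote": {"p"},
    "p": {"em", "strong"},
    "h5": {"em"},
    "em": set(),
    "strong": set(),
}


def build_transitions(allowed_children):
    """Expand the whitelist into (context, prefix) -> set of permitted next characters."""
    table = {}
    for context, tags in allowed_children.items():
        for tag in tags:
            for i in range(len(tag)):
                # After reading tag[:i], the character tag[i] may follow.
                table.setdefault((context, tag[:i]), set()).add(tag[i])
            # After the complete tag name, only ">" may follow.
            table.setdefault((context, tag), set()).add(">")
    return table


def first_rejected_index(table, context, tag_text):
    """Walk an opening tag such as "<p" one character at a time.

    Returns the index of the first character that cannot lead to any tag
    allowed in `context`, or None if the text is still a viable prefix.
    Anything after the closing ">" is ignored.
    """
    if not tag_text.startswith("<"):
        return 0
    prefix = ""
    for index, ch in enumerate(tag_text[1:], start=1):
        if ch not in table.get((context, prefix), set()):
            return index
        if ch == ">":
            break
        prefix += ch
    return None


RULES = build_transitions(ALLOWED_CHILDREN)
```

With these made-up rules, `first_rejected_index(RULES, "h5", "<p")` returns 1, meaning the "p" is refused immediately, while the same text at the top level returns None because "<p" can still become a valid tag. Adding a new tag means editing one entry in `ALLOWED_CHILDREN` rather than hand-writing every transition; the cost is a larger table held in memory.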
As ever, please go ahead and try to break validation in the comments below.