Topic: Some notes and miscellaneous things
Independent of character encoding of input and does not affect it
In practice this mostly works, because almost every encoding in common use today builds on ASCII as its base character set. Still, the assertion raises eyebrows: it is trivially easy to inject invalid characters or byte sequences into the output, especially with multibyte encodings. The onus is therefore on the developer to perform proper encoding checks, something most people don't do. If this is not handled properly, you CANNOT safely put the HTML into an XML file. Scope creep, perhaps, but I am a firm believer in giving people a full solution. It is also a bit worrying for some of the quirkier multibyte encodings.
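To make the "proper encoding checks" concrete, here is a minimal sketch of what a caller would have to do before trusting the output in an XML context. It assumes the input is supposed to be UTF-8; the function name is_safe_for_xml is made up for illustration and is not part of htmLawed or HTML Purifier.

```python
import re

# Complement of the XML 1.0 "Char" production: any code point matched here
# may not appear in a well-formed XML document.
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_safe_for_xml(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if data decodes cleanly and contains only XML-legal characters.

    Hypothetical helper: the default encoding is an assumption about the caller.
    """
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        # Truncated or otherwise malformed multibyte sequence.
        return False
    return _XML_ILLEGAL.search(text) is None


if __name__ == "__main__":
    print(is_safe_for_xml(b"caf\xc3\xa9"))  # well-formed UTF-8
    print(is_safe_for_xml(b"caf\xe9"))      # Latin-1 byte, malformed as UTF-8
    print(is_safe_for_xml(b"a\x00b"))       # decodes fine, but NUL is illegal in XML
```

Note that the check has two halves: the bytes must be well-formed in the declared encoding, and the decoded characters must be ones XML permits. Passing only the first half is not enough.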
output can be used in HTML 4
It seems that htmLawed will invariably produce XHTML-style <br />s, which are not proper for HTML 4 output.
I don't know. Is this a feature? :-P
removes soft-hyphen character (code-point 173 or #xad) in attribute values -- a vulnerability in some versions of the Opera browser
This is quite curious. Can I have a link? Also, the terminology used here is imprecise: although the soft hyphen is U+00AD, it is only represented as the single byte 0xAD in the ISO 8859 character sets; in UTF-8 it is the two-byte sequence 0xC2 0xAD, and so on for other encodings.
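To illustrate why "code-point 173" and "byte 0xAD" are not interchangeable, here is a small sketch. The helper names are mine, and I am not claiming this is how htmLawed does its stripping; it only shows the difference between removing the character and removing the byte.

```python
SOFT_HYPHEN = "\u00ad"


def soft_hyphen_bytes(encoding: str) -> bytes:
    """Byte sequence that represents U+00AD in the given encoding."""
    return SOFT_HYPHEN.encode(encoding)


def strip_soft_hyphens(value: str) -> str:
    """Remove soft hyphens from an already decoded attribute value."""
    return value.replace(SOFT_HYPHEN, "")


def strip_byte_ad(raw: bytes) -> bytes:
    """Naive byte-level removal of 0xAD; only correct for single-byte encodings."""
    return raw.replace(b"\xad", b"")


if __name__ == "__main__":
    for enc in ("latin-1", "utf-8", "utf-16-be"):
        print(enc, soft_hyphen_bytes(enc).hex())

    # In UTF-8 the byte 0xAD also occurs inside other characters, e.g. U+00ED.
    raw = "\u00ed".encode("utf-8")
    print(raw.hex(), strip_byte_ad(raw).hex())
```

Working on decoded text avoids the problem: a byte-level filter for 0xAD applied to UTF-8 input would leave a dangling lead byte behind, which is exactly the sort of invalid sequence the encoding claim above says cannot happen.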
understands improperly spaced tag content (like, spread over more than a line) and properly spaces them
Tags are allowed to span multiple lines. There is nothing improper about this. Consult the XML spec's rules on attribute-value normalization for how to deal with attribute values that have newlines in them.
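As a quick sanity check of the point above, here is a sketch using Python's standard xml.etree.ElementTree parser (chosen only because it is handy; any conforming XML parser should behave the same way). The helper name attribute_value is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def attribute_value(xml_source: str, name: str) -> Optional[str]:
    """Parse a single element and return one of its attribute values."""
    return ET.fromstring(xml_source).get(name)


# A tag spread over several lines, with a literal newline inside one value.
MULTILINE_TAG = '<a\n   href="page.html"\n   title="first line\nsecond line"\n/>'

# The same value with the newline written as a character reference.
CHAR_REF_TAG = '<a title="first line&#10;second line"/>'

if __name__ == "__main__":
    print(repr(attribute_value(MULTILINE_TAG, "href")))
    print(repr(attribute_value(MULTILINE_TAG, "title")))
    print(repr(attribute_value(CHAR_REF_TAG, "title")))
```

The multi-line tag parses without complaint. A literal newline inside the attribute value is normalized to a space, while a newline written as a character reference survives as a newline.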
There's probably more, but that's it for now.
My biggest admiration of the code is that it managed to steal much of HTML Purifier's functionality while being fast. Bugger that you didn't follow through with CSS and attribute values. My biggest problem with the code is that it's... how shall I put it: ugly. Even if it were just put through an automated indenter/formatter it would look much better. There are lots of short, unexplained variable names that help make the code indecipherable.
On another note, the demo is vulnerable to POST-based cross-site scripting. I'm also leery about user-specified regexes; you can easily DoS an application that way, since a single pathological pattern can tie up the regex engine.
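I haven't looked at how the demo echoes the submitted text back, so the following is only a sketch of the general fix, assuming the POSTed input is redisplayed inside a textarea. The function name render_preview and the field name are made up for illustration.

```python
import html


def render_preview(submitted: str) -> str:
    """Echo form input back inside a <textarea> without letting it break out.

    Hypothetical helper: the markup and field name are assumptions, not the demo's.
    """
    escaped = html.escape(submitted, quote=True)
    return '<textarea name="text">' + escaped + "</textarea>"


if __name__ == "__main__":
    payload = "</textarea><script>alert(1)</script>"
    print(render_preview(payload))
```

The filtered output can of course be shown as HTML elsewhere on the page; it is the raw, unfiltered echo of the request that needs escaping.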