1 (edited by Ambush Commander 2007-11-03 00:43:05)

Topic: Some notes and miscellaneous things

Independent of character encoding of input and does not affect it

Though in practice this can work, because almost every encoding in common use today uses ASCII as a base character set, this assertion raises eyebrows: it is trivially easy to inject invalid characters or byte sequences into the output, especially for multibyte encodings. Thus the onus is on the developer to perform proper encoding checks, something most people don't do. If this is not handled properly, you CANNOT safely put the HTML into an XML file. Scope creep, perhaps, but I am a firm believer in giving people a full solution. It is also a bit worrying for some of the quirkier multibyte encodings.
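
To illustrate the kind of check meant here, a minimal sketch in Python (htmLawed itself is PHP; the function name is_valid_utf8 is made up for this example, and it assumes the input is supposed to be UTF-8). It rejects malformed byte sequences before the text reaches any filter:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 in strict mode."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Well-formed input: 0xC3 0xA9 is a complete two-byte sequence.
print(is_valid_utf8(b"caf\xc3\xa9"))
# Truncated input: 0xC3 needs a continuation byte that never arrives.
print(is_valid_utf8(b"caf\xc3"))
```

Anything that fails the check would be rejected or repaired before filtering, rather than passed through untouched.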

output can be used in HTML 4

It seems that htmLawed will invariably produce <br />s, which are not proper for HTML output.

non-OOP code

I don't know. Is this a feature? :-P

removes soft-hyphen character (code-point 173 or #xad) in attribute values -- a vulnerability in some versions of the Opera browser

This is quite curious. Can I have a link? Also, the terminology used here is imprecise: although the soft hyphen is U+00AD, it is represented by the single byte 0xAD only in the ISO 8859 character sets; in UTF-8 it is the two-byte sequence 0xC2 0xAD, and so on.
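
The difference between code point and byte is easy to see in a short Python sketch (illustrative only, not htmLawed code; strip_soft_hyphens is a hypothetical helper). It also shows why a filter working on raw bytes would be risky: the byte 0xAD turns up inside other UTF-8 characters.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, decimal 173

latin1_bytes = SOFT_HYPHEN.encode("iso-8859-1")  # single byte AD
utf8_bytes = SOFT_HYPHEN.encode("utf-8")         # two bytes C2 AD
print(latin1_bytes.hex(), utf8_bytes.hex())

# The byte 0xAD also occurs inside other UTF-8 characters, e.g. U+00ED.
i_acute = "\u00ed".encode("utf-8")               # C3 AD
damaged = i_acute.replace(b"\xad", b"")          # hypothetical byte-level strip
print(damaged.hex())                             # lone lead byte, no longer valid UTF-8


def strip_soft_hyphens(text: str) -> str:
    """Remove U+00AD from already-decoded text (code-point level)."""
    return text.replace(SOFT_HYPHEN, "")
```

Working on decoded text, as in strip_soft_hyphens, avoids the corruption shown in the middle of the sketch.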

understands improperly spaced tag content (like, spread over more than a line) and properly spaces them

Tags are allowed to span over multiple lines. There is nothing improper about this. Consult the XML parsing spec on how to deal with attribute values with newlines in them.
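
For reference, a small Python sketch of what a conforming XML parser does with such input (the sample markup is invented for the example). The tag spans several lines and parses fine, and the literal newline inside the attribute value is normalized to a space, as the attribute-value normalization rules require:

```python
import xml.etree.ElementTree as ET

# A start tag spread over four lines, with a newline inside an attribute value.
doc = '<p\n  title="first line\nsecond line"\n  class="note"\n>text</p>'

elem = ET.fromstring(doc)
print(repr(elem.get("title")))
print(repr(elem.get("class")))
```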

There's probably more, but that's it for now.

What I admire most about the code is that it managed to steal much of HTML Purifier's functionality while being fast. Bugger that you didn't follow through with CSS and attribute values. My biggest problem with the code is that it's... how shall I put it: ugly. Even if it were just put through an automated indenter/formatter it would look much better. There are lots of short, unexplained variable names that help make the code indecipherable.

On another note, the demo is vulnerable to POST-based cross-site scripting. I'm also leery about user-specified regexes; you can easily DoS an application that way.

2 (edited by patnaik 2007-11-03 07:08:13)

Re: Some notes and miscellaneous things

non-OOP code: a 'feature' for PHP novices (including programmers of other languages wishing to port htmLawed) and functional programming enthusiasts. Also, considering htmLawed's objective, it results in shorter and faster code.

soft-hyphen: a carry-over from KSES code. I too was not sure of continuing this 'feature' but didn't see any harm in having the character removed from tag content. See this, re: Firefox (http://www.bihforum.com/showthread.php?t=4593), and this, re: Opera (http://ha.ckers.org/xss.html). Re: soft-hyphen as '#xAD', as you state, xAD is the Unicode codepoint, and I think it is the 'xAD' sequence that is the cause of the security vulnerability.

improperly spaced tag content: perhaps the wrong choice of words. But tag content is often inadvertently 'improperly' spaced, e.g., in form-based inputs with limited textarea widths, and in text editors that auto hard-wrap.

ugly code: htmLawed is a small utility to be used internally inside other software, and once stabilized is very unlikely to be modified. Thus the compact code which, once grasped, is quickly readable. I will add a readme file for developers.

attribute value checks: in some cases, like URLs (including those in 'style'), dynamic expressions in 'style', 'empty' attributes, 'id', etc., attribute values are checked. Checking is not done for all cases to keep htmLawed speedy. Checking has two aspects: standard-compliance (e.g., for length units), and administrator-compliance (e.g., length values). Even if attribute values don't adhere to standards, user-agents almost always display documents well enough and without security vulnerabilities (which htmLawed anyway attempts to remove). For administrator-compliance, like 'class' values being one of a certain set of values, htmLawed accepts such rules in its function arguments.

demo vulnerable: XSS, yes, but the input is not stored for later use and the only person vulnerable should be the scripter himself. Regex DoS: user-supplied regex is used only for preg_match.

<br>: I haven't checked but HTML 4 docs using '<br />' instead of '<br>' should be rendered okay by not too-old user-agents

character encoding: unlike, e.g., the excellent HTML Purifier, htmLawed doesn't do encoding checks for malformed/invalid byte sequences. htmLawed is certainly not 'complete' (that also means it can be more practical, and less resource-intensive or finicky). Encoding checks can be performed by other means before text is passed to htmLawed.

3

Re: Some notes and miscellaneous things

a carry-over from KSES code. I too was not sure of continuing this 'feature' but didn't see any harm in having the character removed from tag content. See this, re: Firefox, and this, re: Opera. Re: soft-hyphen as '#xAD', as you state, xAD is the Unicode codepoint, and I think it is the 'xAD' sequence that is the cause of the security vulnerability.

I have absolutely no sympathy for those still running Firefox 1.0.6. As for the Opera exploit, it's a very specific one that you should test for, rather than nuking all soft-hyphens. Soft hyphens are very important for users of languages where there are no clearly defined word boundaries.

htmLawed is a small utility to be used internally inside other software, and once stabilized is very unlikely to be modified. Thus the compact code which, once grasped, is quickly readable. I will add a readme file for developers.

I await such a readme file. :-)

XSS, yes, but the input is not stored for later use and the only person vulnerable should be the scripter himself.

See this: http://www.whiteacid.org/misc/xss_post_forwarder.php and this page (http://www.whiteacid.org/misc/xss_post_forwarder.php?xss_target=http://babelfish.altavista.com/tr&doit=done&intl=1&tt=urltext&trtext=This+is+a+test&lp=en_de&btnTrTxt=Translate) which provides an example of auto-submitting JavaScript code. Even POST XSS is dangerous, though less so than GET.

I haven't checked but HTML 4 docs using '<br />' instead of '<br>' should be rendered okay by not too-old user-agents

It'll be rendered properly (in fact, we use <br /> instead of <br/> specifically for that reason). However, it actually means something different in an HTML context:

W3C wrote:

The sequence <FOO /> can be interpreted in at least two different ways, depending on the DOCTYPE of the document. For HTML 4.01 Strict, the '/' terminates the tag <FOO (with an implied '>'). However, since many browsers don't interpret it this way, even in the presence of an HTML 4.01 Strict DOCTYPE, it is best to avoid it completely in pure HTML documents and reserve its use solely for those written in XHTML.
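
If the output does need to be served as HTML 4, one option is a post-processing pass. A minimal Python sketch (not htmLawed or HTML Purifier code; the function name and tag list are made up for illustration, and a regex like this will trip on attribute values that contain '>'):

```python
import re

# Illustrative subset of HTML 4 empty elements.
VOID_TAGS = ("br", "hr", "img", "input", "meta", "link", "area", "base", "col", "param")

# Matches an XHTML-style self-closing tag for one of the elements above.
_SELF_CLOSING = re.compile(
    r"<(%s)\b([^<>]*?)\s*/>" % "|".join(VOID_TAGS),
    re.IGNORECASE,
)


def to_html4_void_tags(markup: str) -> str:
    """Rewrite '<br />' style tags as '<br>', keeping any attributes."""
    return _SELF_CLOSING.sub(r"<\1\2>", markup)


print(to_html4_void_tags('line one<br />line two<img src="a/b.png" alt="" />'))
```

Non-empty elements written in self-closing form, such as <p />, are left alone on purpose, since dropping the slash there would change the document structure.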