1

Topic: Possible bug that may lead to security issues

The following HTML, when validated by htmLawed, returns a strange result:

<a href="http://xyz.com/>" target="_blank" rel="nofollow">Link</a> 

(note the > in the href value) is converted to something like:

<a>" target="_blank" rel="nofollow"&gt;http://xyz.com/&gt;</a>;

I used the latest stable htmLawed for this test.

2

Re: Possible bug that may lead to security issues

This is strange... the output that I get with the current release of htmLawed, using either the default settings or the 'safe' parameter set to 1, is:

<a>" target="_blank" rel="nofollow"&gt;Link</a>

You can test this at the htmLawed test site.

Can you re-confirm your output?

3

Re: Possible bug that may lead to security issues

I can. I'm using 'safe' => 1, 'abs_url' => 1, 'keep_bad' => 2, 'make_tag_strict' => 0, and

'a' => array('-*',
    'href(maxlen=300/minlen=1)',
    'title(maxlen=300/minlen=0)',
    'class(maxlen=50/minlen=1/match=%^wysiwyg-%)'
),

Even so, this result, whether your version or mine, is strange. A ">" inside the href value should not be treated as the end of the <a> start tag. I'd expect the result to be exactly the same as the input, or at most to have the ">" character escaped in the URL, and certainly not to see " target="_blank" rel="nofollow" output as text.

4

Re: Possible bug that may lead to security issues

I still don't get the 'strange' output with the 'config.' and 'spec.' settings that you mention. It appears that there is some non-htmLawed factor at play here.

--

I do not think this is a security issue.

But I agree with you that htmLawed should be able to pick up the 'href' attribute name-value pair in the malformed HTML. I will look into the code to fix it.

5

Re: Possible bug that may lead to security issues

I agree. After thinking about it more, I can't see how this could be exploited, so this is not a security problem.

I tried to replicate it on the test site, but I could not. It always returns the same output that you mentioned. I'll try to isolate what is different here. Either way, if you can fix the issue so that ">" is accepted, the difference seems irrelevant.

Thank you very much for the reply (and hopefully for the future fix)!

6

Re: Possible bug that may lead to security issues

> if you can fix the issue to accept ">"

I am having second thoughts about this.

While htmLawed or any similar software can deal with malformed HTML to some degree, it obviously cannot handle every kind of ill-written HTML. You can imagine malformed HTML text in which the original intent of the author (the correct HTML) can be interpreted in more than one way.

It is possible to have logic in the code to decipher the 'most likely correct' HTML but it comes at a cost -- code size and complexity, and processing time -- and it still will not work 100% of the time.

So, for the time being, I probably will not look to implement a 'fix' for the issue.

In case you are interested, roughly speaking, htmLawed starts by breaking the input text into fragments and then further processes those of type <xxx> (line 104, current version of htmLawed.php). htmLawed then reads the HTML element name and the attribute name-value pairs in the fragment. It looks for xxx="xxx", xxx='xxx' and xxx=xxx patterns to identify the name-value pairs (line 476). This is where it filters out the URL in the example you posted: the fragment ends at the first ">", so the 'href' value has no closing quote and matches none of the patterns. By making htmLawed accept xxx="xxx, it is possible to make it accept the URL value for 'href', but you can imagine scenarios where this can lead to 'misreading' of malformed HTML.
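To make the two steps above concrete, here is a rough sketch in Python (not htmLawed's PHP code). The two regular expressions and the function name first_tag_attributes are simplified assumptions made up for this illustration; the actual patterns in htmLawed.php are different and more involved.

```python
import re

# Hypothetical, simplified stand-ins for the two steps; NOT htmLawed's real patterns.
# Step 1: a "<xxx>" fragment runs from "<" to the first ">".
TAG_FRAGMENT = re.compile(r"<[^>]*>")
# Step 2: name-value pairs of the form xxx="xxx", xxx='xxx' or xxx=xxx.
ATTR_PAIR = re.compile(r"""([a-zA-Z_:][-\w:.]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def first_tag_attributes(html):
    """Return (first tag fragment, list of (name, value) pairs found in it)."""
    match = TAG_FRAGMENT.search(html)
    if match is None:
        return "", []
    fragment = match.group(0)
    return fragment, ATTR_PAIR.findall(fragment)


malformed = '<a href="http://xyz.com/>" target="_blank" rel="nofollow">Link</a>'
fragment, attrs = first_tag_attributes(malformed)
leftover = malformed[len(fragment):]

print(fragment)  # the fragment stops at the ">" inside the href value
print(attrs)     # no pair matches, because the href value has no closing quote
print(leftover)  # the rest of the tag is left behind as ordinary text

wellformed = '<a href="http://xyz.com/" target="_blank" rel="nofollow">Link</a>'
good_fragment, good_attrs = first_tag_attributes(wellformed)
print(good_attrs)  # all three name-value pairs are found
```

In this sketch the fragment for the malformed input is <a href="http://xyz.com/> and no attribute is recognized, so the remainder, starting with " target="_blank", is treated as text. That is consistent with the output reported earlier in the thread, though the sketch only approximates what htmLawed does.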

7

Re: Possible bug that may lead to security issues

I see. Well, RFC 3986 says that > is not a valid character in a URL, so I can't complain about your decision. And I understand how that change would influence the parser. Thanks!
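For anyone who hits the same problem: one possible workaround, assuming you control the code that generates the links, is to percent-encode the URL before it goes into the href, so that no raw > ever reaches the filter. A minimal sketch in Python using the standard library (the variable names are only for illustration):

```python
from urllib.parse import quote

url = "http://xyz.com/>"
# Keep ":" and "/" as they are; characters such as ">" are percent-encoded.
encoded = quote(url, safe=":/")
tag = '<a href="{}" target="_blank" rel="nofollow">Link</a>'.format(encoded)

print(encoded)
print(tag)
```

Here the > becomes %3E, so the start tag contains only one >, the one that closes it. Note that the safe set above is chosen for this simple URL; a URL with a query string or fragment would need characters such as ? and # kept as well.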