1

Topic: htmLawed UTF-8 bug (soft-hyphen-related)

Dear all,

the following input produces malformed output (garbage characters) and is easy to reproduce in the htmLawed online test:

<p><a href="http://www.google.cz/search?q=í">test</a></p>

From the user's point of view, this is valid input, and htmLawed should not mangle the output, whether or not the URL is correctly URL-encoded.

2

Re: htmLawed UTF-8 bug (soft-hyphen-related)

Thanks for reporting this issue. It is indeed a bug, and I will come up with a fix soon.

As a security feature, htmLawed looks for the soft-hyphen character in URL values and converts it to a plain space character. Soft hyphens can be used to obfuscate URLs for malicious purposes (see http://www.symantec.com/connect/blogs/soft-hyphen-new-url-obfuscation-technique).

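The UTF-8 byte representation of the "í" character in your example is C3 AD. Because a lone AD byte ("\xAD") also represents the soft-hyphen character in single-byte encodings such as ISO-8859-1 (in UTF-8 the soft hyphen is the two-byte sequence C2 AD), htmLawed treats the second byte of "í" as a soft hyphen and alters it, leaving an invalid byte sequence that shows up as garbage characters.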
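To make the byte-level problem concrete, here is a minimal sketch in Python. It is not htmLawed's actual code (htmLawed is written in PHP); the function names `naive_byte_replace` and `codepoint_aware_replace` are made up for illustration, and the sketch simply assumes that the faulty step behaves like a plain search-and-replace on the raw AD byte.

```python
# Illustration only: hypothetical helpers, not htmLawed's implementation.

SOFT_HYPHEN = "\u00ad"  # U+00AD; encoded in UTF-8 as the two bytes C2 AD


def naive_byte_replace(url: str) -> bytes:
    """Replace every raw 0xAD byte with a space, ignoring character boundaries."""
    return url.encode("utf-8").replace(b"\xad", b" ")


def codepoint_aware_replace(url: str) -> str:
    """Replace only real soft-hyphen characters (U+00AD) with a space."""
    return url.replace(SOFT_HYPHEN, " ")


url = "http://www.google.cz/search?q=í"

# "í" is C3 AD in UTF-8, so the byte-level replace destroys its second byte.
broken = naive_byte_replace(url)
print(broken[-2:])                               # b'\xc3 '
print(broken.decode("utf-8", errors="replace"))  # ends with U+FFFD and a space

# Working on characters instead of bytes leaves "í" untouched...
print(codepoint_aware_replace(url) == url)       # True
# ...while a genuine soft hyphen is still converted to a space.
print(codepoint_aware_replace("exam\u00adple"))  # exam ple
```

The point of the comparison is only that the check needs to match the soft hyphen as a whole character rather than as a single byte.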

3

Re: htmLawed UTF-8 bug (soft-hyphen-related)

A new version of htmLawed (1.1.19, 19 Jan. 2015) that attempts to address this issue has been released.

4

Re: htmLawed UTF-8 bug (soft-hyphen-related)

Thank you very much for such a quick reply and fix. It now works great. Thanks again.