1

Topic: UTF-8

Hi, htmLawed looks very promising, but I have a doubt about how it handles UTF-8. Many of the PHP regular expressions in the code do not use the /u modifier, which suggests that it will not handle multibyte UTF-8 input properly. Is that so?

2

Re: UTF-8

I haven't seen UTF8 issues with htmLawed.

The regular expressions in the code look for patterns with characters such as '&' and '>', which are ASCII characters. UTF-8 is ASCII-compatible. "In other words, single byte ASCII characters retain their encoded value in UTF-8. For example, code that checks for a '\' can continue checking for the byte value 0x5C instead of changing the code to check for 0x005C."

I think htmLawed would therefore work properly with UTF8. I have tested this, though only to a limited degree, with content from different 'international' web-pages.
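
To illustrate the ASCII-compatibility argument, here is a minimal sketch (the sample string is made up, and the pattern is not taken from htmLawed). It relies on the fact that every byte of a multibyte UTF-8 sequence is 0x80 or higher, so a byte equal to '<', '>' or '&' can only be the real ASCII character:

```php
<?php
// Made-up sample: ASCII markup characters mixed with multibyte UTF-8 characters.
$text = "<p>caf\u{00E9} &amp; \u{65E5}\u{672C}\u{8A9E}</p>";

// Search for the markup characters in byte mode and in UTF-8 mode.
// PREG_OFFSET_CAPTURE reports byte offsets in both cases.
$byteMode = preg_match_all('`[<>&]`', $text, $m1, PREG_OFFSET_CAPTURE);
$utf8Mode = preg_match_all('`[<>&]`u', $text, $m2, PREG_OFFSET_CAPTURE);

// Both modes find the same characters at the same byte offsets.
var_dump($byteMode === $utf8Mode && $m1 === $m2);
```

This only covers patterns that look for specific ASCII characters; it says nothing about patterns that count characters.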

3

Re: UTF-8

Even if all the problematic characters are ASCII, that does not mean that htmLawed will necessarily work with UTF-8. A regex like /a.{0,1}b/ will work with ASCII but not with UTF-8. Have you checked for this type of error in htmLawed?

The problem might come from quantifiers: regular expressions without the /u modifier treat one byte as one character, which is not the case in UTF-8.
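
A minimal sketch of the difference, using the example pattern from this post (the test strings are made up):

```php
<?php
$ascii = "axb";          // one ASCII character between 'a' and 'b'
$utf8  = "a\u{00E9}b";   // 'é' is one character but two bytes (0xC3 0xA9)

// Without /u, '.' matches a single byte.
$r1 = preg_match('/a.{0,1}b/', $ascii);  // matches
$r2 = preg_match('/a.{0,1}b/', $utf8);   // no match: two bytes sit between 'a' and 'b'

// With /u, '.' matches a whole UTF-8 character.
$r3 = preg_match('/a.{0,1}b/u', $utf8);  // matches

var_dump($r1, $r2, $r3);
```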

Thank you!

4

Re: UTF-8

Thanks for your comment. I will look into this.

5

Re: UTF-8

The 'clean_ms_char'=>1 config option seems to corrupt utf-8 encoded text.
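
A possible explanation, sketched under the assumption that the option replaces individual bytes in the Windows-1252 range 0x80-0x9F. The map below is a hypothetical, shortened stand-in, not htmLawed's actual table:

```php
<?php
// Hypothetical byte map in the spirit of a Windows-1252 clean-up (assumption).
$map = array("\x80" => '&#8364;', "\x93" => '"', "\x94" => '"');

// In Windows-1252 text, 0x93 and 0x94 are curly double quotes.
$cp1252 = "\x93hi\x94";
$r1 = strtr($cp1252, $map);   // straight quotes around "hi"

// In UTF-8, U+201C (left curly quote) is the three bytes E2 80 9C.
$utf8 = "\u{201C}hi";
$r2 = strtr($utf8, $map);     // the 0x80 byte inside the character gets replaced

// preg_match with /u fails on invalid UTF-8, so it doubles as a validity check.
var_dump(preg_match('//u', $utf8) === 1);  // valid before the replacement
var_dump(preg_match('//u', $r2) === 1);    // no longer valid afterwards
```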

6

Re: UTF-8

The following regex, found in the htmLawed code, probably does not work correctly with UTF-8 because it uses quantifiers without the /u option, so it counts bytes in the string, not characters.

while(preg_match('`(?<=/)([^/]{3,}|[^/.]+?|\.[^/.]|[^/.]\.)/\.\./`', $p)){

There are probably more in the code; I have not done an exhaustive search. This one might not be too bad because I think it operates on URLs, where UTF-8 characters are unlikely to be found, except perhaps in malicious exploits.
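
One way to test this suspicion is to run the quoted pattern with and without /u on a few paths that contain multibyte segments. The helper name and the sample paths below are hypothetical, and the loop body is an assumption about how the pattern is applied:

```php
<?php
// Hypothetical helper: repeatedly removes "segment/../" using the quoted pattern.
function collapse_dot_segments($p, $modifier = '')
{
    $re = '`(?<=/)([^/]{3,}|[^/.]+?|\.[^/.]|[^/.]\.)/\.\./`' . $modifier;
    while (preg_match($re, $p)) {
        $p = preg_replace($re, '', $p);
    }
    return $p;
}

$paths = array(
    "/a/\u{00E9}/../b",          // segment is one 2-byte character
    "/a/.\u{00E9}/../b",         // dot followed by a 2-byte character
    "/a/\u{65E5}\u{672C}/../b",  // segment is two 3-byte characters
);

// Print the byte-mode and /u results side by side for comparison.
foreach ($paths as $p) {
    printf("%s => %s | %s\n", $p, collapse_dot_segments($p), collapse_dot_segments($p, 'u'));
}
```

For these three paths both variants reduce the input to /a/b, which fits the idea that the bytes of a multibyte character are never '/' or '.'. It is only a spot check, not proof for every input or for the other patterns in the code.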

7

Re: UTF-8

1. I have avoided use of the 'u' pattern modifier in the regular expressions because it is UTF-8-specific and may give wrong results for UTF-16, etc., which are valid encodings for web content.

2. The 'clean_ms_char' option can corrupt the input, so it should be used only where appropriate. See, e.g., the htmLawed documentation or this post.
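
The concern in point 1 can be illustrated with a minimal sketch. The strings are hand-encoded, made-up samples, so no extension is needed for the conversion:

```php
<?php
// "é" (U+00E9) followed by "b" in two encodings.
$utf8    = "\xC3\xA9b";       // UTF-8
$utf16le = "\xE9\x00b\x00";   // UTF-16LE

// With /u the subject must be valid UTF-8.
$r1 = preg_match('/b/u', $utf8);      // matches
$r2 = preg_match('/b/u', $utf16le);   // fails: the subject is not valid UTF-8
$badUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);

// Without /u the pattern is applied byte by byte and still finds the 'b' byte.
$r3 = preg_match('/b/', $utf16le);

var_dump($r1, $r2, $badUtf8, $r3);
```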

8

Re: UTF-8

Also see this topic.