UTF-8 (Page 1) — htmLawed — PHP Labware forum

1 Topic by Ripiajo 2009-11-24 19:38:12

Ripiajo
Member

Topic: UTF-8

Hi, htmLawed looks very promising, but I have a doubt about how it handles UTF-8. Many of the PHP regular expressions in the code do not use the /u modifier, which suggests that it will not handle multibyte UTF-8 input properly. Is that so?

2 Reply by patnaik 2009-11-26 10:35:40

patnaik
Administrator

Re: UTF-8

I haven't seen UTF-8 issues with htmLawed.

The regular expressions in the code look for patterns with characters such as '&' and '>', which are ASCII characters. UTF-8 is ASCII-compatible. "In other words, single byte ASCII characters retain their encoded value in UTF-8. For example, code that checks for a '\' can continue checking for the byte value 0x5C instead of changing the code to check for 0x005C."

I think htmLawed would therefore work properly with UTF-8. I have tested this, though only to a limited degree, with content from different 'international' web pages.
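
The ASCII-compatibility argument can be checked with a short script. This is a minimal sketch, not code from htmLawed: the sample string and the ampersand-escaping pattern are assumptions chosen for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// "café & 日本語 > x", written with escapes so the file encoding does not matter.
$s = "caf\u{E9} & \u{65E5}\u{672C}\u{8A9E} > x";

// Escape bare ampersands, once in byte mode and once in UTF-8 mode.
$byteMode = preg_replace('/&(?!amp;)/', '&amp;', $s);
$utf8Mode = preg_replace('/&(?!amp;)/u', '&amp;', $s);

// In UTF-8, every byte of a multibyte sequence is 0x80 or higher, so a
// byte-wise search for an ASCII character cannot match inside one.
var_dump($byteMode === $utf8Mode);       // bool(true)
var_dump(preg_match('//u', $byteMode));  // int(1): the result is still valid UTF-8
```

This only covers patterns that match literal ASCII characters; it says nothing about patterns that count characters.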

3 Reply by Ripiajo 2009-11-26 19:38:44

Ripiajo
Member

Re: UTF-8

Even if all the problematic characters are ASCII, that does not mean that htmLawed will necessarily work with UTF-8. A regex like /a.{0,1}b/ will work with ASCII but not with UTF-8: it fails to match when the character between 'a' and 'b' is encoded as more than one byte. Have you checked for this type of error in htmLawed?

The problem might come from bounded quantifiers, because regular expressions without the /u modifier treat one byte as one character, and that does not hold in UTF-8.
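
The quantifier concern can be reproduced directly. A minimal sketch, with sample strings chosen for illustration (the "\u{...}" escape assumes PHP 7 or later):

```php
<?php
$ascii = "aXb";
$utf8  = "a\u{E9}b";  // "aéb": é is encoded as two bytes, 0xC3 0xA9

// Byte mode: "." matches exactly one byte, so {0,1} cannot span "é".
var_dump(preg_match('/a.{0,1}b/', $ascii));  // int(1)
var_dump(preg_match('/a.{0,1}b/', $utf8));   // int(0)

// UTF-8 mode: "." matches one whole character.
var_dump(preg_match('/a.{0,1}b/u', $utf8));  // int(1)

// An unbounded quantifier is not affected, because it can absorb both bytes.
var_dump(preg_match('/a.*b/', $utf8));       // int(1)
```

So the risk is specific to patterns where the exact count matters, such as `.`, `.{n}`, or `.{m,n}`.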

Thank you!

4 Reply by patnaik 2009-11-27 05:10:02

patnaik
Administrator

Re: UTF-8

Thanks for your comment. I will look into this.

5 Reply by Ripiajo 2009-11-28 09:36:01

Ripiajo
Member

Re: UTF-8

The 'clean_ms_char'=>1 config option seems to corrupt UTF-8-encoded text.
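
One plausible mechanism for this kind of corruption is a byte-wise replacement of Windows-1252 characters applied to text that is actually UTF-8. The sketch below is an assumption for illustration only: the two-entry `$map` table is hypothetical and is not htmLawed's actual code.

```php
<?php
// Hypothetical table: Windows-1252 curly double quotes mapped to entities.
$map = ["\x93" => '&#8220;', "\x94" => '&#8221;'];

$cp1252 = "\x93hi\x94";  // curly-quoted "hi" in Windows-1252
$utf8   = "\u{D3}scar";  // "Óscar" in UTF-8; Ó is encoded as 0xC3 0x93

// On Windows-1252 input the replacement does what is intended.
var_dump(strtr($cp1252, $map));  // string(16) "&#8220;hi&#8221;"

// On UTF-8 input the byte 0x93 is the second half of "Ó", so replacing
// it leaves a stray 0xC3 and the string is no longer valid UTF-8.
$broken = strtr($utf8, $map);
var_dump(preg_match('//u', $broken));  // bool(false)
```

If this is what happens, the option is only safe on input known to be in a single-byte Windows encoding.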

6 Reply by Ripiajo 2009-11-29 20:26:07

Ripiajo
Member

Re: UTF-8

The following regex, found in the htmLawed code, probably does not work correctly with UTF-8: it uses quantifiers without the /u modifier, so it counts bytes in the string, not characters.

while(preg_match('`(?<=/)([^/]{3,}|[^/.]+?|\.[^/.]|[^/.]\.)/\.\./`', $p)){

There are probably more such patterns in the code; I have not done an exhaustive search. This one might not be too bad because I think it operates on URLs, where UTF-8 characters are unlikely to be found, except perhaps in malicious exploits?
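
A quick way to probe this particular pattern is to run it in both modes on paths with multibyte segments. This is a sketch under assumptions: the post shows only the `while` condition, so the `preg_replace` loop body, the `collapseDotDot` helper name, and the sample paths are all hypothetical.

```php
<?php
// Hypothetical helper: repeatedly removes "segment/../" using the quoted
// pattern. $mod is '' for byte mode or 'u' for UTF-8 mode.
function collapseDotDot(string $p, string $mod = ''): string {
    $re = '`(?<=/)([^/]{3,}|[^/.]+?|\.[^/.]|[^/.]\.)/\.\./`' . $mod;
    while (preg_match($re, $p)) {
        $p = preg_replace($re, '', $p);  // assumed loop body
    }
    return $p;
}

$twoByte   = "/a/\u{E9}/../b";    // segment "é" is 2 bytes
$threeByte = "/a/\u{65E5}/../b";  // segment "日" is 3 bytes

var_dump(collapseDotDot($twoByte));         // string(4) "/a/b"
var_dump(collapseDotDot($twoByte, 'u'));    // string(4) "/a/b"
var_dump(collapseDotDot($threeByte));       // string(4) "/a/b"
var_dump(collapseDotDot($threeByte, 'u'));  // string(4) "/a/b"
```

For these two inputs the byte-mode and /u results agree, although different alternatives of the pattern do the matching in each mode. That is not an exhaustive test and does not rule out other inputs behaving differently.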

7 Reply by patnaik 2009-12-16 21:52:59

patnaik
Administrator

Re: UTF-8

1. I have avoided use of the 'u' pattern modifier in the regular expressions because it is UTF-8-specific and may give wrong results for UTF-16, etc., which are valid encodings for web content.

2. The 'clean_ms_char' option can corrupt the input, so it should be used only where appropriate. See, e.g., the htmLawed documentation or this post.
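
Regarding point 1, the effect of the 'u' modifier on input that is not UTF-8 can be seen with a two-byte UTF-16LE string. A minimal sketch, with the sample bytes chosen for illustration:

```php
<?php
$utf16 = "\xE9\x00";  // "é" encoded as UTF-16LE

// Byte mode accepts any byte sequence.
var_dump(preg_match('/./', $utf16));   // int(1)

// With /u, PCRE validates the subject as UTF-8 first. 0xE9 starts a
// three-byte sequence that is never completed, so matching fails.
var_dump(preg_match('/./u', $utf16));  // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

A `false` return is easy to mistake for "no match" when only the truthiness of the result is checked.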

8 Reply by patnaik 2012-09-17 23:32:55

patnaik
Administrator

Re: UTF-8

Also see this topic.

[ Closed ] UTF-8

Posts: 8
