1

Topic: Suggesting change: handling of multibyte chars in attribute values

Hi, as we are introducing htmLawed into EGroupware, we ran into a problem with the handling of multibyte characters in tag attribute values in mbstring.func_overload environments. strlen is overloaded there and counts characters, but substr_replace has no multibyte overload and still counts bytes. Thus substr_replace($a, '', 0, strlen($m)) removes too few bytes in such environments.

I would like to suggest the following change, which provides (and uses) an hl_bytes function in place of strlen. Please consider whether applying such a change in an upcoming release would be worthwhile for the project.

--- htmLawed.php    (Revision 39326)
+++ htmLawed.php    (Arbeitskopie)
@@ -471,7 +471,7 @@
    }
   break; case 2: // Val
    if(preg_match('`^"[^"]*"`', $a, $m) or preg_match("`^'[^']*'`", $a, $m) or preg_match("`^\s*[^\s\"']+`", $a, $m)){
-    $m = $m[0]; $w = 1; $mode = 0; $a = ltrim(substr_replace($a, '', 0, strlen($m)));
+    $m = $m[0]; $w = 1; $mode = 0; $a = ltrim(substr_replace($a, '', 0, hl_bytes($m)));
     $aA[$nm] = trim(($m[0] == '"' or $m[0] == '\'') ? substr($m, 1, -1) : $m);
    }
   break;
@@ -684,6 +684,20 @@
 // eof
 }
 
+/**
+ * Return the number of bytes of a string, independent of mbstring.func_overload
+ * AND the availability of mbstring
+ *
+ * @param string $str
+ * @return int
+ */
+function hl_bytes($str)
+{
+static $func_overload;
+if (is_null($func_overload)) $func_overload = extension_loaded('mbstring') ? ini_get('mbstring.func_overload') : 0;
+return $func_overload & 2 ? mb_strlen($str,'8bit') : strlen($str);
+}
+

2 (edited by patnaik 2012-06-04 21:29:06)

Re: Suggesting change: handling of multibyte chars in attribute values

Thanks for pointing out this issue. I have looked into it and can think of two other ways besides your suggestion to address it.

One is to leave the htmLawed code unchanged and instead suggest that administrators implement a multibyte-aware 'substr_replace' function for mbstring.func_overload environments, like this one -- http://www.php.net/manual/en/function.substr-replace.php#90146.

The other is to change the two instances of 'substr_replace' in the htmLawed code to use 'substr' instead. E.g.,

//instead of
substr_replace($a, '', 0, strlen($m))

// use
$position_needle = strpos($a, $m);
substr($a, 0, $position_needle). substr($a, $position_needle+strlen($m))

Edited: In my test, both your suggestion and the 'substr' idea increase the htmLawed processing time by ~5%-10%.
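
As a rough sketch, the 'substr' idea can be wrapped in a small helper. The function name remove_first_match is hypothetical and not part of htmLawed. The point is that strpos, substr and strlen are either all overloaded or all not, so offsets and lengths stay in the same unit.

```php
<?php
// Hypothetical helper, not part of htmLawed: cut the first occurrence of $m out of $a.
function remove_first_match($a, $m)
{
    $pos = strpos($a, $m);
    if ($pos === false) {
        return $a; // nothing to remove
    }
    // Offset and length come from the same family of functions,
    // so they agree whether they count bytes or characters.
    return substr($a, 0, $pos) . substr($a, $pos + strlen($m));
}

// Usage with a multibyte value ("äö" as UTF-8 bytes).
$rest = remove_first_match("\"\xc3\xa4\xc3\xb6\" src=\"x\"", "\"\xc3\xa4\xc3\xb6\"");
var_dump($rest);
```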

3

Re: Suggesting change: handling of multibyte chars in attribute values

Changing lines 474-475 to the following seems to be the best option:

if(preg_match('`(^("[^"]*")|(\'[^\']*\')|(\s*[^\s"\']+))(.*)`', $a, $m)){
    $a = ltrim($m[5]); $m = $m[1]; $w = 1; $mode = 0;

I have tested this with some multi-byte attribute values. If possible, can you post some of the problematic inputs that you encountered? Thanks.
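
A stand-alone sketch of the same idea follows: let the regular expression capture the remainder of the string, so that no length has to be computed at all. This is a variant written for illustration, not the code above and not necessarily what htmLawed ships. The function name split_attr_value is hypothetical, the `^` anchor is moved in front of the alternation so that it applies to all three branches, and the `s` flag is added so that the remainder may contain newlines.

```php
<?php
// Hypothetical helper: split a leading attribute value from the rest of the string.
function split_attr_value($a)
{
    // Group 1: double-quoted, single-quoted or unquoted value. Group 2: the remainder.
    if (preg_match('`^("[^"]*"|\'[^\']*\'|\s*[^\s"\']+)(.*)`s', $a, $m)) {
        return array($m[1], ltrim($m[2]));
    }
    return null; // no value found at the start of the string
}

// Usage with a multibyte value ("äö" as UTF-8 bytes).
list($value, $rest) = split_attr_value("\"\xc3\xa4\xc3\xb6\" style=\"height: 462px;\"");
var_dump($value, $rest);
```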

4

Re: Suggesting change: handling of multibyte chars in attribute values

The new, 1.1.11 version of htmLawed attempts to address this multi-byte issue in a fashion similar to the one in my last post. It also has a minor feature enhancement.

5

Re: Suggesting change: handling of multibyte chars in attribute values

Cool, thanks. In answer to yesterday's post, this was one of the inputs:

<img alt="äöüßklautzi" src="/egroupware/webdav.php/home/Familie/Ausländerausweis_KS.jpg" style="height: 462px; width: 350px;" />

6

Re: Suggesting change: handling of multibyte chars in attribute values

Thanks for the input example.

Strangely, in my PHP setup, I do not find any issue with this input when using the older htmLawed version (1.1.10) with default config.

/*Details of the setup:

OS: Mac OS X 10.6.8
Application: MAMP 2.0.5
PHP: version 5.3.6

Multibyte Support - enabled
Multibyte string engine - libmbfl
HTTP input encoding translation - disabled

Multibyte (japanese) regex support - enabled
Multibyte regex (oniguruma) backtrack check - On
Multibyte regex (oniguruma) version - 4.7.1

Directive    Local Value    Master Value
mbstring.detect_order    no value    no value
mbstring.encoding_translation    Off    Off
mbstring.func_overload    7    7
mbstring.http_input    pass    pass
mbstring.http_output    pass    pass
mbstring.http_output_conv_mimetypes    ^(text/|application/xhtml\+xml)    ^(text/|application/xhtml\+xml)
mbstring.internal_encoding    no value    no value
mbstring.language    neutral    neutral
mbstring.strict_detection    Off    Off
mbstring.substitute_character    no value    no value
*/

7

Re: Suggesting change: handling of multibyte chars in attribute values

The URL itself passed through the parsing process fine, but the style attribute that follows was messed up because the remaining string started with

" style="...

so the next parsing step failed (the attribute name style was expected), and thus the resizing failed.

By the way, I noticed that in my setup/environment the ctype_digit check in the self-defined hook_tag function failed to detect closing tags,
because ctype_digit does not return true for integers (the new default for attribute_array in the examples is 0, so this can happen).
I replaced ctype_digit with is_numeric.

    // If second argument is not received, it means a closing tag is being handled
    if(is_numeric($attribute_array)){
        return "</$element>";
    }
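
For context, a minimal sketch of a complete hook with that check is shown below. The function name my_hook_tag and the way the opening tag is rebuilt are assumptions for illustration, not the example from the htmLawed documentation.

```php
<?php
// Hypothetical hook: receives the element name and, for opening tags, its attributes.
function my_hook_tag($element, $attribute_array = 0)
{
    // If the second argument is not received (default 0), a closing tag is being handled.
    // is_numeric() accepts both the integer 0 and the string "0".
    if (is_numeric($attribute_array)) {
        return "</$element>";
    }

    // Opening tag: rebuild it from the attribute name/value pairs.
    $string = '';
    foreach ($attribute_array as $name => $value) {
        $string .= " {$name}=\"{$value}\"";
    }
    return "<{$element}{$string}>";
}

var_dump(my_hook_tag('a'));                            // closing tag
var_dump(my_hook_tag('a', array('href' => 'x.html'))); // opening tag
```

Testing with !is_array($attribute_array) would be another way to tell the two cases apart.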

8

Re: Suggesting change: handling of multibyte chars in attribute values

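Thank you for pointing out the problem with using 'ctype_digit', which returns FALSE for small integers such as the default 0 (an integer between -128 and 255 is interpreted as the ASCII value of a single character). I will edit the documentation to suggest use of 'is_numeric'.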

I still cannot replicate the multi-byte issue, though (or maybe I did not understand the issue). That is, the output I get is what is expected, with no mangling of attribute names/values:

<img alt="äöüßklautzi" src="/egroupware/webdav.php/home/Familie/Ausländerausweis_KS.jpg" style="height: 462px; width: 350px;" />

9

Re: Suggesting change: handling of multibyte chars in attribute values

I do not know why this does not show up; my environment may be slightly different.
If I go back to the previous version, I can reproduce it.
Below is the error log for a tag that is processed correctly, followed by the log for the tag posted earlier.

[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:alt="" src="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> m:Array([0] => alt), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:="" src="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:475 a:"" src="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> m:Array([0] => alt), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:479 setting aA with key:alt->'', referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:480 new a:src="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> switching back to mode0, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:src="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> m:Array([0] => src), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:="/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:475 a:"/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg" style="width: 100px; height: 30px;"-> m:Array([0] => src), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:479 setting aA with key:src->'/egroupware/webdav.php/home/WbDownload/kopf_zeitungen.jpg', referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:480 new a:style="width: 100px; height: 30px;"-> switching back to mode0, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:style="width: 100px; height: 30px;"-> m:Array([0] => style), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:="width: 100px; height: 30px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:475 a:"width: 100px; height: 30px;"-> m:Array([0] => style), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:479 setting aA with key:style->'width: 100px; height: 30px;', referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:480 new a:-> switching back to mode0, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo

The tag posted earlier (broken hl_tag processing):

[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:alt="\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9fklautzi" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> m:Array([0] => alt), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:="\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9fklautzi" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:475 a:"\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9fklautzi" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> m:Array([0] => alt), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:479 setting aA with key:alt->'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9fklautzi', referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:480 new a:tzi" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> switching back to mode0, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:tzi" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> m:Array([0] => tzi), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:" src="/egroupware/webdav.php/home/Familie/Ausl\xc3\xa4nderausweis_KS.jpg" style="height: 462px; width: 350px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:463 a:width: 350px;"-> m:Array([0] => width:), referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo
[Tue Jun 12 09:27:33 2012] [error] [client 10.44.44.94] hl_tag:465 new a:350px;"-> now switching to mode:1, referer: http://10.44.44.93/egroupware/index.php?menuaction=wiki.wiki_ui.view&page=MoreThanThisToo

But as the problem is fixed in the recent version, I do not think putting more effort into it would help. If you insist, though, I can try to figure out the differences in setup compared to what was posted earlier under "Details of the setup".

10

Re: Suggesting change: handling of multibyte chars in attribute values

Thanks for looking into this. No, there is no need to investigate further.