1

Topic: Problems with UTF-8?

Hello,
I'm wondering whether htmLawed can have problems, such as content corruption, when handling UTF-8 content. According to this page, http://www.phpwact.org/php/i18n/utf-8 (quite old, but I think still useful), some functions can produce bad results with UTF-8 content, and they don't have an mbstring equivalent.

1) Some functions, such as explode and str_replace, seem to be OK if the content is well formed, so the first question is: does htmLawed check UTF-8 for well-formedness?

2) Other functions without an mbstring equivalent can have problems even with well-formed UTF-8, and they are used in htmLawed, e.g.:
- substr_replace
- preg_replace with \w or \b
- strcspn

In particular here

$y = !$x ? ltrim($e, '/') : ($x > 0 ? substr($e, 0, strcspn($e, ' ')) : 0);

I think there could be a problem if the mbstring overload is ON because substr would count characters and strcspn would count bytes.

Thanks!

2

Re: Problems with UTF-8?

No, htmLawed does not check for well-formedness of UTF-8 (or other character encoding). I feel this is something beyond a text filter's purview.

With well-formed UTF-8, I have tested a variety of inputs without noticing an issue. Only once have I heard of a possible bug because of the use of a function without an mbstring equivalent (I could not reproduce the issue though; see this topic).

So, it may be that, theoretically, htmLawed will have issues with some types of UTF-8 input. But I do not plan to modify the htmLawed code until I can see/reproduce such an issue.

3

Re: Problems with UTF-8?

Thanks for the reply.

About well-formedness: I don't think it is very difficult to check. I think there are simple methods (at least for UTF-8, which is very common) that you can apply before any other operations. Just a suggestion.
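
A minimal sketch of such a check, assuming PCRE was built with UTF-8 support. The helper name is hypothetical and not part of htmLawed; it relies on the fact that preg_match with the 'u' modifier fails on a malformed UTF-8 subject.

```php
<?php
// Hypothetical helper, not part of htmLawed: reports whether $text is well-formed UTF-8.
function is_well_formed_utf8($text)
{
    // With the 'u' modifier PCRE validates the subject as UTF-8 first;
    // preg_match() returns false (not 0) when the subject is malformed.
    return preg_match('//u', $text) === 1;
}

var_dump(is_well_formed_utf8("caf\xC3\xA9")); // valid two-byte sequence for 'é'
var_dump(is_well_formed_utf8("caf\xE9"));     // lone Latin-1 byte, malformed in UTF-8
```

Input that fails such a check could be rejected or repaired before it is passed to the filter.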

Besides well-formedness, I think there could be two kinds of problems in general:

1) If you are parsing UTF-8 content, does htmLawed work even though it uses the standard string functions? I mean, strlen, strpos, strtolower and substr can't correctly handle multibyte text; they need the corresponding mb_strlen, mb_strpos and so on. I don't know the library well enough to come up with an example that causes problems, but is there any reason why you don't use the mbstring functions (with a fallback to the standard ones in case mbstring is not available)? Here I'm talking about a situation where mbstring overload is not enabled.

2) Regardless of 1), if for some reason the user has enabled mbstring overload, combining functions that work at the byte level with functions that work at the character level (as in the example I posted) is certainly risky. Again, I don't know the library well enough to produce a real-world example, but for substr used together with strcspn, a simple example like this one highlights what can happen:

$e = 'ユニコード ユニコード';
echo '<p>'.substr($e, 0, strcspn($e, ' ')); // normal behaviour
echo '<p>'.mb_substr($e, 0, strcspn($e, ' '), 'utf-8'); // overload behaviour
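
To make the numbers concrete, here is a small sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded. Each katakana character takes 3 bytes, so strcspn reports a byte offset of 15 while the first word is only 5 characters long.

```php
<?php
$e = 'ユニコード ユニコード';   // 5 characters, a space, 5 characters; 31 bytes in UTF-8

$bytes = strcspn($e, ' ');   // byte offset of the first space: 5 * 3 = 15

// Byte offset used with a byte-based function: the first word only.
$byteCut = substr($e, 0, $bytes);

// Same byte offset used as a character count: 15 exceeds the 11 characters
// in the string, so the whole string comes back.
$charCut = mb_substr($e, 0, $bytes, 'UTF-8');

var_dump($bytes, $byteCut === 'ユニコード', $charCut === $e);
```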

About regular expressions: I'm checking an old version and I'm not sure, but I think that this one:

$t = str_replace(' </', '</', preg_replace(array('`(<\w[^>]*(?<!/)>)\s+`', '`\s+`', '`(<\w[^>]*(?<!/)>) `'), 
array(' $1', ' ', '$1'), preg_replace_callback(array('`(<(!\[CDATA\[))(.+?)(\]\]>)`sm', '`(<(!--))(.+?)(-->)`sm', 
'`(<(pre|script|textarea)[^>]*?>)(.+?)(</\2>)`sm'), create_function('$m', 'return $m[1]. str_replace(array("<", ">", 
"\n", "\r", "\t", " "), array("\x01", "\x02", "\x03", "\x04", "\x05", "\x07"), $m[3]). $m[4];'), $t)));

could be the only one that can cause a problem; what do you think? Is there any way to avoid the use of \w?

Finally, about the topic you mentioned, I think there is another use of substr_replace + strlen which can be dangerous, am I right?

$w = $mode = 1; $a = ltrim(substr_replace($a, '', 0, strlen($m[0])));

Thanks again and sorry for the long message.

4

Re: Problems with UTF-8?

Thank you for the detailed feedback. I will post my thoughts soon.

5

Re: Problems with UTF-8?

I have tested htmLawed with a variety of test inputs that have non-English characters from a number of languages/scripts encoded in UTF-8. In recent days, this has been with PHP 5.3 with mbstring.func_overload = 0.

Here's an example input (the characters may not display properly, depending on the browser, OS, etc.):

<ruby xml:lang="ja">
  <rbc>
    <rb>斎</rb>
    <rb>藤</rb>
    <rb>信</rb>
    <rb>男</rb>
  </rbc>
  <rtc class="reading">
    <rt>さい</rt>
    <rt>とう</rt>
    <rt>のぶ</rt>
    <rt>お</rt>
  </rtc>
</ruby>

᚛᚛ᚉᚑᚅ <img src="s" alt="ユニコード ユニコード" />
<img src="s" alt="jeść szkło" /> €

<br /><hr /><table><tr><td>

<div title="  სტრაცია">  სტრაცია<!-- comment אני יכול לאכול זכוכית וזה לא מזיק לי.-->
<p>Inscrieţi-vă acum la a Zecea Conferinţă</p>
<ol><li><a name="อ.อ่าง  ">อ.อ่าง</a><a name="  Кругом шумел.">Кругом шумел.</a></li></ol>

</div></td><td>
<div title="⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀">⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀
<p>أنا قادر على أكل الزجاج و هذا لا يؤلمني</p>
<ol><li><a name=" 能吞下玻璃而不 ">能吞下玻璃而不</a><a name="나는 유">나는 유</a></li></ol>

</div></td><td>
<div title=" хортой биш"> хортой биш
<p>أنا قادر على أكل الزجاج و هذا لا يؤلمني</p>
<ol><li><a name=" ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ">ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ</a><a name="Կրնամ ապակի ուտել   
">Կրնամ ապակի ուտել</a></li></ol>

</div></td></tr></table>


Inscrieţi-vă acum la a Zecea Conferinţă Internaţională
გთხოვთ ახლავე გაიაროთ რეგისტრაცია
večjezično računalništvo
на Десятую Международную Конференцию
see it <hi>

I use the htmLawed test page, and compare input and output by eye and with the 'Diff' tool of the test page. I do not see any issue as yet. I have manipulated the input so that there are errors which htmLawed has to correct (such as white-spaces flanking attribute values), and I have tried different htmLawed configurations (e.g., to enable 'tidying' that triggers use of single-byte string functions).

I will continue to test. It is very much possible that it's a matter of time before I encounter an input that htmLawed will fail to handle correctly.

There are at least some parts of htmLawed where, logically, the use of a single-byte function, or '\w', will not cause a problem. Thus, regarding the '\w' you talk of in your last post, the regular expression pattern is to look for HTML tags and names of HTML elements can only start with a character from A-Z.
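
A small sketch of why '\w' is harmless in a pattern of that kind, assuming the "C" locale for LC_CTYPE (set explicitly below) and no 'u' modifier: '\w' then matches only ASCII letters, digits and the underscore, so a '<' followed by a multibyte character is never taken for a tag. The pattern is a simplified stand-in and the sample strings are made up.

```php
<?php
// Assumption: PCRE character tables follow the "C" locale, so set it explicitly.
setlocale(LC_CTYPE, 'C');

$pattern = '`<\w[^>]*>`';   // simplified tag matcher, no 'u' modifier

// An ASCII element name right after '<' is matched as a tag.
var_dump(preg_match($pattern, '<p title="ユニコード">'));   // int(1)

// A multibyte character right after '<' is not a word character here,
// so the text is not treated as a tag.
var_dump(preg_match($pattern, '<ユニコード>'));            // int(0)
```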

--

At least in the near term I don't intend to add to htmLawed the ability to check/correct character well-formedness. Again, I am of the opinion that this is beyond htmLawed's role. Besides this, there are also a number of encodings to consider. There will be significant effects on htmLawed's code-size and speed. But, thanks for suggesting this; I know that inclusion of such a feature will be appreciated by some.

6

Re: Problems with UTF-8?

Hi,

I looked in more detail at the very latest version; here are my two cents about making it UTF-8 compatible.

1) Regex: of the problematic parameters (u i w W b B), the ones you have used seem to be i (case insensitive) and w (word), but since (I think) you always use ASCII characters in the patterns, I guess there is no problem. There are just a few cases I'm not sure about, because they involve variables or for other reasons:

break; case 'match': if(!preg_match($v, $t)){$o = 0;}
break; case 'nomatch': if(preg_match($v, $t)){$o = 0;}

preg_match('`^([a-z\d\-+.&#; ]+?)(:|&#(58|x3a);|%3a|\\\\0{0,4}3a).`i', $p, $m)

preg_match($p, '');

if(!empty($r1) && preg_match($r1, $v)){continue;}

if(!empty($r0) && preg_match($r0, $v)){

Can you confirm that in these cases there are just ASCII characters and/or no problematic parameters?

2) strcspn can be risky, but it can be changed to preg_match + substr:

$y = !$x ? ltrim($e, '/') : ($x > 0 ? substr($e, 0, strcspn($e, ' ')) : 0);

3) Here you can just do what you did in the other case: change substr_replace to strpos + substr to be safe.

$w = $mode = 1; $a = ltrim(substr_replace($a, '', 0, strlen($m[0])));

4) All the string functions you used which have an mbstring equivalent (substr, strtolower, strlen and strpos) can simply be changed to substr_custom, strtolower_custom and so on. The _custom functions would contain just something like:

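// Body of substr_custom($input, $start, $length): use mbstring when available.
if (function_exists('mb_substr')) {
    return mb_substr($input, $start, $length, 'UTF-8');
}
else {
    // Fall back to the byte-based function.
    return substr($input, $start, $length);
}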

Of course, this means that without mbstring there could still be problems, theoretically.

Finally, there is the well-formedness problem; here: http://www.phpwact.org/php/i18n/charsets#checking_utf-8_for_well_formedness
there are some solutions for checking it. You are right when you say there are other encodings too, but by supporting UTF-8 you can cover the vast majority of modern web applications, and I guess more and more in the future.

7

Re: Problems with UTF-8?

Thanks again for the various comments and suggestions. I have now done a kind of audit of the htmLawed code.

Including those that you mention, there are about 50 instances of use of PHP's PCRE regular expression-related functions ('preg' in function name) in htmLawed's code (there are no 'ereg' functions). There are four such functions in use: preg_match, preg_replace, preg_replace_callback and preg_split. All instances deal with parts of input text for HTML attribute names, HTML element names, HTML entities, URL protocols (like http), white spaces, other low ASCII characters (like NUL), and values set by the administrator for configurable htmLawed parameters. That is, the characters looked for are all below ASCII decimal code-point 128.
    Therefore, their functionality is encoding-independent and should be maintained if, instead of traditional ASCII (7-bit bytes for 128 code-points), one of the following is used: encodings that use 8-bit bytes for 256 code-points, such as national variants of ASCII (such as ISO-646-DE/German of the ISO 646 standard), extended ASCII variants (such as ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), ISO 8859-based Windows variants (such as Windows 1252), EBCDIC, Shift JIS, GB-Roman (Chinese), and KS-Roman (Korean); or ASCII-compatible encodings like UTF-7 (Unicode), UTF-8 (Unicode), EUC-CN (Chinese) and EUC-TW (Taiwanese). With these encodings, the same correspondence exists between characters and code-points for the specific characters that htmLawed bases its logic on (such as 0-9, a-z, ':', '<', '>', and '&').

There are about 100 instances of use of PHP's string functions ('str' in function name) in htmLawed's code. There are nine such functions in use: str_repeat, str_replace, strcspn, strlen, strpos, strtolower, substr, substr_replace, strtr. I am sure that character encoding is not an issue for any of them, because of the nature of the arguments received by the functions (for purposes such as those mentioned above).

E.g., with the string functions in the code snippet in point #2 of your last post, an HTML element's name is being identified from an opening tag. The tag may have attribute name-value pair(s). So the element is identified as the sub-string before the first occurrence of a space character. The sub-string, being an element name, can only contain the letters a-z and the digits 1-6. Thus, to answer your specific question in point #2, there should not be an issue if a non-(US) ASCII encoding like those listed above is used (including UTF-8).
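
A brief sketch of that reasoning, using a made-up tag and assuming the source file is saved as UTF-8: because everything before the first space is single-byte ASCII, the byte offset from strcspn equals the character offset, so a byte-based cut and a character-based cut agree even when multibyte text follows.

```php
<?php
// Content of an opening tag without the angle brackets (made-up sample).
$e = 'div title="ユニコード ユニコード"';

$offset = strcspn($e, ' ');          // 3: 'd', 'i', 'v' are one byte each
$byByte = substr($e, 0, $offset);    // byte-based cut

var_dump($offset, $byByte);

// With an ASCII-only prefix a character-based cut agrees with the byte-based one.
if (function_exists('mb_substr')) {
    var_dump(mb_substr($e, 0, $offset, 'UTF-8') === $byByte);
}
```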

Similarly, there should not be an issue with the string functions in the code snippet in point #3 of your last post. In the instance where it is used, a potential attribute name (which by definition can only have a-z, colon and hyphen characters, and by logic can only be at the beginning of a string) has been identified and the string is being shortened by removing the attribute name at its front.

--

Overall, my impression is that there is no need to use PHP's multi-byte string functions, or allow for their use when available, unless encoding-related issues are discovered and there is no other good option (unlikely).

There are likely to be issues with ASCII-incompatible encodings, such as those that use two 8-bit bytes per unit, like UTF-16 (Unicode), Big Five (Chinese), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), and UTF-32 (Unicode), which uses four. Here are some examples of instances where such issues can arise (line numbers are from htmLawed.php, version 1.1.14):

Line 502:  $v = strtr($v, $sC);

Line 507:  $v = str_replace("\xad", ' ', (strpos($v, '&') !== false ? str_replace(array('&#xad;', '&#173;', '&shy;'), ' ', $v) : $v));

Line 649:  return str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x07"), array('<', '>', "\n", "\r", "\t", ' '), $t);
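
To illustrate with a soft-hyphen replacement in the style of line 507, here is a sketch with a made-up UTF-16BE input: the character U+4EAD is stored as the bytes 0x4E 0xAD, and a byte-level replacement of "\xad" silently turns it into a different character.

```php
<?php
// U+4EAD encoded as UTF-16BE: two bytes, the second of which happens to be 0xAD.
$utf16 = "\x4E\xAD";

// Byte-level replacement in the style of line 507, which targets the
// single-byte soft hyphen of Latin-1-like encodings.
$out = str_replace("\xad", ' ', $utf16);

echo bin2hex($utf16), ' -> ', bin2hex($out), "\n";   // 4ead -> 4e20
```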

Detecting character encoding, identifying encoding issues in input, and correcting them is an entirely different kind of text filtering that I think should be kept separate from htmLawed. But, I will add to documentation the advice that administrators check the input for well-formedness (using code suggested by others such as on the web-page that you cite) before passing it to htmLawed. It may also be useful to suggest temporarily converting input text's encoding to UTF-8 if the text is in an encoding like UTF-16 whose handling by htmLawed one is not absolutely sure of.
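
A rough sketch of that temporary conversion, assuming the mbstring extension is available and the input encoding is already known. The wrapper name and the $filter parameter are hypothetical; $filter stands in for whatever call does the filtering.

```php
<?php
// Hypothetical wrapper: $filter stands in for a call such as htmLawed($text).
function filter_via_utf8($text, $encoding, $filter)
{
    // Convert to UTF-8 so that the filter only sees an ASCII-compatible encoding.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $encoding);
    $clean = call_user_func($filter, $utf8);
    // Convert the filtered text back to the original encoding.
    return mb_convert_encoding($clean, $encoding, 'UTF-8');
}

// Usage with a stand-in filter; replace 'trim' with the real filtering call.
$input  = mb_convert_encoding(' <p>ユニコード</p> ', 'UTF-16LE', 'UTF-8');
$output = filter_via_utf8($input, 'UTF-16LE', 'trim');
var_dump(mb_convert_encoding($output, 'UTF-8', 'UTF-16LE'));
```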

8

Re: Problems with UTF-8?

Patnaik, thanks a lot for your detailed feedback!