Thanks again for the various comments and suggestions. I have now done a kind of audit of the htmLawed code.
Including those that you mention, there are about 50 uses of PHP's PCRE regular-expression functions ('preg' in the function name) in htmLawed's code (there are no 'ereg' functions). Four such functions are in use: preg_match, preg_replace, preg_replace_callback and preg_split. All instances deal with parts of the input text that are HTML attribute names, HTML element names, HTML entities, URL protocols (like http), whitespace, other low ASCII characters (like NUL), or values set by the administrator for configurable htmLawed parameters. That is, the characters looked for all have ASCII decimal code-points below 128.
Therefore, their functionality is encoding-independent and should be maintained if, instead of traditional ASCII (7-bit, 128 code-points), the text uses a national variant of ASCII (such as ISO-646-DE/German of the ISO 646 standard, or the GB-Roman (Chinese) and KS-Roman (Korean) sets), an 8-bit extended-ASCII encoding with 256 code-points (such as ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), an ISO 8859-based Windows variant (such as Windows 1252), Shift JIS, or an ASCII-compatible multi-byte encoding like UTF-8 (Unicode), EUC-CN (Chinese) or EUC-TW (Taiwanese). With these encodings, the same correspondence exists between characters and code-points for the specific characters that htmLawed bases its logic on (such as 0-9, a-z, ':', '<', '>', and '&'). EBCDIC does not belong in this group, because it assigns those characters different byte values. UTF-7 is only a partial fit: all of its bytes are in the ASCII range, but it may encode characters such as '<', '>' and '&' as base64 runs instead of writing them directly.
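As a minimal sketch of why this holds for UTF-8: every byte of a multi-byte UTF-8 character is 0x80 or higher, so a byte-wise search for '<', '>' or '&' cannot land inside such a character. The helper name ascii_bytes below is made up for illustration and is not part of htmLawed.

```php
<?php
// Hypothetical helper (not part of htmLawed): keep only the bytes
// below 0x80, i.e. the only bytes a search for ASCII characters can match.
function ascii_bytes(string $s): string {
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        if (ord($s[$i]) < 0x80) {
            $out .= $s[$i];
        }
    }
    return $out;
}

// An opening tag whose attribute value holds "café 日本" in UTF-8.
$utf8 = "<p title=\"caf\xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC\">";

// The non-ASCII characters contribute no bytes in the ASCII range.
echo ascii_bytes($utf8), "\n"; // prints: <p title="caf ">

// The only '>' byte in the string is the real closing bracket.
var_dump(strpos($utf8, '>') === strlen($utf8) - 1); // bool(true)
```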
There are about 100 instances of use of PHP's string functions ('str' in function name) in htmLawed's code. There are nine such functions in use: str_repeat, str_replace, strcspn, strlen, strpos, strtolower, substr, substr_replace, strtr. I am sure that character encoding is not an issue for any of them, because of the nature of the arguments received by the functions (for purposes such as those mentioned above).
E.g., with the string functions in the code snippet in point #2 of your last post, an HTML element's name is being identified from an opening tag. The tag may have attribute name-value pair(s), so the element name is taken as the sub-string before the first occurrence of a space character. That sub-string, being an element name, can only contain the characters a-z and 1-6. Thus, to answer your specific question in point #2, there should not be an issue if a non-(US) ASCII encoding like those listed above (including UTF-8) is used.
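To illustrate the idea (this is a sketch written for this post, not the actual htmLawed code, and the function name element_name is an assumption), the element name can be pulled out with byte-wise string functions even when an attribute value holds UTF-8 text:

```php
<?php
// Hypothetical helper: return the lower-cased element name of an opening tag.
// Assumes $tag starts with '<' and ends with '>'.
function element_name(string $tag): string {
    $inner = substr($tag, 1, -1);          // drop the surrounding '<' and '>'
    $pos = strpos($inner, ' ');            // first space ends the element name
    $name = ($pos === false) ? $inner : substr($inner, 0, $pos);
    return strtolower($name);              // name holds only a-z and 1-6
}

// The attribute value contains "Über" in UTF-8; the name is still found.
echo element_name("<H2 title=\"\xC3\x9Cber\">"), "\n"; // prints: h2
echo element_name('<br>'), "\n";                       // prints: br
```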
Similarly, there should not be an issue with the string functions in the code snippet in point #3 of your last post. Where that code is used, a potential attribute name (which by definition can only contain a-z, colon and hyphen characters, and by logic can only be at the beginning of the string) has been identified, and the string is being shortened by removing the attribute name from its front.
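A rough sketch of that step, again not the htmLawed code itself (split_attribute_name is a name invented here): because the matched name consists only of ASCII bytes, strlen gives the exact number of bytes to cut, whatever follows it.

```php
<?php
// Hypothetical helper: split a leading attribute name from the rest of the string.
// Returns array(name, remainder); name is '' when the string does not start with one.
function split_attribute_name(string $s): array {
    if (!preg_match('/^[a-z:-]+/i', $s, $m)) {
        return array('', $s);
    }
    // The name is pure ASCII, so its byte length equals its character length.
    return array(strtolower($m[0]), substr($s, strlen($m[0])));
}

// The value holds two Chinese characters in UTF-8; they are left untouched.
list($name, $rest) = split_attribute_name("xml:lang=\"\xE4\xB8\xAD\xE6\x96\x87\"");
echo $name, "\n";          // prints: xml:lang
echo bin2hex($rest), "\n"; // prints: 3d22e4b8ade6968722
```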
--
Overall, my impression is that there is no need to use PHP's multi-byte string functions, or allow for their use when available, unless encoding-related issues are discovered and there is no other good option (unlikely).
There are likely to be issues with ASCII-incompatible encodings: those built on two 8-bit bytes, such as UTF-16 (Unicode), Big Five (Chinese), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), and those built on four 8-bit bytes, such as UTF-32 (Unicode). Here are some examples of the instances where such issues can arise (line numbers are those of htmLawed.php, version 1.1.14):
Line 502: $v = strtr($v, $sC);
Line 507: $v = str_replace("\xad", ' ', (strpos($v, '&') !== false ? str_replace(array('&#xad;', '&#173;', '&shy;'), ' ', $v) : $v));
Line 649: return str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x07"), array('<', '>', "\n", "\r", "\t", ' '), $t);
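To make the risk concrete, here is a small sketch that wraps the statement quoted from line 649 in a function (the name restore_placeholders is invented for this example) and feeds it one character, U+0107 (ć), first as UTF-16BE and then as UTF-8:

```php
<?php
// The byte-wise replacement quoted from line 649, wrapped in a
// hypothetical function so that it can be called on sample input.
function restore_placeholders(string $t): string {
    return str_replace(
        array("\x01", "\x02", "\x03", "\x04", "\x05", "\x07"),
        array('<', '>', "\n", "\r", "\t", ' '),
        $t
    );
}

$utf16be = "\x01\x07"; // U+0107 in UTF-16BE: both bytes are in the search list
$utf8    = "\xC4\x87"; // the same character in UTF-8: both bytes are >= 0x80

echo bin2hex(restore_placeholders($utf16be)), "\n"; // prints: 3c20 (the character became "< ")
echo bin2hex(restore_placeholders($utf8)), "\n";    // prints: c487 (unchanged)
```

In UTF-16 the two bytes of a single character can coincide with the low ASCII bytes that the code searches for, which is why such input can be corrupted, while the UTF-8 form passes through untouched.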
Detecting character encoding, identifying encoding issues in input, and correcting them are an entirely different kind of text filtering that I think should be kept separate from htmLawed. But I will add to the documentation the advice that administrators check the input for well-formedness (using code suggested by others, such as on the web-page that you cite) before passing it to htmLawed. It may also be useful to suggest temporarily converting the input text to UTF-8 if it is in an encoding like UTF-16 whose handling by htmLawed one is not absolutely sure of.
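One possible shape for such a pre-processing step is sketched below. It assumes the iconv extension is available and that the administrator already knows the input's encoding. The function names are invented for this example, and the filter is passed in as a callable so that the sketch does not depend on htmLawed itself.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// preg_match with the /u modifier fails on malformed UTF-8 input.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: convert to UTF-8, run the filter, convert back.
// $encoding must be an explicit name such as 'UTF-16BE' (an assumption
// of this sketch; encoding detection is deliberately left out).
function filter_as_utf8(string $input, string $encoding, callable $filter): string {
    $utf8 = @iconv($encoding, 'UTF-8', $input);
    if ($utf8 === false || !is_valid_utf8($utf8)) {
        throw new InvalidArgumentException("Input is not well-formed $encoding");
    }
    $clean = $filter($utf8);
    $out = @iconv('UTF-8', $encoding, $clean);
    if ($out === false) {
        throw new RuntimeException("Filtered text cannot be encoded as $encoding");
    }
    return $out;
}

// Usage with a stand-in filter; in practice the callable would wrap htmLawed.
$utf16be = "\x00<\x00b\x00>"; // "<b>" in UTF-16BE
echo bin2hex(filter_as_utf8($utf16be, 'UTF-16BE', 'strtoupper')), "\n";
```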