Thanks for clarifying. I will have to think about whether this issue requires modifying htmLawed or whether it is something that site admins should handle outside htmLawed.
Currently, htmLawed first (1) identifies potential HTML tags in the htmLawed() function, using a regex pattern that essentially looks for '<...>'. The '<...>' blocks are passed to the hl_tag() function, which first checks (2) whether they are likely to be HTML tags, using the regex pattern you refer to in your first post. If so, htmLawed infers the HTML element name and (3) checks whether the element is allowed. If it is not, the tag is 'neutralized' with character entities, removed, etc., depending on the 'keep_bad' config. parameter.
If the check at (2) fails, htmLawed just neutralizes the tag, regardless of the 'keep_bad' value. This is what happens with the 'xml:namespace' input. Maybe I should modify htmLawed to apply a 'keep_bad' effect to cases that fail at (2).
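To make the failure at (2) concrete, here is a small standalone sketch that runs the hl_tag() pattern against two sample blocks. The sample inputs are my own assumptions (the exact 'xml:namespace' markup in your text may differ), and the snippet does not call htmLawed itself.

```php
<?php
// Pattern copied from the check at the beginning of hl_tag().
$pattern = '`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m';

// Hypothetical sample inputs, not taken from the original report.
$ordinary = '<p class="x">';
$xmlns = '<?xml:namespace prefix = o />';

// An ordinary tag passes check (2); the element name is captured in $m[2].
$ok = preg_match($pattern, $ordinary, $m);
$name = $ok ? $m[2] : null;

// The '?' right after '<' is not a letter, so check (2) fails here.
$fails = preg_match($pattern, $xmlns);

// What the failing branch returns for such a block, whatever 'keep_bad' is.
$neutralized = str_replace(array('<', '>'), array('&lt;', '&gt;'), $xmlns);
echo $neutralized, "\n";
```

Because the second block never reaches step (3), the 'keep_bad' setting has no say in how it is handled.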
With the regex pattern change you suggest, 'xml:namespace' passes (2) and gets subjected to a 'keep_bad' effect.
So, yes, the code modification you suggest should work for the 'xml:namespace' issue. (I haven't tested it.)
A different modification to handle this issue (its logic is alluded to above) would be to change this code at the beginning of hl_tag():
if(!preg_match('`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){
return str_replace(array('<', '>'), array('&lt;', '&gt;'), $t);
to:
if(!preg_match('`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){
return (($C['keep_bad']%2) ? str_replace(array('<', '>'), array('&lt;', '&gt;'), $t) : '');
(I have not tested this modification either.)
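To show what the '%2' test in that modification would do, here is a minimal standalone sketch. The function name neutralize_or_drop() is hypothetical and is not part of htmLawed; it only mirrors the modified return expression, with the 'keep_bad' value passed in directly instead of being read from $C.

```php
<?php
// Hypothetical stand-in for the modified branch; not part of htmLawed.
function neutralize_or_drop($t, $keep_bad) {
    // Odd 'keep_bad' values neutralize the block; even values drop it.
    return ($keep_bad % 2)
        ? str_replace(array('<', '>'), array('&lt;', '&gt;'), $t)
        : '';
}

// Hypothetical sample block that fails check (2).
$block = '<?xml:namespace prefix = o />';
foreach (array(0, 1, 2) as $kb) {
    echo $kb, ': ', var_export(neutralize_or_drop($block, $kb), true), "\n";
}
```

So, under this change, a block that fails at (2) would be removed for even 'keep_bad' values and shown as entities for odd ones.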
There are two other approaches, similar to each other, which do not require htmLawed modification. One is to just remove the 'xml:namespace' blocks before passing the text to htmLawed:
$text = preg_replace('`<\?xml:namespace[^>]+>`i', '', $text);
$text = htmLawed($text...);
The other approach is to use the 'hook' config. parameter. If this parameter is set, htmLawed passes the input text to the function named by 'hook'. That function can simply return the text without the 'xml:namespace' blocks, using logic like the one above.
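A rough, untested sketch of such a hook function follows. The name strip_xml_namespace() is hypothetical, and the signature (the text plus optional config and spec arguments) is my assumption, so please check it against the documentation for your htmLawed version. The pattern escapes the '?' so that it matches a literal '<?'.

```php
<?php
// Hypothetical hook function; the argument list is an assumption.
function strip_xml_namespace($text, $config = array(), $spec = array()) {
    // Remove '<?xml:namespace ...>' blocks before htmLawed's main work.
    return preg_replace('`<\?xml:namespace[^>]+>`i', '', $text);
}

// Name the function in the 'hook' config. parameter.
$config = array('hook' => 'strip_xml_namespace');

// Hypothetical sample input.
$in = '<?xml:namespace prefix = o /><p>Hello</p>';

// Guarded so the snippet also runs when htmLawed is not loaded.
$out = function_exists('htmLawed')
    ? htmLawed($in, $config)
    : strip_xml_namespace($in);
echo $out, "\n";
```

Compared with calling preg_replace() before every htmLawed() call, the hook keeps the clean-up in one place, next to the rest of the config.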