1 (edited by lriggle 2012-07-11 14:28:20)

Topic: Handling special tags like <?xml

Hi guys, I haven't been able to find anything in the search results as yet, but I'm wondering if anyone has had a need to handle tags like <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />. Specifically, removing this sort of thing.

I've worked around it in the code I'm running, but I wasn't sure if it was handled somewhere else I wasn't noticing, or if there was a better place to modify it. I currently have my workaround located in hl_tag(). I've modified the first regex statement from

if(!preg_match('`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){

to

if(!preg_match('`^<(/?)([a-zA-Z\?][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){

Thanks!

2

Re: Handling special tags like <?xml

htmLawed can handle only (X)HTML elements meant to be within the body sections of HTML documents. The input example you provide -- <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /> -- appears to be something which one does not expect to encounter in an HTML body.

The pattern htmLawed uses to recognize element names excludes the question mark (?) and colon (:) characters. Therefore, '?xml:namespace' is not recognized.

Changing the regex in function hl_tag() [line 414] will not be enough to have '?xml:namespace' recognized. One will also need to put in support for recognition of the 'ns' attribute, etc.

Have a look at this section in the htmLawed documentation. Maybe it will help solve the issue.

3 (edited by lriggle 2012-07-11 17:43:59)

Re: Handling special tags like <?xml

The <?xml tag in question tends to show up when content is copied and pasted out of MS Word (primarily older versions) into a poorly constructed WYSIWYG editor. For example, if you view the source code of http://schools.nyc.gov/schoolportals/17/k375/default.htm, you'll see a number of instances of <?xml:namespace throughout the code, both above and below <body>.

If you do a google search for "<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />", you'll notice a lot of other pages that have it in there as well, usually with the < and > replaced with &lt; and &gt; for whatever reason.

I agree that if I want to be able to modify the tag attributes, additional editing needs to be done for my workaround to be a viable solution, but as most use-cases would want these tags stripped, leaving the element out of the primary element array ensures that my project's custom default settings of 'never keep bad tags' will always force it to be removed.

My original question still stands though, since my solution above was implemented purely as a quick-and-dirty workaround: is there a better place/way to implement a solution that strips these kinds of tags out?

4

Re: Handling special tags like <?xml

Thanks for clarifying. I will have to think about whether this issue requires modifying htmLawed or whether it is something that should be left for site admins to handle outside htmLawed.

Currently, htmLawed first (1) identifies potential HTML tags in the htmLawed() function, using a regex pattern that essentially looks for '<...>'. The '<...>' blocks are passed to the hl_tag() function, which first checks (2) whether they are likely to be HTML tags, using the regex pattern you refer to in your first post. If so, htmLawed infers the HTML element name and (3) checks whether the element is allowed. If not, the tag is 'neutralized' with character entities, removed, etc., depending on the 'keep_bad' config. parameter.

If the check at (2) fails, htmLawed just neutralizes the tag, regardless of the 'keep_bad' value. This is what happens with the 'xml:namespace' input. Maybe I should modify htmLawed to incorporate a 'keep_bad' effect for cases that fail at (2).

With the regex pattern change you suggest, 'xml:namespace' passes (2) and gets subjected to a 'keep_bad' effect.

So, yes, the code modification you suggest should work for the 'xml:namespace' issue. (I haven't tested it.)
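
For anyone who wants to check the regex part in isolation, here is a minimal sketch that runs only the two patterns quoted in this thread against the sample tag. It does not run htmLawed itself, so it says nothing about what happens after check (2); the variable names are just for illustration.

```php
<?php
// Patterns as quoted in this thread: the original from hl_tag(), and the
// variant from the first post that also accepts '?' as the first name character.
$original = '`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m';
$modified = '`^<(/?)([a-zA-Z\?][a-zA-Z1-6]*)([^>]*?)\s?>$`m';

$t = '<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />';

$matchesOriginal = preg_match($original, $t, $m1);
$matchesModified = preg_match($modified, $t, $m2);

echo 'original matches: ', $matchesOriginal, "\n"; // 0, so check (2) fails
echo 'modified matches: ', $matchesModified, "\n"; // 1, so check (2) passes
echo 'inferred name: ', $m2[2], "\n";              // ?xml
```

The inferred name stops at the colon, so the 'element' seen at step (3) is '?xml', which is not an allowed element name.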

A different modification to handle this issue (whose logic is alluded to above) would be to change this code at the beginning of hl_tag():

if(!preg_match('`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){
  return str_replace(array('<', '>'), array('&lt;', '&gt;'), $t);

to:

if(!preg_match('`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m', $t, $m)){
  return (($C['keep_bad']%2) ? str_replace(array('<', '>'), array('&lt;', '&gt;'), $t) : '');

(I have not tested this modification either.)
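
To see what that return expression does on its own, here is a small standalone sketch. The function name handle_unrecognized_block() is hypothetical and is not part of htmLawed; it only mirrors the modified return line, taking the 'keep_bad' value as a plain argument instead of reading $C.

```php
<?php
// Hypothetical helper (not an htmLawed function) mirroring the modified
// return line: an odd keep_bad value neutralizes the block with character
// entities, an even value drops the block entirely.
function handle_unrecognized_block($t, $keep_bad) {
    if ($keep_bad % 2) {
        return str_replace(array('<', '>'), array('&lt;', '&gt;'), $t);
    }
    return '';
}

$t = '<?xml:namespace prefix = o />';
echo '[', handle_unrecognized_block($t, 1), "]\n"; // block neutralized
echo '[', handle_unrecognized_block($t, 2), "]\n"; // block removed
```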

There are two other approaches, similar to each other, which do not require htmLawed modification. One is to just remove the 'xml:namespace' blocks before passing the text to htmLawed:

$text = preg_replace('`<\?xml:namespace[^>]+>`i', '', $text); // '?' is escaped so that it is matched literally
$text = htmLawed($text...);

The other approach is to do this through the 'hook' config. parameter. If this parameter is declared, then htmLawed will pass input text to the function named by 'hook'. This function can simply return the text without 'xml:namespace' blocks, using logic like the one above.
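
A rough sketch of such a hook function follows. The name strip_xml_namespace() is hypothetical, and the parameter list (text, then config and spec arrays) is my assumption about how the hook gets called, so please check it against the htmLawed documentation. The '?' in the pattern is escaped so that it is matched literally rather than making the '<' optional.

```php
<?php
// Hypothetical hook function. Assumption: htmLawed calls the hook with the
// input text (plus config and spec arrays) and uses the returned string.
function strip_xml_namespace($text, $config = array(), $spec = array()) {
    // Remove '<?xml:namespace ...>' blocks, case-insensitively.
    return preg_replace('`<\?xml:namespace[^>]+>`i', '', $text);
}

$in = '<p>Hi<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /> there</p>';
$out = strip_xml_namespace($in);
echo $out, "\n"; // <p>Hi there</p>

// With htmLawed loaded, the function would be registered by name, e.g.:
// $clean = htmLawed($in, array('hook' => 'strip_xml_namespace'));
```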

5

Re: Handling special tags like <?xml

Thanks for the additional suggestions on how to handle this! I'm going to dive into it and figure out which method will best meet the needs we have and then incorporate it into the gitHub class that I've been helping with.

6

Re: Handling special tags like <?xml

I think, at least for the time being, I will not modify htmLawed (something like the second method, above) to address this issue.

It is challenging to discern tags/elements in input text while trying to use fuzzy logic to account for typographical errors (like the first space in '< img src...>'), non-standard tags (like '<?xml...>' or '<fb:like>') and phrases with the greater-than or less-than characters ('<', '>') used for their literal meaning.

For the moment, I suggest using the third or fourth approach (above post).

If you are trying to test different htmLawed code modifications for a better logic, below are some tricky test-cases to play with; use a single as well as multiple lines as the input text.

3 < 4
5 > 4
? > 3
<._.> hi!
<<< ALERT >>>
<![if !vml]> some stuff <![endif]>
<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />
<uml:ns ns = "urn:www">
<uml:ns ns = 'urn:www'>
if(12<age AND 20>age){say 'teen'}
age >51 and a smoking history of >21 pack-years <b>was</b>
age > 51 and a smoking history of >21 pack-years <b>was</b>
age <51 and a smoking history of <21 pack-years <b>was</b>
age < 51 and a smoking history of < 21 pack-years <b>was</b>
<b>age >51 and a smoking history of >21 pack-years</b>
<b>age > 51 and a smoking history of >21 pack-years</b>
<b>age <51 and a smoking history of <21 pack-years</b>
<b>age < 51 and a smoking history of < 21 pack-years</b>
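
A small harness along the following lines may help when trying out pattern changes against these cases. It is only a sketch: $blockPattern is a simplified stand-in that I am assuming for the first pass that collects '<...>' blocks, not the pattern htmLawed actually uses, and classify_blocks() is a hypothetical helper. $tagPattern is the hl_tag() pattern quoted earlier in this thread.

```php
<?php
// Pattern quoted in this thread from the start of hl_tag().
$tagPattern = '`^<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>$`m';

// Simplified stand-in (an assumption, not htmLawed's actual pattern)
// for the first pass that collects '<...>' blocks.
$blockPattern = '`<[^<>]*>`';

// Hypothetical helper: maps each '<...>' block found in $line to whether
// the hl_tag() pattern considers it tag-like.
function classify_blocks($line, $blockPattern, $tagPattern) {
    $result = array();
    preg_match_all($blockPattern, $line, $found);
    foreach ($found[0] as $block) {
        $result[$block] = (bool) preg_match($tagPattern, $block);
    }
    return $result;
}

$cases = array(
    '3 < 4',
    '<._.> hi!',
    '<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />',
    'age <51 and a smoking history of <21 pack-years <b>was</b>',
);

foreach ($cases as $case) {
    foreach (classify_blocks($case, $blockPattern, $tagPattern) as $block => $isTag) {
        echo ($isTag ? 'tag-like:     ' : 'not tag-like: '), $block, "\n";
    }
}
```

Swapping in a candidate $tagPattern and adding the remaining lines from the list to $cases gives a quick before/after comparison.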