1

Topic: HTML Entities in attribute values

I've been playing around with the various config options for a while but can't figure out a way to tell htmLawed to convert HTML special characters, such as <, >, and ", to their entity equivalents (&lt;, &gt;, &quot;, etc.) before running its other rules.

The problem I have is in a system where some attribute values contain tags for previously used hacky JavaScript stuff.  What happens is htmLawed's filters treat these attribute values as actual tags and break everything.  For example:

<a href="#" title="<img src='image.gif' />" onclick="function1(); function2();">Do stuff</a>

When that is run through htmLawed it becomes this:

<a href="#">" onclick="function1(); function2();"&gt;Do stuff</a>

Obviously this is not good!  What I expect it to output (presuming we're allowing onclick attribute in the config) is this:

<a href="#" title="&lt;img src='image.gif' /&gt;" onclick="function1(); function2();">Do stuff</a>

Can anyone give me ideas?  Is this just a bug? (Or a feature that hasn't yet been implemented!)

2

Re: HTML Entities in attribute values

Just a further note - it breaks even when I tell it to remove all title attributes as well.

3

Re: HTML Entities in attribute values

htmLawed primarily works by identifying tags in the input text by looking for the '<' and '>' characters. The assumption of filters/purifiers like htmLawed has to be that something like <a <img src="x" />> is an error and not a deliberate construct.

So, in the case you mention, there is no way to get around htmLawed's behavior.

However, you can run some custom code on the input before passing it to htmLawed. The logic of the custom code will depend on the type of 'HTML errors' expected in the inputs. If the input is HTML-correct except for the instances of tags within attributes of elements, you can try:

// Function to entitify < and > within attribute values
function my_function($match){
 return '="'. str_replace(array('<', '>'), array('&lt;', '&gt;'), $match[1]). '"';
}

// Entitify tags within double-quoted attribute values by passing
// input through the above function; [^"]* keeps each match
// inside a single attribute value
$text = preg_replace_callback('`\s*=\s*"([^"]*)"`', 'my_function', $text);

// Run htmLawed
$config = array(...);
$out = htmLawed($text, $config...);
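
To check the pre-processing step on its own, here is a minimal sketch that runs it against the sample link from the first post. The function name entitify_attr_brackets is made up for illustration and is not part of htmLawed, and htmLawed itself is not called here. It assumes attribute values are double-quoted and contain no literal double quotes; single-quoted values are left untouched.

```php
<?php
// Hypothetical helper (not part of htmLawed): entitify < and > inside
// double-quoted attribute values before the text is handed to a filter.
function entitify_attr_brackets($text) {
    return preg_replace_callback(
        // [^"]* stops at the closing quote, so each match covers one value
        '`\s*=\s*"([^"]*)"`',
        function ($match) {
            $value = str_replace(array('<', '>'), array('&lt;', '&gt;'), $match[1]);
            return '="' . $value . '"';
        },
        $text
    );
}

// Sample input taken from the first post in this thread
$in = '<a href="#" title="<img src=\'image.gif\' />" onclick="function1(); function2();">Do stuff</a>';
echo entitify_attr_brackets($in), "\n";
// prints: <a href="#" title="&lt;img src='image.gif' /&gt;" onclick="function1(); function2();">Do stuff</a>
```

The result can then be passed to htmLawed as usual. Note that the pattern also matches ="..." sequences in ordinary text outside tags, so it is only suitable when that is acceptable for the input.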

4

Re: HTML Entities in attribute values

Thanks for the feedback.  I agree that it should be considered erroneous input but at the end of the day, I'm using this because I fully expect people to throw completely broken HTML at it!