Topic: Question, maybe a bug in regular expression parsing in spec
Hi, I'm Rafa from Spain. Sorry for my poor English.
I'm developing a web site with the symfony framework (http://www.symfony-project.org). I was looking for information about security, XSS, and validating HTML input for correctness. After a while I decided not to implement bbCode or a similar solution, but to use HTMLPurifier or something like it, coupled with TinyMCE, to validate and limit the input.
Let me tell you, I was painfully defeated by HTMLPurifier. So many classes, configuration directives... A whole on-disk cache system... When I found htmLawed and saw its footprint (above all, its source size), I instantly abandoned HTMLPurifier.
Back on topic. I'm trying to use a very tight configuration where only a few tags and very few attributes are allowed. Everything worked like a charm except when I tried to use regular expressions to control attribute content.
An example of a rule I'm using:
...
$rules[] = 'span=-*, style(match="/^(?:\s)*text-decoration: line-through;(?:\s)*$/i")';
...
$spec = implode(';', $rules);
The problem is that htmLawed appears to ignore the regular expression. As you can see, the example is a very simple one.
I dug a bit and found that the hl_spec function, which appears to be the function that parses the spec, manipulates and modifies the regular expressions found in the spec while validating them. I'm not a regex ninja, so you can probably spot the problem better than I can. The regular expression that ends up in hl_regexp looks like this:
"/^(?:\s)*text-decoration: line-through;(?:\s)*$/i\
Please note the initial double quote and the final backslash.
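To illustrate why a pattern in that shape would be ignored, here is a minimal sketch. It assumes the string is eventually handed to preg_match (I have not confirmed that this is what htmLawed does internally). The variable names are made up for the example. With a leading double quote, PCRE treats the quote as the delimiter and never finds a closing one, so the call fails instead of matching.

```php
<?php
// The pattern as written in the spec.
$clean = '/^(?:\s)*text-decoration: line-through;(?:\s)*$/i';

// The same pattern with the stray leading quote and trailing backslash
// described above ('\\' is a single backslash in a single-quoted string).
$mangled = '"' . $clean . '\\';

$value = 'text-decoration: line-through;';

// @ suppresses the "No ending delimiter" warning raised for $mangled.
$okClean   = @preg_match($clean, $value);    // integer match count
$okMangled = @preg_match($mangled, $value);  // false: pattern is invalid
```

If the parser checks patterns this way and drops the ones that fail, that would explain why the rule silently has no effect.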
My solution right now is to replace the following line:
$y[$an][strtolower(substr($m, 0, $pm))] = str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08"), array(";", "|", "~", " ", ",", "/", "(", ")"), substr($m, $pm+1));
with these:
$pm = strpos($m, '=');
$index = strtolower(substr($m, 0, $pm));
$exp = substr($m, $pm+2, strlen($m)-($pm+3));
$final_exp = str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08"), array(";", "|", "~", " ", ",", "/", "(", ")"), $exp);
$y[$an][$index] = $final_exp;
As you can see, I like variables very much ;). It's just for clarity. The real difference is in the arguments passed to substr, where I raise the start index by one and remove one additional character from the end, so that neither the double quote nor the backslash remains. I'm sure there are much better ways to correct this, though. Even now I'm not really sure whether all this is my own fault or an artifact of a bug somewhere else.
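For anyone who wants to check the substr arithmetic outside of htmLawed, here is a standalone sketch of the same steps. The function name split_rule_fragment and the sample fragment are assumptions for illustration; the real value of $m inside hl_spec may look different (for example, it may still contain the \x01 to \x08 placeholder bytes).

```php
<?php
// Hypothetical helper for illustration only; it is not part of htmLawed.
// $m is one rule fragment as described in the post: name, '=', an opening
// double quote, the pattern, and one stray trailing character.
function split_rule_fragment($m)
{
    $pm = strpos($m, '=');
    $index = strtolower(substr($m, 0, $pm));
    // Start one character past the opening quote and stop one character
    // before the end, so neither the quote nor the backslash survives.
    $exp = substr($m, $pm + 2, strlen($m) - ($pm + 3));
    // Restore the characters that the spec parser temporarily replaced
    // with control bytes (same mapping as in the original line).
    $exp = str_replace(
        array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08"),
        array(';', '|', '~', ' ', ',', '/', '(', ')'),
        $exp
    );
    return array($index, $exp);
}

// Sample fragment (assumed shape); '\\' is a single backslash.
$fragment = 'match="/^(?:\s)*text-decoration: line-through;(?:\s)*$/i\\';
list($index, $exp) = split_rule_fragment($fragment);
```

With this input, $exp comes out as the bare pattern with its slashes and the i flag intact, which preg_match accepts. Note that the fixed offsets only work if the fragment always has exactly one quote in front and one extra character at the end; trimming by character instead of by position might be more robust.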
I'm using an up-to-date Ubuntu Hardy Heron, which ships with PHP 5.2.4.
So this is part question (is all this my fault?) and part bug report. Thanks in advance, and sorry if the bug is obvious, but I'm tired. It's hard to swim in someone else's code... I'm writing too much, better go to sleep. Regards.