1

Topic: Question, maybe a bug in regular expression parsing in spec

Hi, I'm Rafa from Spain. Sorry for my poor English.

I'm developing a web site with the symfony framework (http://www.symfony-project.org). I was looking for information about security, XSS and validating HTML input. After a while I decided not to implement bbCode or similar solutions, but to use HTMLPurifier or something similar, coupled with TinyMCE, to validate and limit the input.

Let me tell you, I was painfully defeated by HTMLPurifier. So many classes, configuration directives... A whole on-disk cache system... When I found htmLawed and saw its footprint (source size above all), I instantly abandoned HTMLPurifier.

Back on topic. I'm trying to use a very tight configuration where only a few tags and very few attributes are allowed. Everything worked like a charm until I tried to use regular expressions to control the attribute values.

An example of a rule I'm using:

...
$rules[] = 'span=-*, style(match="/^(?:\s)*text-decoration: line-through;(?:\s)*$/i")';
...
$spec = implode(';', $rules);

The problem is that htmLawed appears to ignore the regular expression. As you can see, the example is a very simple one.

I dug a bit and found that the hl_spec function, which appears to be the function that parses the spec, manipulates and modifies the regular expressions found in the spec while validating them. I'm not a good reg-exp ninja, so you can probably find the problem. The regular expression that ends up in hl_regexp looks like this one:
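A quick way to confirm that the pattern itself is fine is to run it through preg_match outside htmLawed. This is only a minimal sketch; the sample style values are made up for illustration.

```php
<?php
// The pattern from the rule above, exactly as PCRE should receive it.
$pattern = '/^(?:\s)*text-decoration: line-through;(?:\s)*$/i';

// Sample style values, invented for this check.
$accepted = preg_match($pattern, ' text-decoration: line-through; ');
$rejected = preg_match($pattern, 'color: red;');

var_dump($accepted); // int(1)
var_dump($rejected); // int(0)
```

If both results are as expected, the pattern is valid on its own and the problem has to be in how the spec string is processed.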

"/^(?:\s)*text-decoration: line-through;(?:\s)*$/i\

Please note the leading double-quote and the trailing back-slash.

My solution right now is to replace the following line:
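To see why a pattern in that shape cannot work, it can be passed to preg_match directly. A small sketch, assuming only that the string reaches PCRE exactly as shown above:

```php
<?php
// The pattern as it reportedly arrives in hl_regexp: a stray leading
// double-quote and a trailing back-slash.
$mangled = '"/^(?:\s)*text-decoration: line-through;(?:\s)*$/i\\';

// PCRE treats the first character (") as the delimiter and never finds a
// closing one, so the pattern fails to compile. @ hides the warning.
$result = @preg_match($mangled, 'text-decoration: line-through;');

var_dump($result); // bool(false)
```

preg_match returns false for a pattern it cannot compile, which would be consistent with the rule being silently ignored.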

$y[$an][strtolower(substr($m, 0, $pm))] = str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08"), array(";", "|", "~", " ", ",", "/", "(", ")"), substr($m, $pm+1));

With this one:

$pm = strpos($m, '=');
$index = strtolower(substr($m, 0, $pm));
$exp = substr($m, $pm+2, strlen($m)-($pm+3));
$final_exp =  str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08"), array(";", "|", "~", " ", ",", "/", "(", ")"), $exp);
$y[$an][$index] = $final_exp;

As you can see, I like variables very much ;). It's just for clarity. The only real difference is in the parameters passed to substr: I raise the start index by one and drop one more character from the end, so neither the double-quote nor the back-slash survives. I'm sure there are much better ways to correct this, though. Even now I'm not really sure whether all this is my fault or an artifact/bug somewhere else.

I'm using an up-to-date Ubuntu Hardy Heron, which ships with PHP 5.2.4.

So this is part question (is all this my fault?) and part bug report. Thanks in advance, and sorry if the bug is obvious, but I'm tired. It's hard to swim in someone else's code... I'm writing too much, better go to sleep. Regards.
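The index arithmetic can be checked in isolation. In this sketch $m is a simplified stand-in for one "name=value" piece in the buggy case (in hl_spec the special characters would still be control-character placeholders at this point), so the value is an assumption made for illustration:

```php
<?php
// Simplified stand-in for a "name=value" piece: the value still carries
// the stray leading " and trailing \.
$m  = 'match="/^\w/i\\';
$pm = strpos($m, '=');   // position of '=', here 5

// Original line: everything after '=' is kept, stray characters included.
$original = substr($m, $pm + 1);

// Workaround: start one character later and stop one character earlier.
$patched = substr($m, $pm + 2, strlen($m) - ($pm + 3));

echo $original, "\n"; // "/^\w/i\
echo $patched, "\n";  // /^\w/i
```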

2

When specifying $spec, certain special characters need to be escaped. The semi-colon ';', which I see in the value you use, is one of them. From the documentation:

Special characters: The characters ;, ,, /, (, ), |, ~ and space have special meanings in the rules. Words in the rules that use such characters, or the characters themselves, should be flanked by double-quotes ("). A back-tick (`) can be used to escape a literal " inside such words. An example rule illustrating this is input=value(maxlen=30/match="/^\w/i"/default="your `"ID`"").

Let me know if you have issues even after escaping such characters.

3 (edited by Galvesband 2008-06-30 03:27:21)

I read that paragraph ten times before posting my initial comment, and yet it seems I still don't get it.

As I understand it, you mean I have to put double-quotes around the semicolon. I thought that paragraph was saying to put double-quotes around the whole regular expression. But I just tested adding double-quotes around the semicolon and it doesn't seem to work either. Can you explain with a working example, or in words even I could understand?

I tried this:

$rules[] = 'span=-*, style(match="/^(?:\s)*text-decoration: line-through";"(?:\s)*$/i")';

Thanks in advance.

4

I am sorry for suggesting you had a mistake in the $spec value and wasting some of your time. I will look into this and report back.

5

Thank you, Rafa, for pointing this out: it is indeed a bug.

It is fixed (just in time) in the new 1.1 version of htmLawed.

The problem was caused by a misquoted (?) match reference in a preg_replace call in the hl_spec function. Instead of \'\\0\', a simpler "$0" is now used. I think you'll be able to find-and-replace it in your htmLawed.php file.

... preg_replace('/"(?>(`.|[^"])*)"/sme', 'substr(str_replace(array(";", "|", "~", " ", ",", "/", "(", ")", \'`"\'), array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08", "\""), "$0"), 1, -1)', trim($t)));
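A note for anyone on a newer PHP: the /e modifier used in that call was deprecated in PHP 5.5 and removed in PHP 7.0. The same idea could be sketched with preg_replace_callback, which hands the raw match to the callback with no extra escaping. This is an illustration under assumptions, not htmLawed's actual code; protect_quoted is a hypothetical helper name.

```php
<?php
// Special characters and the control-character placeholders that stand in
// for them, in the same order as in the snippets in this thread.
$special      = array(';', '|', '~', ' ', ',', '/', '(', ')');
$placeholders = array("\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07", "\x08");

// Hypothetical helper (not htmLawed's code): hide special characters that
// sit inside double-quoted words, then drop the surrounding quotes.
function protect_quoted($spec, array $special, array $placeholders)
{
    return preg_replace_callback(
        '/"(?>(`.|[^"])*)"/sm',
        function (array $m) use ($special, $placeholders) {
            // $m[0] is the whole quoted word, including both quotes.
            $hidden = str_replace($special, $placeholders, $m[0]);
            $hidden = str_replace('`"', '"', $hidden); // turn `" into "
            return substr($hidden, 1, -1);             // strip outer quotes
        },
        $spec
    );
}

$protected = protect_quoted('style(match="/^x;y$/i")', $special, $placeholders);

// Later, once the rule has been split on the real separators, the
// placeholders can be turned back into the original characters.
$restored = str_replace($placeholders, $special, $protected);

echo $restored, "\n"; // style(match=/^x;y$/i)
```

The parentheses outside the quoted word are left alone, so they can still act as rule syntax while the quoted value is protected.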

You may have noticed that $spec can also be specified as an array:

array(
  'span'=>array(
    'style'=>array(
      'match'=>'\\i',
      'maxlen'=>50,
      ...
    ),
    ...
  ),
  ...
);

Also, the new 1.1 version of htmLawed enables using a custom hook function to finely check attribute values; a sort of alternative/supplement for $spec.

6

Thank you for your fast support. I fixed my copy, removed the double-quotes around the semicolons, and everything worked just as the documentation said. I'm also glad I reported this in time for the new release. I will probably update to the new version.

Again, thank you.