1

Topic: Removing script rather than converting to text

Hi,

First, thanks for this function. It is working very well for me. I have two questions.

1. Scripts are being converted to text instead of being removed, so I see something like

<script>...........</script>

in the content.

Is there a configuration to get the entire script removed rather than converting it to text?

2. Is there any config to linkify URLs? Can the script add the <a href=""> etc. to text that looks like www.htmlawed.com?

My config is
       $cleanedhtml=htmLawed($rawhtml,array('safe'=>1,'elements'=>'p,a,b,strong,h2,h3,h4,ol,ul,li,em,img,br,embed','balance'=>1,'anti_link_spam'=>array('`.`',""),'anti_mail_spam'=>'__at__','comment'=>1,'schemes'=>'href: http','clean_ms_char'=>1,'deny_attribute'=>'class,style,on*,id,script'),'a=-*,href,title;img=height(minval/1),width(maxval=480/minval=1)');

Thanks in advance.

2

Re: Removing script rather than converting to text

1. Have a look at the keep_bad $config parameter.

2. No, htmLawed won't do that, but you can easily find a simple regular-expression-based code snippet on the internet to linkify the input before it is handed to htmLawed.
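
For what it's worth, here is a minimal sketch of such a linkify pre-filter. It is not part of htmLawed: the function name linkify_text and the URL pattern are my own assumptions, and the pattern is deliberately simple (it only catches text starting with http://, https:// or www.). It splits the input into tags and text so that URLs already inside tags, attributes, or existing <a> elements are left alone.

```php
<?php
// Hypothetical helper (not part of htmLawed): wrap bare URLs in <a> tags.
function linkify_text(string $html): string
{
    // Split into tags and text; with PREG_SPLIT_DELIM_CAPTURE the captured
    // tags land on odd indexes and the text between them on even indexes.
    $parts = preg_split('`(<[^>]*>)`', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $insideAnchor = false;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: only track whether we are inside <a>...</a>.
            if (preg_match('`^<a\b`i', $part)) {
                $insideAnchor = true;
            } elseif (preg_match('`^</a\b`i', $part)) {
                $insideAnchor = false;
            }
            continue;
        }
        if ($insideAnchor) {
            continue; // text that is already linked stays as it is
        }
        // Assumed URL pattern: starts with http(s):// or www., and does not
        // end in punctuation that usually belongs to the sentence.
        $parts[$i] = preg_replace_callback(
            '`\b(?:https?://|www\.)[^\s<>"\']*[^\s<>"\'.,;:!?)]`i',
            function (array $m): string {
                $url = $m[0];
                // Bare www. addresses need a scheme to work as an href.
                $href = stripos($url, 'http') === 0 ? $url : 'http://' . $url;
                return '<a href="' . $href . '">' . $url . '</a>';
            },
            $part
        );
    }

    return implode('', $parts);
}

echo linkify_text('Visit www.htmlawed.com.'), "\n";
```

Running it on the input first and then passing the result to htmLawed means the generated links still go through the 'schemes' and 'anti_link_spam' settings in the config.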

3

Re: Removing script rather than converting to text

Patnaik,

Thanks for the quick response. I added keep_bad as follows:
       $cleanedhtml=htmLawed($rawhtml,array('safe'=>1,'elements'=>'p,a,b,strong,h2,h3,h4,ol,ul,li,em,img,br,embed','keep_bad'=>0,'balance'=>1,'anti_link_spam'=>array('`.`',""),'anti_mail_spam'=>'__at__','comment'=>1,'schemes'=>'href: http','clean_ms_char'=>1,'deny_attribute'=>'class,style,on*,id,script'),'a=-*,href,title;img=height(minval/1),width(maxval=480/minval=1)');

But that did not make a difference. I still see the script being converted to text rather than removed. Is there a conflicting directive in my configuration that is causing the problem?

Thanks.

4

Re: Removing script rather than converting to text

I think I misunderstood your post. The parameter 'keep_bad' cannot be used for removing element content (the text between the 'script' tags in your case).

Can you use a regular expression-based filter before passing the input to htmLawed, like:

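// Pre-filter all '<script...>...</script>' elements, including their content
// (modifiers go inside the pattern string: i = case-insensitive, s = '.' also matches newlines)
$in = preg_replace('`<script\b[^>]*>.*?</script\s*>`is', '', $in);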
// Pass on to htmLawed
$config = array(...);
$out = htmLawed($in, $config...);
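
To try the pre-filter idea on its own, here is a small self-contained sketch. The function name strip_script_elements and the sample input are made up for illustration, and the pattern is just one reasonable choice: it is meant to cope with upper-case tags, attributes on the opening tag, and script bodies that span lines or contain '<'. It will not catch an unclosed <script> tag, and a '>' inside an attribute value could confuse it.

```php
<?php
// Hypothetical helper (not part of htmLawed): drop <script> elements and their content.
function strip_script_elements(string $html): string
{
    // i = case-insensitive tag names, s = let '.' match newlines.
    // The lazy '.*?' stops at the first closing tag, so two separate
    // scripts are removed separately and the text between them is kept.
    $out = preg_replace('`<script\b[^>]*>.*?</script\s*>`is', '', $html);

    // preg_replace() returns null on error (e.g. backtrack limit reached);
    // in that case hand back the input unchanged and let htmLawed deal with it.
    return $out === null ? $html : $out;
}

$in = "<p>Hello</p><SCRIPT type=\"text/javascript\">\nif (a < b) { alert('x'); }\n</script><p>Bye</p>";
echo strip_script_elements($in), "\n"; // prints: <p>Hello</p><p>Bye</p>
```

The cleaned string would then be passed to htmLawed with the usual config, as in the snippet above.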