1 (edited by DrLightman 2016-11-03 09:49:58)

Topic: Stripping away <script>*</script> and <style>*</style> blocks

I'm trying to strip away script and style code blocks entirely.

HTML in:

<style type="text/css">
.foo { color:blue; }
</style>
<div class="foo">hello world</div>
<script type="text/javascript">
    if (true) {
        var foo = 'bar';
    }
</script>

Tried with this htmLawed call:

$htmlOut = htmLawed(
    $htmlIn,
    array(
        'comment' => 1,
        'safe' => 1,
    )
);

What I'd like:

<div class="foo">hello world</div>

What I get instead:

.foo { color:blue; }

<div class="foo">hello world</div>

    if (true) {
        var foo = 'bar';
    }

I'm not sure it is even possible. I also tried all possible keep_bad values with no luck.

2

Re: Stripping away <script>*</script> and <style>*</style> blocks

Sorry about the delay in posting my reply.

You are right – it is not possible to have htmLawed set to remove element content. The "keep_bad" config. parameter primarily concerns tag content ("<tag ...>" and "</tag>") and its settings dictate the handling of disallowed tags (remove, neutralize, etc.).

During tag balancing, which the config. parameter "balance" enables by default, htmLawed checks whether the plain-text element content following an opening tag, including any neutralized opening tag (which is now plain text itself), is allowed in the parent element. If it is not, that text is removed, provided "keep_bad" is set appropriately.

htmLawed could be given the functionality you describe, but I have avoided including it because of the complicated scenarios it would need to deal with (such as element content containing more elements).

For the current version of htmLawed, the best way I can suggest to handle the cases that you mention is to remove the "script" or "style" elements with a separate PHP function call. E.g.

$temp = htmLawed($in ...);
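// Replace each whole script/style element with an empty string; .*? also matches content that contains "<"
$out = preg_replace('`<((script)|(style))\b[^>]*>.*?</\1\s*>`si', '', $temp);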
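
For anyone who wants to try the separate-function approach on the sample HTML from the first post, here is a minimal sketch. The helper name strip_script_style is made up for illustration and is not part of htmLawed. It assumes the script and style blocks are not nested and that the tags are still present in the string it receives. With settings like 'safe' => 1, htmLawed may already have removed the tags and left only their content (as the output in the first post shows), so in that case the helper would probably need to run on the raw input before htmLawed rather than on its output.

```php
<?php
// Hypothetical helper (not part of htmLawed): removes whole <script> and
// <style> elements, from the opening tag through the closing tag.
function strip_script_style(string $html): string
{
    // (script|style) is captured so that \1 demands the matching closing tag.
    // \b keeps names such as <styles> from matching; .*? is non-greedy so each
    // block stops at its own closing tag; s lets . span newlines; i ignores case.
    $pattern = '`<(script|style)\b[^>]*>.*?</\1\s*>`si';
    $result = preg_replace($pattern, '', $html);
    if ($result === null) {
        // preg_replace returns null on a PCRE error (e.g. backtrack limit).
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    return $result;
}

// Sample input copied from the first post.
$htmlIn = <<<'HTML'
<style type="text/css">
.foo { color:blue; }
</style>
<div class="foo">hello world</div>
<script type="text/javascript">
    if (true) {
        var foo = 'bar';
    }
</script>
HTML;

// Prints: <div class="foo">hello world</div>
echo trim(strip_script_style($htmlIn)), "\n";
```

The cleaned string can then be passed to htmLawed as usual. A regular expression like this is only a stopgap: it does not understand HTML, so unusual markup (for example a literal "</script>" inside a JavaScript string) can end a block early.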