htmLawed HTML filter to secure TinyMCE and other JavaScript-based WYSIWYG editors

JavaScript-based WYSIWYG ("What You See Is What You Get") rich-text/HTML editors like Xinha, TinyMCE, htmlArea, and FCKeditor are commonly used on blogs, content management systems (CMSs), forums, and wikis to simplify the input of HTML-formatted text. Though these JavaScript applications can be configured to restrict the HTML tags and attributes in user-submitted input, users can bypass such restrictions, for instance by simply turning off JavaScript in their browsers. Server-side checks of user-submitted text are therefore required to ensure security and/or compliance with administrative policies.

htmLawed is a simple, fast, easy-to-use and easy-to-configure script for such back-end checks on servers where the PHP engine is available. Besides filtering the HTML in the input, it can also sanitize it by transforming deprecated attributes, applying anti-spam measures, correcting unbalanced tags, tidying with indentation, and so on.

To use htmLawed, the administrator has to create a PHP script that includes the htmLawed.php file. The user-submitted content, typically a value in PHP's $_POST array, is passed to the htmLawed() function for checking and cleansing.

Simple code like that shown below is enough to ensure that the input text is secure and free of scripting-attack code. The 'safe' setting of htmLawed strips dangerous HTML tags like 'script' and attributes like 'onclick' (the htmLawed documentation has more on the 'safe' parameter and on XSS filtering examples).
$out = htmLawed($in, array('safe'=>1));

Code like that shown below additionally ensures that only the indicated HTML tags get through.
$out = htmLawed($in, array('safe'=>1, 'elements'=>'a, b, strong, i, em, li, ol, ul'));

htmLawed can be configured further. The following PHP code illustrates more refined htmLawed usage with an input value received from the TinyMCE editor. (The code will vary with one's requirements.)
// Strip slashes from the user input in case PHP has magic quotes enabled
// (get_magic_quotes_gpc() was removed in PHP 8, hence the function_exists() check)
$in = $_POST['tTinyMceTextArea'];
if(function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()){
    $in = stripslashes($in);
}

// Include htmLawed script; in same directory as this script
include_once('./htmLawed.php');

// Set htmLawed; some configuration need not be specified because the default behavior is good enough
$config = array(
    'safe'=>1, // Dangerous elements and attributes thus not allowed
    'elements'=>'* -table -tr -td -th -tfoot -thead -col -colgroup -caption', // All except table-related are OK
    'deny_attribute'=>'class, id, style' // None of the allowed elements can have these attributes
);
$spec = 'a = -*, title, href'; // The 'a' element can have only these attributes

// The filtering
$out = htmLawed($in, $config, $spec);
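
The code above reads the posted field directly, so a request that omits the field, or sends it as an array, would trigger a PHP notice or error before htmLawed is ever called. The following is only a minimal sketch of a more defensive way to obtain the input; the helper name read_posted_html() is hypothetical and not part of htmLawed.

```php
<?php
// Hypothetical helper (not part of htmLawed): fetch a posted field as a string,
// undoing magic-quotes escaping on old PHP versions where it may be active.
function read_posted_html(array $post, $field)
{
    // A missing or non-string field (e.g., an array sent as "field[]") is treated as empty
    if (!isset($post[$field]) || !is_string($post[$field])) {
        return '';
    }
    $in = $post[$field];

    // get_magic_quotes_gpc() was removed in PHP 8, so check that it exists first;
    // '@' silences the deprecation notice that PHP 7.4 emits for the call
    if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc()) {
        $in = stripslashes($in);
    }
    return $in;
}

// Usage with a simulated POST array; in a real script, pass $_POST instead
$in = read_posted_html(array('tTinyMceTextArea' => '<p>Hello</p>'), 'tTinyMceTextArea');
```

The returned string can then be passed to htmLawed() exactly as $in is in the code above.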

htmLawed does not have a configurable option for directly checking and filtering CSS properties and their values in the 'style' attributes of HTML elements. This is because of the wide variety of combinations possible with multiple elements, CSS properties, and their administrator-defined allowable values. htmLawed does, however, offer indirect ways to check CSS properties. One is to use the 'spec' argument to check 'style' when not too many different values are expected. Another is to disallow the 'style' attribute altogether, restrict 'class' values through 'spec', and ask users to use 'class' instead of 'style'. A third is to use the 'hook_tag' parameter in the 'config' argument of htmLawed to name a custom function that checks, filters, and transforms CSS properties and their values in 'style'. The code below shows an example of such a function. It restricts 'style' to certain elements, and allows only certain CSS properties (such as 'font-family' and 'vertical-align') and property values.
// Filtering for specific CSS properties and their values for some elements/tags 
// Define 'hook_tag' of htmLawed config. as the name of the function coded below
// $config = array( ..., 'hook_tag' => 'my_style_check' , ...);

// htmLawed will pass, one by one, every element's name and all its attributes to this function
// A check is needed only if the element is of a certain type and has the 'style' attribute

// We will allow style properties where they have no meaning. E.g., 'list-style-type' in 'img' element
// This is harmless, and reduces filtering time

// We annul 'style' if there are weird characters in its value indicating malintent

function my_style_check($element, $attribute_array){

  // Only some elements can have 'style' and its value should not look fishy
  // I.e., only word characters (letters, digits, underscores), white-space, semi-colons, colons, commas, number-signs, hyphens and single-quotes are permitted
  static $allowedElements = array('a', 'span', 'p', 'img', 'ul', 'ol', 'li');
  $badMatch = "`[^\w\s;:,#\-']`";
  static $allowedProperties = array('border', 'color', 'display', 'float', 'font-family', 'font-size', 'list-style-type', 'margin', 'margin-left', 'margin-right', 'margin-top', 'margin-bottom', 'text-align', 'text-decoration', 'vertical-align');

  if(in_array($element, $allowedElements) && isset($attribute_array['style']) && !preg_match($badMatch, $attribute_array['style'])){
    $style = $attribute_array['style'];

    // Remove unnecessary white-space
    $style = str_replace(array("\r", "\n", "\t"), '', $style);

    // Identify CSS property names and values in 'style' value
    $properties = explode(';', $style);
    $finalProperties = array();
    foreach($properties as $namevalue){
      $namevalue = explode(':', trim($namevalue));
      $name = strtolower(trim($namevalue[0]));
      // Trim first and compare with '' so that a legitimate value of '0' is not discarded
      $value = isset($namevalue[1]) ? trim($namevalue[1]) : '';
      if($value !== '' and in_array($name, $allowedProperties)){

        // Check property values
        // Note many different ways of doing this
        // Write code considering possible input values
       
        switch($name){
          // stripos() returns 0, which PHP treats as false, for a match at the very start of the list; hence the '!== false' comparisons
          // Note that these are substring tests; any fragment of a list passes
          case 'border':
            if(stripos('solid black', $value) !== false){
              $finalProperties[] = 'border: '. $value;
            }
            break;
          case 'color':
          case 'margin-top':
          case 'margin-bottom':
            $finalProperties[] = $name. ': '. $value;
            break;
          case 'display':
            if(stripos('block', $value) !== false){
              $finalProperties[] = 'display: '. $value;
            }
            break;
          case 'float':
            if(stripos('left right', $value) !== false){
              $finalProperties[] = 'float: '. $value;
            }
            break;
          case 'font-size':
            if(stripos('xx-small medium large xx-large', $value) !== false){
              $finalProperties[] = 'font-size: '. $value;
            }
            break;
          case 'font-family':
            $fonts = explode(',', $value);
            $finalFonts = array();
            foreach($fonts as $font){
              $font = trim(strtolower($font), " '\"");
              if(in_array($font, array('andale mono', 'arial', 'arial black', 'avant garde', 'chicago', 'comic sans ms', 'courier', 'courier new', 'geneva', 'georgia', 'helvetica', 'impact', 'monaco', 'tahoma', 'terminal', 'times', 'times new roman', 'trebuchet ms', 'verdana', 'serif', 'sans-serif'))){
                $finalFonts[] = $font;
              }
            }
            if(!empty($finalFonts)){
              $finalProperties[] = 'font-family: '. implode(', ', $finalFonts);
            }
            break;
          case 'list-style-type':
            if(stripos('circle disc square lower-roman upper-roman lower-greek upper-greek lower-alpha upper-alpha', $value) !== false){
              $finalProperties[] = 'list-style-type: '. $value;
            }
            break;
          case 'margin-left':
          case 'margin-right':
            // Allow 'auto' or a whole-pixel value of at most 600px; the pattern is anchored so nothing else can trail along
            if((strtolower($value) == 'auto') or (preg_match('`^(\d+)\s*px$`i', $value, $m) and intval($m[1]) < 601)){
              $finalProperties[] = $name. ': '. $value;
            }
            break;
          case 'text-align':
            if(stripos('left right center justify', $value) !== false){
              $finalProperties[] = 'text-align: '. $value;
            }
            break;
          case 'text-decoration':
            if(strtolower($value) == 'underline'){
              $finalProperties[] = 'text-decoration: '. $value;
            }
            break;
          case 'vertical-align':
            if(stripos('middle bottom top baseline text-top text-bottom', $value) !== false){
              $finalProperties[] = 'vertical-align: '. $value;
            }
            break;
        } // END SWITCH-CASE
     
      }
    } // END FOREACH LOOP

    // Assign 'style' the filtered value
    $style = implode('; ', $finalProperties);
    if(!empty($style)){
      $attribute_array['style'] = $style;
    }else{
      unset($attribute_array['style']);
    }
  }elseif(isset($attribute_array['style'])){
    unset($attribute_array['style']);
  } // END TOPMOST IF-ELSE CHECK

  // Finally, return to htmLawed the element in the opening tag with attributes
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}
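
The other approach mentioned earlier, asking users to use 'class' instead of 'style', could also be enforced through the same 'hook_tag' mechanism rather than through 'spec'. The following is only a sketch under stated assumptions: the function name my_class_check() and the class names in its allow-list are hypothetical, and it assumes, as my_style_check() does, that htmLawed passes the element name and an array of attributes and expects the opening tag back. The guard for a missing attribute array assumes that some htmLawed versions also call the hook for closing tags; that is worth checking against the documentation of the version in use.

```php
<?php
// Hypothetical hook function: keeps only allow-listed class names and always drops 'style'.
// The class names below are placeholders for whatever the site's stylesheet defines.
function my_class_check($element, $attribute_array = 0)
{
    // Assumption: if no attribute array is received, a closing tag is being handled
    if (!is_array($attribute_array)) {
        return "</{$element}>";
    }

    static $allowedClasses = array('note', 'warning', 'align-left', 'align-right');
    static $emptyElements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1,
        'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);

    // 'style' is never allowed in this scheme
    unset($attribute_array['style']);

    if (isset($attribute_array['class'])) {
        // Split on white-space and keep exact, case-sensitive matches only
        $given = preg_split('`\s+`', trim($attribute_array['class']), -1, PREG_SPLIT_NO_EMPTY);
        $kept = array_values(array_intersect($given, $allowedClasses));
        if ($kept) {
            $attribute_array['class'] = implode(' ', $kept);
        } else {
            unset($attribute_array['class']);
        }
    }

    // Rebuild the opening tag the same way my_style_check() does
    $attributes = '';
    foreach ($attribute_array as $k => $v) {
        $attributes .= " {$k}=\"{$v}\"";
    }
    return "<{$element}{$attributes}" . (isset($emptyElements[$element]) ? ' /' : '') . '>';
}

// Direct call for illustration; htmLawed would normally invoke the hook itself
echo my_class_check('p', array('class' => 'note evil', 'style' => 'color: red', 'title' => 'x'));
// prints: <p class="note" title="x">
```

As with my_style_check(), the function would be named in the configuration, e.g., 'hook_tag' => 'my_class_check'. Matching class names exactly against an array, instead of searching a space-separated string, avoids accepting fragments of allowed names.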

To learn more about configuring htmLawed, visit the htmLawed web-site: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed.
