/*
htmLawed_README.txt
htmLawed 1.0, 2 November 2007
Copyright Santosh Patnaik
GPL 3 license
A PHP Labware internal utility - http://bioinformatics.org/phplabware/internal_utilities/htmLawed
*/


== Content ==========================================================


1  About htmLawed
  1.1  Example uses
  1.2  Features
  1.3  History
  1.4  License & copyright
  1.5  Terms used here
2  Usage
  2.1  Simple
  2.2  Configuring htmLawed using the '$config' parameter
  2.3  Extra HTML specifications using the '$spec' parameter
  2.4  Performance time and memory usage
  2.5  Some security risks to keep in mind
  2.6  Using without modifying old 'kses()' code
  2.7  Tolerance for ill-written HTML
  2.8  Limitations & work-arounds
3  Details
  3.1  Invalid/dangerous characters
  3.2  Character references/entities
  3.3  HTML elements
    3.3.1  HTML comments and 'CDATA' sections
    3.3.2  Tag-transformation for better XHTML-Strict
    3.3.3  Tag balancing and proper nesting
  3.4  Attributes
    3.4.1  Auto-addition of XHTML-required attributes
    3.4.2  Duplicate/invalid 'ID' values
    3.4.3  URL schemes (protocols) and scripts in attribute values
    3.4.4  Absolute & relative URLs
    3.4.5  Lower-cased, standard attribute values
    3.4.6  Transformation of deprecated attributes
    3.4.7  Anti-spam & 'href'
    3.4.8  Inline style properties
  3.5  Simple configuration directive for most valid XHTML
  3.6  Using a hook function
4  Other
  4.1  Support
  4.2  Known issues
  4.3  Change-log
  4.4  Testing
  4.5  Upgrade
  4.6  Donate
5  Appendices
  5.1  Characters discouraged in HTML
  5.2  Valid attribute-element combinations
  5.3  CSS (2.1) properties accepting URLs
  5.4  Microsoft character replacements
  5.5  URL format


== 1  About htmLawed ================================================


htmLawed is a single-file PHP script that makes input text more secure, more standard-compliant, and generally suitable, from a web-page administrator's viewpoint, for use in the body of HTML 4, XHTML 1 or XHTML 1.1 documents. It is thus a customizable HTML/XHTML filter, processor, purifier and sanitizer, like the 'Kses' and 'HTMLPurifier' PHP scripts.

The `lawing in` of input text is needed to ensure that HTML code in the text is standard-compliant, does not introduce security vulnerabilities, and does not break a web-page's design/layout. htmLawed does this by, for example, making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting ('XSS') attacks, and allowing only specified HTML elements/tags and attributes.


-- 1.1  Example uses ------------------------------------------------


*  Filtering of text submitted as comments on blogs to allow only certain HTML elements

*  Making RSS/Atom newsfeed item content standard-compliant: often one uses an excerpt from an HTML document for the content, and with unbalanced tags, non-numerical entities, etc., such excerpts may not be XML-compliant.

*  Meeting stricter XML requirements, like a lowercased 'x' in hexadecimal numeric entities, which become necessary when an XHTML document has to be served as 'application/xml' because, e.g., it has MathML content


-- 1.2  Features ---------------------------------------------------o


Key: '*' security feature, '^' standard compliance, '~' requires setting right options, '`' different from 'Kses'

*  HTML in input may be highly ill-written; htmLawed will make it *secure* and *standard-compliant*
*  output can be used in HTML 4, XHTML 1.0, *XHTML 1.1*, or even generic *XML* documents  ^~`

*  options to *restrict elements*  ^~`
*  proper closure of empty elements like 'img'  ^`
*  *deprecated elements* like 'u' can be transformed  ^~`
*  HTML *comments* and 'CDATA' sections can be permitted  ^~`
*  'script' elements can be permitted  ~

*  options to *restrict attributes*  ^~`
*  removal of *invalid attributes*  ^`
*  element and attribute names are *lower-cased*  ^
*  provides *required attributes*, like 'action' for 'form', when missing  ^`
*  *deprecated attributes* can be transformed  ^~`
*  attributes *declared only once*  ^`

*  options to *restrict attribute values*  ^~`
*  a value is declared for `empty` (`minimized`) attributes like 'checked'  ^
*  attributes with potentially dangerous values (that can cause buffer overflows and denial of service attacks) can be removed after checking their lengths or values  *~
*  *unique* 'id' attribute values can be ensured  ^~`
*  attribute values are enclosed in *double-quotes*  ^
*  *standard attribute values* are lower-cased (like 'type="password"')  ^`

*  *attribute-specific URL protocol/scheme restriction*  *~`
*  *dynamic expressions* in 'style' values can be disabled  *~`

*  non-numeric, named character entities not in the HTML standard are neutralized  ^`
*  hexadecimal numeric entities may be made decimal ones, or vice versa  ^~`
*  HTML-specific named character entities can be converted to numeric ones for generic XML use  ^~`

*  removes *null* characters from input  *
*  neutralizes potentially dangerous proprietary Netscape *Javascript entities*  *
*  removes *soft-hyphen* character (code-point '173' or '#xad') in attribute values -- a vulnerability in some versions of the Opera browser  *

*  *invalid characters* not allowed in HTML or XML are removed  ^`
*  *characters from Microsoft applications* like 'Word' that are discouraged in HTML or XML can be replaced with good ones  ^~`
*  entities for characters not allowed or discouraged in HTML or XML are neutralized  ^`
*  appropriately neutralizes '<', '&', '"', and '>' characters  ^*`

*  understands improperly spaced tag content (e.g., spread over more than one line) and spaces it properly  `
*  can *balance tags* for well-formedness  ^~`
*  can permit only *validly nested tags*  ^~`

*  fast, *non-OOP* code of ~45 kb incurring peak basal memory usage of ~0.5 MB
*  *compatible* with pre-existing code using 'Kses' (the filter used by 'WordPress')

*  optional *anti-spam* measures such as addition of 'rel="nofollow"' and link-disabling  ~`
*  optionally makes *relative URLs absolute*, and vice versa  ~`

*  *independent of character encoding* of input and does not affect it
*  *won't change formatting* of element content by affecting line-breaks, spaces or tabs outside tags but normalizes white spaces in tag content


-- 1.3  History ----------------------------------------------------o


htmLawed was developed for use with 'LabWiki', a wiki software developed at PHP Labware, as no suitable software could be found. Existing PHP software like 'Kses' and 'HTMLPurifier' were deemed inadequate, slow or resource-intensive, or dependent on external applications like 'HTML Tidy'.

htmLawed started as a modification of Ulf Harnhammar's 'Kses' (version 0.2.2) software. It still follows the 'Kses' `way`, and uses some of Ulf's code. htmLawed is compatible with code that uses 'Kses'; see section:- #2.6.


-- 1.4  License & copyright ----------------------------------------o


htmLawed is free and open-source software licensed under GPL license version 3:- http://www.gnu.org/licenses/gpl-3.0.txt, and copyrighted by Santosh Patnaik, MD, PhD.


-- 1.5  Terms used here --------------------------------------------o


*  `administrator` - person setting up the code to pass input through htmLawed; also, `user` or `admin`
*  `attributes` - name-value pairs like 'href="http://x.com"' in opening tags
*  `author` - `writer`
*  `entity` - markup like '&gt;' and '&#160;' used to refer to a character
*  `element` - HTML element like 'a' and 'img'
*  `element content` -  content between the opening and closing tags of an element, like 'click' of '<a href="x">click</a>'
*  `HTML` - implies XHTML unless specified otherwise
*  `input` - text string given to htmLawed to process
*  `processing` - involves filtering, correction, etc., of input
*  `scheme` - URL protocol like 'http' and 'ftp'
*  `specs` - standard specifications
*  `style property` - terms like 'border' and 'height' for which declarations are made in values for the 'style' attribute of elements
*  `tag` - markers like '<a href="x">' and '</a>' delineating element content; the opening tag can contain attributes
*  `tag content` - consists of tag markers '<' and '>', element names like 'div', and possibly attributes
*  `user` - administrator
*  `writer` - end-user like a blog commenter providing the input that is to be processed; also, `author`


== 2  Usage ========================================================oo


htmLawed should work with PHP 4.3 and higher. Either 'include()' the 'htmLawed.php' file or copy-paste the entire code.

To easily *test* htmLawed using a form-based interface, use the provided demo:- htmLawedTest.php web-page ('htmLawed.php' and 'htmLawedTest.php' should be in the same directory on the web-server).


-- 2.1  Simple ------------------------------------------------------


The input text to be processed, '$text', is passed as an argument of type string; 'htmLawed()' returns the processed string:

    $processed = htmLawed($text);

*Note*: If input is from a '$_GET' or '$_POST' value, and 'magic quotes' are enabled on the PHP setup, run 'stripslashes()' on it before passing to htmLawed.
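
The note above can be sketched as a small helper. This is an illustration only: 'raw_input_value()' and the 'comment' form field are hypothetical names, and 'htmLawed()' is called only if 'htmLawed.php' has already been included.

```php
<?php
// Hypothetical helper: undo magic-quotes escaping only when it is active.
function raw_input_value($value)
{
    // get_magic_quotes_gpc() was removed in PHP 8, so check that it exists;
    // '@' hides the deprecation notice that PHP 7.4 emits for it.
    if (function_exists('get_magic_quotes_gpc') && @get_magic_quotes_gpc()) {
        return stripslashes($value);
    }
    return $value;
}

// 'comment' is an assumed form-field name.
$text = isset($_POST['comment']) ? raw_input_value($_POST['comment']) : '';

// Assumes htmLawed.php has been include()d elsewhere.
if (function_exists('htmLawed')) {
    $processed = htmLawed($text);
}
```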

By default, htmLawed will process the text allowing all valid HTML elements/tags, secure URL scheme/CSS style properties, etc. It will allow 'CDATA' sections and HTML comments, balance tags, and ensure proper nesting of elements. Such actions can be configured using two other optional arguments -- '$config' and '$spec':

    $processed = htmLawed($text, $config, $spec);

These extra parameters are detailed below.


-- 2.2  Configuring htmLawed using the '$config' parameter ---o


'$config' instructs htmLawed on how to tackle certain tasks. When '$config' is not specified, or not set as an array (e.g., '$config = 1'), htmLawed will take default actions. One or many of the task-action/value-specification pairs can be specified in '$config' as array key-value pairs (when a pair is not specified, htmLawed will take the default action for that task):

    $config = array('comment'=>0, 'cdata'=>1);
    $processed = htmLawed($text, $config);
    
Or,

    $processed = htmLawed($text, array('comment'=>0, 'cdata'=>1));

Below are the possible task-action / value-specification combinations. In PHP code, values that are integers should not be quoted and should be used as numeric types (unless meant as string/text).

Key: '*' default, '^' different default when htmLawed is used in the Kses-compatible mode (see section:- #2.6), '~' different default when 'valid_xhtml' is set to '1' (see section:- #3.5)

*abs_url*
Make URLs absolute or relative; '$config["base_url"]' needs to be set; see section:- #3.4.4

'-1' - make relative
'0' - no action  *
'1' - make absolute

*anti_link_spam*
Anti-spam; see section:- #3.4.7
      
'0' - no measure taken  *
'array("regex1", "regex2")' - will ensure a 'rel' attribute with 'nofollow' in its value in case the 'href' attribute value matches the regular expression pattern 'regex1', and/or will remove 'href' if its value matches the regular expression pattern 'regex2'. E.g., 'array("/./", "/://\W*(?!(abc\.com|xyz\.org))/")'. This is a parameter for advanced usage. See section:- #3.4.7 for more.

*anti_mail_spam*
Anti-spam; see section:- #3.4.7
      
'0' - no measure taken  *
'word' - '@' in mail address in 'href' attribute value is replaced with 'word' -- a word of admin's choice, like 'NOSPAM@' and 'AT'.

*balance*
Balance tags for well-formedness and proper nesting; see section:- #3.3.3

'0' - no
'1' - yes  *

*base_url*
Base URL value that needs to be set if '$config["abs_url"]' is not '0'; see section:- #3.4.4

*cdata*
Handling of 'CDATA' sections; see section:- #3.3.1

'0' - don't consider 'CDATA' sections as markup and proceed as if plain text  ^
'1' - remove
'2' - allow, but neutralize any '<', '>', and '&' inside by converting them to named entities
'3' - allow  *

*clean_ms_char*
Replace discouraged characters introduced by Microsoft Word, etc.; see section:- #3.1
      
'0' - no  *
'1' - yes
'2' - yes, but replace special single & double quotes with ordinary ones

*comment*
Handling of HTML comments; see section:- #3.3.1

'0' - don't consider comments as markup and proceed as if plain text  ^
'1' - remove
'2' - allow, but neutralize any '<', '>', and '&' inside by converting to named entities
'3' - allow  *

*css_expression*
Allow dynamic CSS expression by not removing the expression from CSS property values in 'style' attributes; see section:- #3.4.8

'0' - remove  *
'1' - allow  ^

*deny_attribute*
Denied HTML attributes; see section:- #3.4

'0' - none  *
'string' - dictated by values in 'string'

*elements*
Allowed HTML elements; see section:- #3.3

'* -center -dir -font -isindex -menu -s -strike -u' -  ~

*hexdec_entity*
Allow hexadecimal numeric entities and do not convert to the more widely accepted decimal ones, or convert decimal to hexadecimal ones; see section:- #3.2

'0' - no
'1' - yes  *
'2' - convert decimal to hexadecimal ones

*hook*
Name of an optional hook function to pre-process input string, and optionally '$config' or '$htm', before htmLawed starts its main work; see section:- #3.6

'0' - no hook function  *
'name' - 'name' is name of the hook function ('kses_hook'  ^)

*keep_bad*
Neutralize bad tags by converting '<' and '>' to entities, or remove them; see section:- #3.3.3

'0' - remove  ^
'1' - neutralize both tags and element content
'2' - remove tags but neutralize element content
'3' and '4' - like '1' and '2' but remove if text ('pcdata') is invalid in parent element
'5' and '6' * -  like '3' and '4' but line-breaks, tabs and spaces are left

*lc_std_val*
For XHTML compliance, predefined, standard attribute values, like 'get' for the 'method' attribute of 'form', must be lowercased; see section:- #3.4.5

'0' - no  ^
'1' - yes  *

*make_tag_strict*
Transform/remove these non-strict XHTML elements, even if they are allowed by the admin: 'applet' 'center' 'dir' 'embed' 'font' 'isindex' 'menu' 's' 'strike' 'u'; see section:- #3.3.2

'0' - no  ^
'1' - yes, but leave 'applet', 'embed' and 'isindex' elements that currently can't be transformed  *
'2' - yes, removing 'applet', 'embed' and 'isindex' elements and their contents (nested elements remain)  ~

*named_entity*
Allow non-universal named HTML entities, or convert to numeric ones; see section:- #3.2

'0' - convert
'1' - allow  *

*no_deprecated_attr*
Allow deprecated attributes or transform them; see section:- #3.4.6

'0' - allow  ^
'1' - transform, but 'name' attributes for 'a' and 'map' are retained  *
'2' - transform

*parent*
Name of the parent element, possibly imagined, that will hold the input; see section:- #3.3

*schemes*
Array of attribute-specific, comma-separated, lower-cased list of schemes (protocols) allowed in attributes accepting URLs; '*' covers all unspecified attributes; see section:- #3.4.3

'href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet; *:file, http, https'  *
'*: ftp, gopher, http, https, mailto, news, nntp, telnet'  ^

*unique_ids*
'ID' attribute value checks; see section:- #3.4.2

'0' - no  ^
'1' - remove duplicate and/or invalid ones  *
'word' - remove invalid ones and replace duplicate ones with new and unique ones based on the 'word'; the admin-specified 'word', like 'my_', should begin with a letter (a-z) and can contain letters, digits, '.', '_', '-', and ':'.

*valid_xhtml*
Magic parameter to make input the most valid XHTML without needing to specify other relevant '$config' parameters; see section:- #3.5
      
'0' - no  *
'1' - will auto-adjust other relevant '$config' parameters (indicated by '~' in this list)

*xml:lang*
Auto-adding 'xml:lang' attribute; see section:- #3.4.1

'0' - no  *
'1' - add if 'lang' attribute is present
'2' - add if 'lang' attribute is present, and remove 'lang'  ~
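
As a sketch of how the directives above combine, the following builds a '$config' for filtering blog comments. The particular choice of values and the sample '$text' are assumptions made for illustration; every key and value form is taken from the list above, and 'htmLawed()' is called only if 'htmLawed.php' has been included.

```php
<?php
$config = array(
    'elements'       => 'a, blockquote, code, em, strong', // allow only these elements
    'comment'        => 1,          // remove HTML comments
    'cdata'          => 1,          // remove CDATA sections
    'keep_bad'       => 0,          // remove, rather than neutralize, bad tags
    'anti_mail_spam' => 'NOSPAM@',  // replaces '@' of mail addresses in 'href' values
    'unique_ids'     => 'my_',      // word on which replacements for duplicate ids are based
    'clean_ms_char'  => 1,          // replace discouraged Microsoft characters
);

// Made-up sample input.
$text = '<em>Hello</em> <!-- note --> <a href="mailto:me@example.org">mail</a>';

// Assumes htmLawed.php has been include()d elsewhere.
if (function_exists('htmLawed')) {
    echo htmLawed($text, $config);
}
```

Integer values are left unquoted, as recommended above; parameters that are not listed keep their defaults.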


-- 2.3  Extra HTML specifications using the '$spec' parameter ------o


The '$spec' argument can be used to disallow an otherwise legal attribute for an element, or to restrict the attribute's values. '$spec' is specified as a string of text containing one or more `rules`, with multiple rules separated from each other by a semi-colon (';'). E.g.,

    $spec = 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt';
    $processed = htmLawed($text, $config, $spec);
    
Or,

    $processed = htmLawed($text, $config, 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt');

A rule begins with an HTML *element* name(s) (`rule-element`), for which the rule applies, followed by an equal ('=') sign. A rule-element may represent multiple elements if comma (,)-separated element names are used. E.g., 'th,td,tr='.

The rest of the rule consists of comma-separated HTML *attribute names*. A minus ('-') character before an attribute means that the attribute is not permitted inside the rule-element. E.g., '-width'. To deny all attributes, '-*' can be used.

   The following are examples of rule excerpts with rule-element 'a' and the attributes that are being permitted:

   *  'a=' - all
   *  'a=id' - all
   *  'a=href, title, -id, -onclick' - all except 'id' and 'onclick'
   *  'a=*, id, -id' - all except 'id'
   *  'a=-*' - none
   *  'a=-*, href, title' - none except 'href' and 'title'
   *  'a=-*, -id, href, title' - none except 'href' and 'title'

Rules regarding *attribute values* are optionally specified inside round brackets after attribute names in slash ('/')-separated `parameter = value` pairs. E.g., 'title(maxlen=30/minlen=5)'. None, or one or more of the following parameters may be specified:

*  'oneof' - one or more choices separated by '|' that the value should match; if only one choice is provided, then the value must match that choice

*  'noneof' - one or more choices separated by '|' that the value should not match

*  'maxlen' and 'minlen' - upper and lower limits for the number of characters in the attribute value; specified in numbers

*  'maxval' and 'minval' - upper and lower limits for the numerical value specified in the attribute value; specified in numbers

*  'match' and 'nomatch' - pattern that the attribute value should or should not match; specified as PHP/PCRE-compatible regular expressions with delimiters and possibly modifiers

*  'default' - a value to force on the attribute if the value provided by the writer does not fit any of the specified parameters

If 'default' is not set and the attribute value does not satisfy any of the specified parameters, then the attribute is removed. The 'default' value can also be used to force all attribute declarations to take the same value, by, e.g., setting 'maxlen' to '-1'.

Examples with input '<input title="WIDTH" value="10em" /><input title="length" value="5" />':

   Rule: 'input=title(maxlen=60/minlen=6), value'
   Output: '<input value="10em" /><input title="length" value="5" />'
   
   Rule: 'input=title(), value(maxval=8/default=6)'
   Output: '<input title="WIDTH" value="6" /><input title="length" value="5" />'
   
   Rule: 'input=title(nomatch=$w.d$i), value(match=$em$/default=6em)'
   Output: '<input value="10em" /><input title="length" value="6em" />'
   
   Rule: 'input=title(oneof=height|depth/default=depth), value(noneof=5|6)'
   Output: '<input title="depth" value="10em" /><input title="depth" />'
   
*Special characters*: The characters ';', ',', '/', '(', ')', '|', '~' and space have special meanings in the rules. Words in the rules that use such characters, or the characters themselves, should be flanked by double-quotes ('"'). A back-tick ('`') can be used to escape a literal '"' inside such words. An example rule illustrating this is 'input=value(maxlen=30/match="/^\w/i"/default="your `"ID`"")'.
   
*Note*: To deny an attribute for all elements for which it is legal, '$config["deny_attribute"]' can be used instead of '$spec'.
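
Longer '$spec' strings may be easier to maintain when each rule is kept as a separate array item and the rules are then joined with semi-colons. A minimal sketch: the first three rules are copied from the examples above, while the 'span' rule and its class names 'note' and 'warning' are made up for illustration.

```php
<?php
$rules = array(
    'i=-*',                                   // no attributes for 'i'
    'td, tr=style, id, -*',                   // rule copied from the example above
    'input=title(maxlen=60/minlen=6), value', // limit the length of 'title' values
    'span=class(oneof=note|warning)',         // hypothetical class names
);

// Rules are separated from each other by ';'.
$spec = implode('; ', $rules);

// Assumes htmLawed.php has been include()d and $text and $config are set.
if (function_exists('htmLawed') && isset($text, $config)) {
    $processed = htmLawed($text, $config, $spec);
}
```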


-- 2.4  Performance time and memory usage --------------------------o


As expected, the time and memory used by htmLawed depend on the size of the input and on the number of elements in it. These are also increased by certain '$config' values. In particular, balancing ('$config["balance"] = 1') can increase the processing time by a third or so.

One can use the page for testing:- htmLawedTest.php to evaluate performance and the effects of different types of input and '$config'.
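
Timing can also be measured from one's own code. The following is a rough sketch, not part of htmLawed: 'time_call()' is a hypothetical helper, the sample input is made up, and 'microtime(true)' and 'memory_get_peak_usage()' need PHP 5.2 or later. When 'htmLawed()' is not available, the built-in 'strip_tags()' is timed instead, purely as a placeholder.

```php
<?php
// Hypothetical helper: time one call of $function on $input.
function time_call($function, $input)
{
    $start = microtime(true);
    $result = call_user_func($function, $input);
    return array(
        'result'  => $result,
        'seconds' => microtime(true) - $start,
        'peak_mb' => memory_get_peak_usage() / 1048576, // peak for the whole script so far
    );
}

// Stand-in input; replace with text typical for the site.
$sample = str_repeat('<p>Some <b>bold text</p> ', 1000);

// Uses htmLawed() when it has been include()d, else a built-in as a placeholder.
$function = function_exists('htmLawed') ? 'htmLawed' : 'strip_tags';
$stats = time_call($function, $sample);
printf("%.4f s, peak memory %.2f MB\n", $stats['seconds'], $stats['peak_mb']);
```

A single run is noisy; repeating the call and averaging gives a steadier figure.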


-- 2.5  Some security risks to keep in mind ------------------------o


Depending on how the parameters/arguments (like those allowing certain HTML elements) are set for htmLawed, potentially dangerous code may get through. This may not be a problem if the authors are trusted.

For example, the following increase security risks:

*  Allowing 'script', 'applet', 'embed', 'iframe' or 'object' elements, or certain of their attributes like 'allowscriptaccess'

*  Allowing HTML comments (some Internet Explorer versions are vulnerable with, e.g., '<!--[if gte IE 4]><script>alert("xss");</script><![endif]-->')
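
For untrusted writers, one cautious starting point is to exclude the elements named above and to remove comments. The sketch below uses only directives described in section:- #2.2; whether it is strict enough for a given site is for the admin to judge.

```php
<?php
$risky = array('applet', 'embed', 'iframe', 'object', 'script');

$config = array(
    // Default element set minus the risky ones: '* -applet -embed ...'
    'elements'       => '* -' . implode(' -', $risky),
    'comment'        => 1,  // remove HTML comments
    'css_expression' => 0,  // remove dynamic CSS expressions (the default)
);

// Assumes htmLawed.php has been include()d and $text is set.
if (function_exists('htmLawed') && isset($text)) {
    $processed = htmLawed($text, $config);
}
```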


-- 2.6  Using without modifying old 'kses()' code ------------------o


(The 'WordPress' software uses 'kses' for HTML filtering.)

The 'kses.php' file should not be included as htmLawed also declares the 'kses()' and 'kses_hook()' functions. If a custom 'kses_hook()' function was in use, its definition should be copied into the 'kses_hook()' function in 'htmLawed.php'. With these steps, old code like this will behave exactly as before:

    $comment_filtered = kses($comment_input, array('a'=>array(), 'b'=>array(), 'i'=>array()));

For some of the '$config' parameters, htmLawed will use values other than the default ones. These are indicated by '^' in section:- #2.2. To force htmLawed to use other values, function 'kses()' in 'htmLawed.php' should be edited -- a few configurable parameters/variables need to be changed.


-- 2.7  Tolerance for ill-written HTML -----------------------------o


htmLawed can work with ill-written HTML code in the input. However, ill-written HTML may not be `read` as HTML and may be considered mere plain text by htmLawed. The following indicate the degree of `looseness` that htmLawed can work with, and can be provided in instructions to writers:

*  Tags must be flanked by '<' and '>' with no '>' inside -- any needed '>' should be put in as '&gt;' instead. It is possible for tag content (element name and attributes) to be spread over many lines instead of being on one. A space is possible between the tag content and '>', like '<div >' and '<img / >', but not after the '<'.

*  Element and attribute names need not be in lower-case.

*  Attribute string of elements may be liberally spaced with tabs, line-breaks, etc.

*  Attribute values may be un-quoted or single-quoted.

*  Entities must end with ';'. Left-padding of numeric entities (like, '&#0160;', '&#x07ff;') with '0' is okay as long as the number of characters between the '&' and the ';' does not exceed 8.

*  HTML comments should not be inside element tags (okay between tags), and should begin with '<!--' and end with '-->'. Characters like '<', '>', and '&' may be allowed inside depending on '$config', but any '-->' inside should be put in as '--&gt;'. Any '--' inside will be automatically converted to '-', and a space will be added before the comment delimiter '-->'.

*  'CDATA' sections should not be inside element tags, and can be in element content only if plain text is allowed for that element. They should begin with '<![CDATA[' and end with ']]>'. Characters like '<', '>', and '&' may be allowed inside depending on '$config', but any ']]>' inside should be put in as ']]&gt;'.

*  For attribute values, character entities '&lt;', '&gt;' and '&amp;' should be used instead of characters '<' and '>', and '&' (when '&' is not part of a character entity). This applies even for Javascript code in values of attributes like 'onclick'.

*  Characters '<', '>', '&' and '"' that are part of actual Javascript, etc., code in 'script' elements should be used as such and not be put in as entities like '&gt;'. Otherwise, though the HTML will be valid, the code may fail to work. Further, if such characters have to be used, then they should be put inside 'CDATA' sections.

*  Simple instructions like "an opening tag cannot be present between two closing tags" and "nested elements should be closed in the reverse order of how they were opened" can help authors write balanced HTML. If tags are imbalanced, htmLawed will try to balance them, but in the process, depending on '$config["keep_bad"]', some code may be lost.

*  Input authors should be notified of admin-specified allowed elements, attributes, configuration values (like conversion of named entities to numeric ones), etc.

*  With '$config["unique_ids"]' not '0' and the 'id' attribute being permitted, writers should carefully avoid using duplicate or invalid 'id' values as even though htmLawed will correct/remove the values, the final output may not be the one desired. E.g., '<a id="home"></a><input id="home" /><label for="home"></label>' is processed into '<a id="home"></a><input id="prefix_home" /><label for="home"></label>', in which the 'label' no longer refers to the 'input'.

*  Note that even if intended HTML is lost in a highly ill-written input, the processed output will be more secure and standard-compliant.
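
When input is generated by other code rather than typed by hand, the rule above about entities in attribute values can be met with PHP's 'htmlspecialchars()'. A small sketch; 'attr()' is a hypothetical helper name and the URL is made up.

```php
<?php
// Hypothetical helper: build name="value" with '<', '>', '&' and quotes as entities.
function attr($name, $value)
{
    return $name . '="' . htmlspecialchars($value, ENT_QUOTES) . '"';
}

echo '<a ' . attr('href', 'x.php?a=1&b=2') . ' ' . attr('title', 'a > b') . '>link</a>';
// prints <a href="x.php?a=1&amp;b=2" title="a &gt; b">link</a>
```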


-- 2.8  Limitations & work-arounds ---------------------------------o


htmLawed's main objective is to make the input text more HTML-standard compliant and secure for web visitors, and free of HTML elements and attributes considered undesirable by the administrator. Some of its limitations, regardless of this objective, are noted below. Future versions might address some of them.

*  htmLawed is meant for input that goes into the 'body' of HTML documents. HTML's head-level elements are not supported, nor are the frameset elements 'frameset', 'frame' and 'noframes'.

*  htmLawed doesn't `beautify` HTML code text by formatting it with indentations, etc.

*  It cannot transform the non-standard 'embed' elements to the standard-compliant 'object' elements. Yet, it can allow 'embed' elements if permitted ('embed' is widely used and supported).

*  The only non-standard element that may be permitted is 'embed'; others like 'noembed' and 'nobr' cannot be permitted without modifying the htmLawed code.

*  It cannot handle input that has non-HTML code like 'SVG' and 'MathML'. One way around is to break the input into pieces and pass only those without non-HTML code to htmLawed. Another is to `hide` '<', '>' and '&' of non-HTML code by converting them to invalid XML characters '#x06' to '#x08' before passing the input to htmLawed, and then reverting them post-processing.

*  By default, htmLawed won't check many attribute values for standard compliance. E.g., 'width="20m"' with the dimension in non-standard 'm' is let through. Implementing universal and strict attribute value checks can make htmLawed slow and resource-intensive. Admins can partially implement such features using '$spec'.

*  Except for contained URLs and dynamic expressions (also optional), htmLawed does not check CSS style property values. Again, this keeps htmLawed fast. Admins can partially implement this feature using '$spec'. Perhaps the best option is to disallow 'style' but allow 'class' attributes with the right 'oneof' ('$spec') values for 'class', and have the various class style properties in '.css' CSS stylesheet files.

*  htmLawed does not parse emoticons, `BBcode`-decode, or `wikify`, auto-converting text to proper HTML. Similarly, it won't convert line-breaks to 'br' elements. Such functions are beyond its purview. Admins should use other code to pre- or post-process the input for such purposes.

*  htmLawed cannot be used to have links force-opened in new windows (by auto-adding appropriate 'target' and 'onclick' attributes to 'a'). Admins should look at Javascript-based DOM-modifying solutions for this.

*  Nesting-based checks are not possible. E.g., one cannot disallow 'p' elements specifically inside 'td' while permitting them elsewhere.

*  Except for optionally converting absolute or relative URLs to the other type, htmLawed will not alter URLs (e.g., to change the value of query strings or to convert 'http' to 'https'). Having absolute URLs may be a standard-requirement, e.g., when HTML is embedded in email messages, whereas altering URLs for other purposes is beyond htmLawed's goals.

*  Pairs of opening and closing tags that do not enclose any content (like '<em></em>') are not removed. This may be against the standard specs for certain elements, e.g., in the case of '<table></table>'.
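
The first work-around noted above for non-HTML code, breaking the input into pieces, might be sketched as follows for 'MathML'. This is an illustration under assumptions: 'filter_outside_math()' is a hypothetical helper, the pattern only recognizes un-nested '<math>...</math>' blocks, and the 'MathML' itself is passed through unfiltered, so this is only for writers who are trusted.

```php
<?php
// Hypothetical helper: apply $filter only to the text outside <math>...</math>.
function filter_outside_math($input, $filter)
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes hold ordinary text and
    // odd indexes hold the captured MathML fragments.
    $pieces = preg_split('`(<math\b.*?</math>)`is', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($pieces as $i => $piece) {
        if ($i % 2 == 0) {
            $pieces[$i] = call_user_func($filter, $piece);
        }
    }
    return implode('', $pieces);
}

// Assumes htmLawed.php has been include()d elsewhere.
if (function_exists('htmLawed')) {
    echo filter_outside_math('<p>Before</p><math><mi>m</mi></math><p>After</p>', 'htmLawed');
}
```

Each piece is processed on its own, so an element that opens before a 'MathML' block and closes after it will be seen as unbalanced in both pieces.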


== 3  Details =====================================================oo


-- 3.1  Invalid/dangerous characters --------------------------------


htmLawed removes all null and other HTML-invalid characters ('#x00' to '#x08', '#x0b' to '#x0c', '#x0e' to '#x1f' -- i.e., code-points 0 to 31 except those for the 'carriage return', 'line feed' and 'tab' characters). It (function 'hl_tag()') also removes the useless and potentially dangerous (in some Opera versions on Windows) soft-hyphen character ('#xad' -- code-point '173') from attribute values. Where required, the characters '<', '>', '&', and '"' are converted to entities.

Valid characters in HTML are '#x9', '#xa', '#xd', and '#x20' to '#x10ffff'. Characters that are discouraged (see section:- #5.1) but not invalid are not removed.

However, with '$config["clean_ms_char"]' set as '1' or '2', most of the discouraged characters (code-points 127 to 159) that many Microsoft applications incorrectly use (often as per the 'Windows 1252' encoding system), and the character for code-point '133', are converted to appropriate decimal numerical entities -- see appendix in section:- #5.4. This can help avoid some display issues arising from copying-pasting of content.

With '$config["clean_ms_char"]' set as '2', characters '#x82', '#x91', and '#x92' (for special single-quotes), and '#x84', '#x93', and '#x94' (for special double-quotes) are converted to ordinary single and double quotes respectively and not to entities.

The characters are replaced with entities or ordinary characters, and not with the characters that those entities refer to, so that this task stays independent of the character-encoding of the input text. This parameter need not be used if authors do not copy-paste Microsoft-created text.
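
The ranges of invalid characters listed at the start of this section can be expressed as one regular expression. The sketch below only illustrates those ranges and is not htmLawed's own code; 'strip_invalid_chars()' is a hypothetical name.

```php
<?php
// Illustration only; not htmLawed's own code.
function strip_invalid_chars($text)
{
    // #x00-#x08, #x0b-#x0c and #x0e-#x1f are removed;
    // tab (#x9), line feed (#xa) and carriage return (#xd) stay.
    return preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $text);
}

echo strip_invalid_chars("a\x00b\tc\x1fd\r\n");  // prints "ab", a tab, "cd" and a CR-LF line-break
```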


-- 3.2  Character references/entities ------------------------------o


Valid character entities take the form '&*;' where '*' is '#x' followed by a hexadecimal number (hexadecimal numeric entity; like '&#xA0;' for non-breaking space), or alphanumeric like 'gt' (external or named entity; like '&nbsp;' for non-breaking space), or '#' followed by a number (decimal numeric entity; like '&#160;' for non-breaking space).

htmLawed (function 'hl_ent()'):

*  Neutralizes entities with multiple leading zeroes or missing semi-colons (potentially dangerous)

*  Lowercases the 'X' (XML compliance) and 'A-F' of hexadecimal numeric entities

*  Neutralizes entities referring to characters that are invalid `or` discouraged in HTML

*  Neutralizes named entities that are not in HTML specs.

*  Optionally converts valid HTML-specific named entities except '&gt;', '&lt;', '&quot;', and '&amp;' to decimal numeric ones (but hexadecimal if '$config["hexdec_entity"]' is '2') for generic XML compliance. For this, '$config["named_entity"]' should be '0'.

*  Optionally converts hexadecimal numeric entities to the more widely supported decimal ones. For this, '$config["hexdec_entity"]' should be '0'.

*  Optionally converts decimal numeric entities to the hexadecimal ones. For this, '$config["hexdec_entity"]' should be '2'.
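
To make the hexadecimal-to-decimal conversion concrete, here is a stand-alone sketch of the idea. It is an illustration only and not the code of 'hl_ent()'; both function names are hypothetical.

```php
<?php
// Callback for preg_replace_callback(): $m[1] holds the hexadecimal digits.
function hex_entity_to_dec($m)
{
    return '&#' . hexdec($m[1]) . ';';
}

// Illustration only; not htmLawed's hl_ent().
function hex_to_dec_entities($text)
{
    return preg_replace_callback('/&#x([0-9a-f]+);/i', 'hex_entity_to_dec', $text);
}

echo hex_to_dec_entities('a&#xA0;b&#x3c;');  // prints a&#160;b&#60;
```

With htmLawed itself, the same effect is requested by setting '$config["hexdec_entity"]' to '0'.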


-- 3.3  HTML elements ----------------------------------------------o


htmLawed can be configured to allow only certain HTML elements (tags) in the input. Un-permitted elements (just tag-content, and not element-content), based on '$config["keep_bad"]', are either `neutralized` (converted to plain text by entitification of '<' and '>') or removed.

E.g., with only 'em' permitted:

  Input:

      <em>My</em> website is <a href="http://a.com">a.com</a>.

  Output, with '$config["keep_bad"] = 0':

      <em>My</em> website is a.com.

  Output, with '$config["keep_bad"]' not '0':

      <em>My</em> website is &lt;a href="http://a.com"&gt;a.com&lt;/a&gt;.

See section:- #3.3.3 for differences between the various non-zero '$config["keep_bad"]' values.

htmLawed by default permits these 86 elements:

    a, abbr, acronym, address, applet, area, b, bdo, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, embed, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, iframe, img, input, ins, isindex, kbd, label, legend, li, map, menu, noscript, object, ol, optgroup, option, p, param, pre, q, rb, rbc, rp, rt, rtc, ruby, s, samp, script, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var

Except for 'embed' (included because of its wide-spread use) and the Ruby elements ('rb', 'rbc', 'rp', 'rt', 'rtc', 'ruby'; part of XHTML 1.1), these are all the elements in HTML 4/XHTML 1 specs. Strict-specific specs. exclude 'center', 'dir', 'font', 'isindex', 'menu', 's', 'strike', and 'u'.

When '$config["elements"]', which specifies allowed elements, is defined and not empty or set to '0', the default set is not used, unless one uses the '*' wild-card to use the default set as such, or add more elements to or remove any from it.

Example '$config["elements"]' values:

*  'a, blockquote, code, em, strong' - only 'a', 'blockquote', 'code', 'em', and 'strong'
*  '*' - all (the default set)
*  '-*' - none
*  '*-script' - the default set excluding 'script'
*  '* -center -dir -font -isindex -menu -s -strike -u' - only XHTML-Strict elements
*  '*+noembed-script' - 'noembed' and the default set excluding 'script'

*Note*: Even if an element that is not in the default set is allowed through '$config["elements"]', like 'noembed' in the last example, it will eventually be removed during tag balancing unless such balancing is turned off ('$config["balance"]' set to '0'). Currently, the only way around this, which actually is simple, is to edit the various arrays in the function 'hl_bal()' to accommodate the element and its nesting properties.
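
The way the example '$config["elements"]' values above read can be sketched as follows. This is only an illustration under stated assumptions: 'expand_elements()' is a hypothetical helper, htmLawed's own parsing may differ in detail, and a shortened default set is used in place of the 86 elements.

```php
<?php
// Sketch only: expand an 'elements' string into a list of allowed names.
// expand_elements() is a hypothetical helper, not a htmLawed function.
function expand_elements($spec, array $default)
{
    $spec = preg_replace('`\s+`', '', strtolower($spec));
    if ($spec === '' || $spec === '*') {
        return $default;                       // the default set as such
    }
    if (strpos($spec, '*') === false) {
        // Plain comma-separated list: the default set is not used.
        return array_values(array_filter(explode(',', $spec), 'strlen'));
    }
    // Wild-card form: start from the default set (or from nothing for
    // '-*'), then apply each '+name' or '-name' term in order.
    $set = (strpos($spec, '-*') === 0) ? array() : $default;
    preg_match_all('`([+-])([a-z0-9]+)`', $spec, $terms, PREG_SET_ORDER);
    foreach ($terms as $t) {
        if ($t[1] === '+') {
            $set[] = $t[2];
        } else {
            $set = array_diff($set, array($t[2]));
        }
    }
    return array_values(array_unique($set));
}

// Shortened, illustrative default set (the real one has 86 elements).
$default = array('a', 'em', 'script', 'strong');
print_r(expand_elements('*+noembed-script', $default));
```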

*A possible second way to specify allowed elements* is to set '$config["parent"]' to an element name that supposedly will hold the input, and to set '$config["balance"]' to '1'. During tag balancing (see section:- #3.3.3), all elements that cannot legally nest inside the parent element will be removed. The parent element is auto-reset to 'div' if '$config["parent"]' is empty, 'body', or an element not in htmLawed's default set of 86 elements.

`Tag transformation` is possible for improving XHTML-Strict compliance -- most of the deprecated elements are removed or converted to valid XHTML-Strict ones; see section:- #3.3.2.


.. 3.3.1  Handling of comments and CDATA sections ...................


'CDATA' sections have the format '<![CDATA[...anything but not "]]>"...]]>', and HTML comments, '<!--...anything but not "-->"... -->'. Neither HTML comments nor 'CDATA' sections can reside inside tags. HTML comments can exist anywhere else, but 'CDATA' sections can exist only where plain text is allowed (e.g., immediately inside 'td' element content but not immediately inside 'tr' element content).

htmLawed (function 'hl_cmtcd()') handles HTML comments or 'CDATA' sections depending on the values of '$config["comment"]' or '$config["cdata"]'. If '0', such markup is not looked for and the text is processed like plain text. If '1', it is removed completely. If '2', it is preserved but any '<', '>' and '&' inside are changed to entities. If '3', they are left as such.

Note that for the last two cases, HTML comments and 'CDATA' sections will always be removed from tag content (function 'hl_tag()').

Examples:

  Input:
    <!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>
  Output ('$config["comment"] = 0, $config["cdata"] = 2'):
    &lt;!-- home link --&gt;<a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
  Output ('$config["comment"] = 1, $config["cdata"] = 2'):
    <a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
  Output ('$config["comment"] = 2, $config["cdata"] = 2'):
    <!-- home link --><a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
  Output ('$config["comment"] = 2, $config["cdata"] = 1'):
    <!-- home link --><a href="home.htm">Home</a>
  Output ('$config["comment"] = 3, $config["cdata"] = 3'):
    <!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>

For standard-compliance, comments are given the form '<!--comment -->', and any '--' in the content is made '-'.
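
A call using these two parameters might look like the sketch below. It assumes that 'htmLawed.php' sits in the same directory as the calling script; the call is skipped if the file is not found.

```php
<?php
// Sketch only: remove HTML comments, keep CDATA sections but turn any
// '<', '>' and '&' inside them into entities.
$config = array('comment' => 1, 'cdata' => 2);
$input  = '<!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>';

// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';
if (is_file($lib)) {
    require_once $lib;
    echo htmLawed($input, $config), "\n";
}
```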


.. 3.3.2  Tag-transformation for better XHTML-Strict ................o


If '$config["make_tag_strict"]' is set and not '0', the following non-XHTML-Strict elements (and attributes), even if admin-permitted, are mutated as indicated (element content remains intact; function 'hl_tag2()'):

*  applet - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
*  center - 'div style="text-align: center;"'
*  dir - 'ul'
*  embed - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
*  font (face, size, color) - 'span style="font-family: ; font-size: ; color: ;"' (size transformation reference:- http://style.cleverchimp.com/font_size_intervals/altintervals.html)
*  isindex - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
*  menu - 'ul'
*  s - 'span style="text-decoration: line-through;"'
*  strike - 'span style="text-decoration: line-through;"'
*  u - 'span style="text-decoration: underline;"'

For an element with a pre-existing 'style' attribute value, the extra style properties are appended.

Example input:

    <center>
     The PHP <s>software</s> script used for this <strike>web-page</strike> webpage is <font style="font-weight: bold " face=arial size='+3' color   =  "red  ">htmLawedTest.php</font>, from <u style= 'color:green'>PHP Labware</u>.
    </center>

Output:

    <div style="text-align: center;">
     The PHP <span style="text-decoration: line-through;">software</span> script used for this <span style="text-decoration: line-through;">web-page</span> webpage is <span style="font-weight: bold; font-family: arial; color: red; font-size: 200%;">htmLawedTest.php</span>, from <span style="color:green; text-decoration: underline;">PHP Labware</span>.
    </div>
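
The simpler mappings in the list above can be sketched as a table-driven replacement. This is a rough illustration only: 'strictify_simple_tags()' is a hypothetical helper that handles attribute-less 'center', 's', 'strike' and 'u' tags, whereas 'hl_tag2()' also deals with 'font', existing 'style' values and the other elements.

```php
<?php
// Sketch only: map a few deprecated elements to XHTML-Strict replacements.
// strictify_simple_tags() is hypothetical; it ignores tags with attributes.
function strictify_simple_tags($html)
{
    $map = array(
        'center' => array('div', 'text-align: center;'),
        's'      => array('span', 'text-decoration: line-through;'),
        'strike' => array('span', 'text-decoration: line-through;'),
        'u'      => array('span', 'text-decoration: underline;'),
    );
    return preg_replace_callback(
        '`<(/?)(center|strike|s|u)>`i',
        function ($m) use ($map) {
            list($new, $style) = $map[strtolower($m[2])];
            // Closing tags only need the new name; opening tags get a style.
            return $m[1] === '/' ? "</$new>" : "<$new style=\"$style\">";
        },
        $html
    );
}

echo strictify_simple_tags('<center>A <s>b</s> and <u>c</u></center>'), "\n";
```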


.. 3.3.3  Tag balancing and proper nesting .........................o


If '$config["balance"]' is set to '1', htmLawed (function 'hl_bal()') will check and correct the input to have properly balanced tags and legal element content (i.e., any element nesting should be valid, and plain text may be present only in the content of elements that allow them).

Depending on the value of '$config["keep_bad"]' (see section:- #2.2 and section:- #3.3), illegal content may be removed or neutralized to plain text:

'0' - remove; this option is available only to maintain Kses-compatibility (see section:- #2.6)
'1' - neutralize both tags and element content
'2' - remove tags but neutralize element content
'3' and '4' - like '1' and '2' but remove if text ('pcdata') is invalid in parent element
'5' and '6' -  like '3' and '4' but line-breaks, tabs and spaces are left

An option like '1' is useful, e.g., when a writer previews his submission, whereas one like '5' is useful before content is finalized and made available to all.

Nesting/content rules for each of the 86 elements in htmLawed's default set (see section:- #3.3) are defined in function 'hl_bal()'. This means that if a non-standard element besides 'embed' is being permitted through '$config["elements"]', the element's tag content will end up getting removed if '$config["balance"]' is set to '1'.

Text and certain elements nested inside 'form' and 'map' need to be in block-level elements. This point is often missed during manual writing of HTML code. htmLawed attempts to address this during balancing. E.g., if the parent container is set as 'form', the input 'B:<input type="text" value="b" />C:<input type="text" value="c" />' is converted to '<div>B:<input type="text" value="b" />C:<input type="text" value="c" /></div>'.
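
A configuration for the 'form' case just described might be sketched as below. The choice of '6' for 'keep_bad' is only an example, and 'htmLawed.php' is assumed to be in the same directory as the calling script; the call is skipped if the file is not found.

```php
<?php
// Sketch only: balance tags for input that will be placed inside a 'form'.
$config = array(
    'balance'  => 1,       // check nesting and balance tags
    'parent'   => 'form',  // element that will hold the input
    'keep_bad' => 6,       // example choice; see the list of values above
);
$input = 'B:<input type="text" value="b" />C:<input type="text" value="c" />';

// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';
if (is_file($lib)) {
    require_once $lib;
    echo htmLawed($input, $config), "\n";
}
```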


-- 3.4  Attributes ------------------------------------------------oo


htmLawed will only permit attributes in the HTML specs (including deprecated ones). It also permits some attributes for use with the 'embed' element (the non-standard 'embed' element is supported in htmLawed because of its widespread use), and the 'xml:space' attribute (valid only in XHTML 1.1). A list of these 111 attributes and the elements they are allowed in is in section:- #5.2.

When '$config["deny_attribute"]' is not set, or set to '0', or empty ('""'), all the 111 attributes are permitted. Otherwise, '$config["deny_attribute"]' can be set as a list of comma-separated names of the denied attributes. 'on*' can be used to refer to the group of potentially dangerous, script-accepting attributes: 'onblur', 'onchange', 'onclick', 'ondblclick', 'onfocus', 'onkeydown', 'onkeypress', 'onkeyup', 'onmousedown', 'onmousemove', 'onmouseout', 'onmouseover', 'onmouseup', 'onreset', 'onselect' and 'onsubmit'.

htmLawed (function 'hl_tag()') also:

*  Lower-cases attribute names
*  Removes duplicate attributes (last one stays)
*  Gives attributes the form 'name="value"' and single-spaces them, removing unnecessary white space characters
*  Provides `required` attributes (see section:- #3.4.1)
*  Double-quotes values and escapes any '"' inside
*  Removes unnecessary white-spaces and possibly dangerous soft-hyphens ('#xad') from the values
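
A call that denies the script-accepting attributes and 'style' might be sketched as below. The input string is made up for illustration, and 'htmLawed.php' is assumed to be in the same directory as the calling script; the call is skipped if the file is not found.

```php
<?php
// Sketch only: deny all 'on*' attributes and the 'style' attribute.
$config = array('deny_attribute' => 'on*, style');
$input  = '<a HREF=x.htm onclick="go();" style="color:red" title = \'Go\'>x</a>';

// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';
if (is_file($lib)) {
    require_once $lib;
    echo htmLawed($input, $config), "\n";
}
```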


.. 3.4.1  Auto-addition of XHTML-required attributes ................


If the indicated attributes for the following elements are missing, htmLawed (function 'hl_tag()') will add them (with values same as attribute names unless indicated otherwise below):

*  area - alt ('area')
*  area, img - src, alt ('image')
*  bdo - dir ('ltr')
*  form - action
*  map - name
*  optgroup - label
*  param - name
*  script - type ('text/javascript')
*  textarea - rows ('10'), cols ('50')

Additionally, with '$config["xml:lang"]' set to '1' or '2', if the 'lang' but not the 'xml:lang' attribute is declared, then the latter is added too, with a value copied from that of 'lang'. This is for better standard-compliance. With '$config["xml:lang"]' set to '2', the 'lang' attribute is removed (XHTML 1.1 specs.).

Note that the 'name' attribute for 'map', invalid in XHTML 1.1, is also transformed if required -- see section:- #3.4.6.


.. 3.4.2  Duplicate/invalid 'ID' values ............................o


If '$config["unique_ids"]' is '1', htmLawed (function 'hl_tag()') removes 'ID' attributes with values that are not XHTML-compliant (must begin with a letter and can contain letters, digits, ':', '.', '-' and '_') or duplicated. If '$config["unique_ids"]' is a word, any duplicate but otherwise valid value will be appropriately prefixed with the word to ensure its uniqueness. The word should begin with a letter and should contain only letters, numbers, ':', '.', '_' and '-'.

Even if multiple inputs need to be filtered (through multiple calls to htmLawed), htmLawed ensures uniqueness of 'ID' values as it uses a 'GLOBAL' variable ('$GLOBALS["hl_Ids"]' array). Further, an admin can restrict the use of certain 'ID' values by presetting the variable with them before htmLawed is called into use. E.g.:

    $GLOBALS['hl_Ids'] = array('top'=>1, 'bottom'=>1, 'myform'=>1); // IDs not allowed in input
    $processed = htmLawed($text); // filter input
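
The 'ID' rules described above can be sketched roughly as follows. The helper names 'is_valid_id()' and 'claim_id()' are hypothetical, and the '$seen' array only stands in for the '$GLOBALS["hl_Ids"]' array; the actual logic in 'hl_tag()' may differ.

```php
<?php
// Sketch only: the ID rules described above, with hypothetical helpers.
// $seen plays the role of the $GLOBALS["hl_Ids"] array.
function is_valid_id($value)
{
    // Must begin with a letter; then letters, digits, ':', '.', '-', '_'.
    return preg_match('`^[a-zA-Z][a-zA-Z0-9:._-]*$`D', $value) === 1;
}

function claim_id($value, array &$seen, $prefix = '')
{
    if (!is_valid_id($value)) {
        return null;                  // invalid value: attribute is dropped
    }
    if (isset($seen[$value]) && $prefix === '') {
        return null;                  // duplicate and no prefix word: dropped
    }
    while (isset($seen[$value])) {
        $value = $prefix . $value;    // duplicate: prefix until unique
    }
    $seen[$value] = 1;
    return $value;
}

$seen = array('top' => 1);            // 'top' is reserved by the admin
var_dump(claim_id('top', $seen, 'my_'));
```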


.. 3.4.3  URL schemes (protocols) and scripts in attribute values ............o


htmLawed edits attributes that take URLs as values if they are found to contain unpermitted schemes. This is really a security feature to prevent threats based on Javascript, etc.

E.g., if the 'afp' scheme is not permitted, then '<a href="afp://domain.org">' becomes '<a href="denied:afp://domain.org">', and if Javascript is not permitted '<a onclick="javascript:xss();">' becomes '<a onclick="denied:javascript:xss();">'.

By default htmLawed permits these schemes in URLs for the 'href' attribute:

    aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet

Also, by default, only 'file', 'http' and 'https' are permitted in attributes whose names start with 'o' (like 'onmouseover'), and in these attributes that accept URLs:

    action, cite, classid, codebase, data, href, longdesc, model, pluginspage, pluginurl, src, style, usemap

These default sets are used when '$config["schemes"]' is not defined in htmLawed argument values (see section:- #2.2). Else, '$config["schemes"]' is defined as a string of semi-colon-separated sub-strings of type 'attribute: comma-separated schemes'. E.g., 'href: mailto, http, https; onclick: javascript; src: http, https'. For unspecified attributes, 'file', 'http' and 'https' are permitted. This can be changed by passing schemes for '*' in '$config["schemes"]'. E.g., 'href: mailto, http, https; *: http, https'.

'*' can be put in a list of schemes to indicate that all protocols are allowed. E.g., 'style: *; src: http, https' results in protocols not being checked in 'style' attribute values. However, in such cases, any relative-to-absolute URL conversion, or vice versa, (section:- #3.4.4) will not be done.

As a side-note, one may find 'style: *' useful as URLs in 'style' attributes can be specified in a variety of ways, and the patterns that htmLawed uses to identify URLs may mistakenly identify non-URL text.

*Note*: If URL-accepting attributes other than those listed above are being allowed, then the scheme will not be checked unless the attribute has the string 'src' in it (e.g., the non-standard 'dynsrc' attribute) or its name starts with 'o'.
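
The format of the '$config["schemes"]' string can be illustrated with a small parser. This is a sketch under stated assumptions: 'parse_schemes()' is a hypothetical helper, not htmLawed's own code, and it only shows how the 'attribute: schemes' sub-strings and the '*' fallback fit together.

```php
<?php
// Sketch only: turn a 'schemes' string into an attribute => schemes map.
// parse_schemes() is a hypothetical helper, not a htmLawed function.
function parse_schemes($spec)
{
    $out = array();
    foreach (explode(';', $spec) as $part) {
        $pair = explode(':', $part, 2);        // 'attribute: a, b, c'
        if (count($pair) !== 2) {
            continue;
        }
        $attr    = strtolower(trim($pair[0]));
        $schemes = array_filter(array_map('trim', explode(',', $pair[1])), 'strlen');
        if ($attr !== '' && $schemes) {
            $out[$attr] = array_values($schemes);
        }
    }
    if (!isset($out['*'])) {
        // Fallback for unspecified attributes, as stated above.
        $out['*'] = array('file', 'http', 'https');
    }
    return $out;
}

print_r(parse_schemes('href: mailto, http, https; onclick: javascript; src: http, https'));
```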


.. 3.4.4  Absolute & relative URLs in attribute values .............o


htmLawed can make absolute URLs in attributes like 'href' relative ('$config["abs_url"]' is '-1'), and vice versa ('$config["abs_url"]' is '1'). URLs in scripts are not considered for this, nor are URLs like '#section_6' (fragment), '?name=Tim#show' (starting with query string) and ';var=1?name=Tim#show' (starting with parameters). Further, this requires that '$config["base_url"]' be set properly, with the '://' and a trailing slash ('/'), with no query string, etc. E.g., 'file:///D:/page/', 'https://abc.com/x/y/' or 'http://localhost/demo/' are okay, but 'file:///D:/page/?help=1', 'abc.com/x/y/' and 'http://localhost/demo/index.htm' are not.

For making absolute URLs relative, only those URLs that have the '$config["base_url"]' at the beginning are converted. E.g., with '$config["base_url"] = "https://abc.com/x/y/"', 'https://abc.com/x/y/a.gif' and 'https://abc.com/x/y/z/b.gif' become 'a.gif' and 'z/b.gif' respectively, while 'https://abc.com/x/c.gif' is not changed.

When making relative URLs absolute, only the scheme, network location (hostname) and path values in the base URL are inherited. See section:- #5.5 for more about the URL specification as per RFC 1808:- http://www.ietf.org/rfc/rfc1808.txt.
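
The absolute-to-relative rule above (only URLs that begin with the base URL are converted) can be sketched as a prefix check. 'make_relative()' is a hypothetical helper for illustration and not htmLawed's own code.

```php
<?php
// Sketch only: make an absolute URL relative when it starts with the base.
// make_relative() is a hypothetical helper, not a htmLawed function.
function make_relative($url, $base)
{
    if ($base !== '' && strpos($url, $base) === 0) {
        return (string) substr($url, strlen($base));
    }
    return $url;                      // not under the base URL: unchanged
}

$base = 'https://abc.com/x/y/';
echo make_relative('https://abc.com/x/y/a.gif', $base), "\n";
echo make_relative('https://abc.com/x/y/z/b.gif', $base), "\n";
echo make_relative('https://abc.com/x/c.gif', $base), "\n";
```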


.. 3.4.5  Lower-cased, standard attribute values ....................o


Optionally, for standard-compliance, htmLawed (function 'hl_tag()') lower-cases standard attribute values to give, e.g., 'input type="password"' instead of 'input type="Password"', if '$config["lc_std_val"]' is '1'. Attribute values matching those listed below for any of the elements (plus those for the 'type' attribute of 'button' or 'input') are lower-cased:

    all, baseline, bottom, button, center, char, checkbox, circle, col, colgroup, cols, data, default, file, get, groups, hidden, image, justify, left, ltr, middle, none, object, password, poly, post, preserve, radio, rect, ref, reset, right, row, rowgroup, rows, rtl, submit, text, top

    a, area, bdo, button, col, form, img, input, object, option, optgroup, param, script, select, table, td, tfoot, th, thead, tr, xml:space

Note that these `empty` (`minimized`) attributes are always assigned lower-cased values (same as the names):

    checked, compact, declare, defer, disabled, ismap, multiple, nohref, noresize, noshade, nowrap, readonly, selected


.. 3.4.6  Transformation of deprecated attributes ..................o


If '$config["no_deprecated_attr"]' is not '0' (i.e., is '1' or '2'), then deprecated attributes (see appendix in section:- #5.2) are removed and, in most cases, their values are transformed to CSS style properties and added to the 'style' attributes (function 'hl_tag()').

*Note*: The attribute 'target' for 'a' is allowed even though it is not in XHTML 1.0 specs. This is because of the attribute's wide-spread use and browser-support, and because the attribute is valid in XHTML 1.1 onwards.

*  align - for 'img' with value of 'left' or 'right', becomes, e.g., 'float: left'; for 'div' and 'table' with value 'center', becomes 'margin: auto'; all others become, e.g., 'text-align: right'

*  bgcolor - E.g., 'bgcolor="#ffffff"' becomes 'background-color: #ffffff'
*  border - E.g., 'border="10"' becomes 'border: 10px'
*  compact - 'font-size: 85%'
*  clear - E.g., 'clear="all"' becomes 'clear: both'

*  height - E.g., 'height= "10"' becomes 'height: 10px' and 'height="*"' becomes 'height: auto'

*  hspace - E.g., 'hspace="10"' becomes 'margin-left: 10px; margin-right: 10px'
*  language - 'language="VBScript"' becomes 'type="text/vbscript"'
*  name - E.g., 'name="xx"' becomes 'id="xx"'
*  noshade - 'border-style: none; border: 0; background-color: gray; color: gray'
*  nowrap - 'white-space: nowrap'
*  size - E.g., 'size="10"' becomes 'height: 10px'
*  start - removed
*  type - E.g., 'type="i"' becomes 'list-style-type: lower-roman'
*  value - removed
*  vspace - E.g., 'vspace="10"' becomes 'margin-top: 10px; margin-bottom: 10px'
*  width - like 'height'

Example input:

    <img src="j.gif" alt="image" name="dad's" /><img src="k.gif" alt="image" id="dad_off" name="dad" />
    <br clear="left" />
    <hr noshade size="1" />
    <img name="img" src="i.gif" align="left" alt="image" hspace="10" vspace="10" width="10em" height="20" border="1" style="padding:5px;" />
    <table width="50em" align="center" bgcolor="red">
     <tr>
      <td width="20%">
       <div align="center">
        <h3 align="right">Section</h3>
        <p align="right">Para</p>
        <ol type="a" start="e"><li value="x">First item</li></ol>
       </div>
      </td>
      <td width="*">
       <ol type="1"><li>First item</li></ol>
      </td>
     </tr>
    </table>
    <br clear="all" />

Output with '$config["no_deprecated_attr"] = 1':

    <img src="j.gif" alt="image" /><img src="k.gif" alt="image" id="dad_off" />
    <br style="clear: left;" />
    <hr style="border-style: none; border: 0; background-color: gray; color: gray; size: 1px;" />
    <img src="i.gif" alt="image" width="10em" height="20" style="padding:5px; float: left; margin-left: 10px; margin-right: 10px; margin-top: 10px; margin-bottom: 10px; border: 1px;" id="img" />
    <table width="50em" style="margin: auto; background-color: red;">
     <tr>
      <td style="width: 20%;">
       <div style="margin: auto;">
        <h3 style="text-align: right;">Section</h3>
        <p style="text-align: right;">Para</p>
        <ol style="list-style-type: lower-latin;"><li>First item</li></ol>
       </div>
      </td>
      <td style="width: auto;">
       <ol style="list-style-type: decimal;"><li>First item</li></ol>
      </td>
     </tr>
    </table>
    <br style="clear: both;" />

For 'lang', deprecated in XHTML 1.1, transformation is taken care of through '$config["xml:lang"]' -- see section:- #3.4.1.

Attribute 'name' is deprecated in 'form', 'iframe', and 'img', and is replaced with 'id' if an 'ID' attribute doesn't exist and if the 'name' value is appropriate for an 'ID'. For such replacements for 'a' and 'map', for which the 'name' attribute is deprecated in XHTML 1.1, '$config["no_deprecated_attr"]' should be set to '2' (when set to '1', for these two elements, the 'name' attribute is retained).


.. 3.4.7  Anti-spam & 'href' .......................................o


htmLawed (function 'hl_tag()') can optionally check the 'href' attribute values (link addresses) as an anti-spam (email or link spam) measure.

If '$config["anti_mail_spam"]' is not '0', the '@' of email addresses in 'href' values like 'mailto:a@b.com' will be replaced with text specified by '$config["anti_mail_spam"]'. The text should be of a form that makes it clear to others that the address needs to be edited before a mail is sent. E.g., '<remove_this_antispam>@' (makes the example address 'a<remove_this_antispam>@b.com').
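
The effect of '$config["anti_mail_spam"]' on an 'href' value can be sketched as a plain string replacement. 'obscure_mailto()' is a hypothetical helper for illustration; htmLawed's own matching of email addresses may be stricter.

```php
<?php
// Sketch only: replace the '@' in 'mailto:' links with an anti-spam text.
// obscure_mailto() is a hypothetical helper, not a htmLawed function.
function obscure_mailto($href, $replacement)
{
    if (stripos($href, 'mailto:') !== 0) {
        return $href;                 // not an email link: unchanged
    }
    return str_replace('@', $replacement, $href);
}

echo obscure_mailto('mailto:a@b.com', '<remove_this_antispam>@'), "\n";
```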

For regular links, one can choose to have a 'rel' attribute with 'nofollow' in its value (which tells some search engines to not follow a link). This can discourage link spammers. Additionally, or as an alternative, one can choose to empty the 'href' value altogether (disable the link).

For use of these options, '$config["anti_link_spam"]' should be set as an array with values 'regex1' and 'regex2', both or one of which can be empty (like 'array("", "regex2")') to indicate that that option is not to be used. Otherwise, 'regex1' or 'regex2' should be PHP- and PCRE-compatible regular expression patterns: 'href' values will be matched against them and those matching the pattern will accordingly be treated.

Note that the regular expressions should have `delimiters`, and be well-formed and preferably fast. Absolute efficiency/accuracy is often not needed.

As an example, to have a 'rel' attribute with 'nofollow' for all links, and to disable links that do not point to domains 'abc.com' and 'xyz.org':

    $config["anti_link_spam"] = array('`.`', '`://\W*(?!(abc\.com|xyz\.org))`');


.. 3.4.8  Inline style properties ..................................o


htmLawed can check URL schemes and dynamic expressions (to guard against Javascript, etc., script-based insecurities) in inline CSS style property values in the 'style' attributes. The CSS (2.1) properties like 'background-image' that accept URLs in their values are noted in section:- #5.3. Dynamic CSS expressions that allow scripting in the Internet Explorer browser, and can be a vulnerability, can be removed from property values by setting '$config["css_expression"]' to '1'.


-- 3.5  Simple configuration directive for most valid XHTML -------oo


If '$config["valid_xhtml"]' is set to '1', some relevant '$config' parameters (indicated by '~' in section:- #2.2) are auto-adjusted. This allows one to pass the '$config' argument with a simpler value. If a value for a parameter auto-set through 'valid_xhtml' is still manually provided, then that value will over-ride the auto-set value. E.g., for the 'unique_ids' parameter.
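
A call relying on 'valid_xhtml' while still overriding one auto-set parameter might be sketched as below. The prefix word 'my_' and the input are made up for illustration, and 'htmLawed.php' is assumed to be in the same directory as the calling script; the call is skipped if the file is not found.

```php
<?php
// Sketch only: let 'valid_xhtml' auto-set the relevant parameters, but
// provide 'unique_ids' manually so that it over-rides the auto-set value.
$config = array(
    'valid_xhtml' => 1,
    'unique_ids'  => 'my_',   // example prefix word for duplicate IDs
);
$input = '<p id="top">One</p><p id="top">Two</p>';

// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';
if (is_file($lib)) {
    require_once $lib;
    echo htmLawed($input, $config), "\n";
}
```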


-- 3.6  Using a hook function --------------------------------------o


If '$config["hook"]' is not set to '0', then htmLawed will allow preliminarily processed input to be altered by a hook function (name set in '$config["hook"]') before starting the main work (but after handling of characters, entities, HTML comments and 'CDATA' sections -- see code for function 'htmLawed()').

The hook function also allows one to alter the `finalized` values of '$config' and '$spec'.
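
A hook function might be sketched as below. The function name 'my_hook' and its argument list are assumptions made for illustration (the text as the first argument, with the altered text returned); the code for function 'htmLawed()' should be checked for what is actually passed to the hook.

```php
<?php
// Sketch only: a hypothetical hook that collapses runs of blank lines.
// The argument list is an assumption; see the code of htmLawed().
function my_hook($text, $config = array(), $spec = array())
{
    return preg_replace("`\n{3,}`", "\n\n", $text);
}

// The hook is named in the configuration passed to htmLawed().
$config = array('hook' => 'my_hook');

echo my_hook("First\n\n\n\nSecond"), "\n";
```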


== 4  Other =======================================================oo


-- 4.1  Support -----------------------------------------------------


A careful reading of this documentation may answer many questions.

Software updates and forum-based community-support may be found at http://www.bioinformatics.org/phplabware/internal_utilities. For general PHP issues (not htmLawed-specific), support may be found through internet searches and at http://php.net.


-- 4.2  Known issues -----------------------------------------------o


See section:- #2.8.


-- 4.3  Change-log -------------------------------------------------o


v1.0 - released 2 November 2007


-- 4.4  Testing ----------------------------------------------------o


To test htmLawed using a form-based interface, a demo:- htmLawedTest.php web-page is provided ('htmLawed.php' and 'htmLawedTest.php' should be in the same directory on the web-server). Input can be typed in or copy-pasted. A file with test-cases:- htmLawed_TESTCASE.txt is provided with the htmLawed distribution.


-- 4.5  Upgrade ----------------------------------------------------o


Upgrading is as simple as replacing the previous version of 'htmLawed.php' (assuming it was not modified for customized features). As htmLawed output is almost always used in static documents, upgrading should not affect old, finalized content.


-- 4.6  Donate -----------------------------------------------------o


A donation in any currency and amount to appreciate or support this software can be sent by PayPal:- http://paypal.com to this email address: drpatnaik at yahoo dot com.

Thank you!


== 5  Appendices ==================================================oo


-- 5.1  Characters discouraged in XHTML -----------------------------


These are not invalid, even though some validators may issue messages stating otherwise.

#x7f to #x84, #x86 to #x9f, #xfdd0 to #xfddf, #x1fffe to #x1ffff, #x2fffe to #x2ffff, #x3fffe to #x3ffff, #x4fffe to #x4ffff, #x5fffe to #x5ffff, #x6fffe to #x6ffff, #x7fffe to #x7ffff, #x8fffe to #x8ffff, #x9fffe to #x9ffff, #xafffe to #xaffff, #xbfffe to #xbffff, #xcfffe to #xcffff, #xdfffe to #xdffff, #xefffe to #xeffff, #xffffe to #xfffff, #x10fffe to #x10ffff
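
The ranges above can be expressed as a small check on a code point. 'is_discouraged()' is a hypothetical helper written only to restate the list; it is not taken from htmLawed.

```php
<?php
// Sketch only: test whether a code point is in the list given above.
// is_discouraged() is a hypothetical helper, not a htmLawed function.
function is_discouraged($cp)
{
    if (($cp >= 0x7f && $cp <= 0x84) || ($cp >= 0x86 && $cp <= 0x9f)) {
        return true;
    }
    if ($cp >= 0xfdd0 && $cp <= 0xfddf) {
        return true;
    }
    // The last two code points of each plane from #x1fffe to #x10ffff.
    return $cp >= 0x1fffe && $cp <= 0x10ffff && ($cp & 0xffff) >= 0xfffe;
}

var_dump(is_discouraged(0x85), is_discouraged(0x2ffff));
```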


-- 5.2  Valid attribute-element combinations -----------------------o


Valid attribute-element combinations as per W3C specs:

*  includes deprecated attributes (marked '^'), and attributes for the non-standard 'embed' element (marked '*')
*  only non-frameset, HTML body elements
*  'name' for 'a' and 'map', and 'lang' are invalid in XHTML 1.1
*  'target' is valid for 'a' in XHTML 1.1 and higher
*  'xml:space' is only for XHTML 1.1

abbr - td, th
accept - form, input
accept-charset - form
accesskey - a, area, button, input, label, legend, textarea
action - form
align - caption^, embed, applet, iframe, img^, input^, object^, legend^, table^, hr^, div^, h1^, h2^, h3^, h4^, h5^, h6^, p^, col, colgroup, tbody, td, tfoot, th, thead, tr
alt - applet, area, img, input
archive - applet, object
axis - td, th
bgcolor - embed, table^, tr^, td^, th^
border - table, img^, object^
cellpadding - table
cellspacing - table
char - col, colgroup, tbody, td, tfoot, th, thead, tr
charoff - col, colgroup, tbody, td, tfoot, th, thead, tr
charset - a, script
checked - input
cite - blockquote, q, del, ins
classid - object
clear - br^
code - applet
codebase - object, applet
codetype - object
color - font
cols - textarea
colspan - td, th
compact - dir, dl^, menu, ol^, ul^
coords - area, a
data - object
datetime - del, ins
declare - object
defer - script
dir - bdo
disabled - button, input, optgroup, option, select, textarea
enctype - form
face - font
for - label
frame - table
frameborder - iframe
headers - td, th
height - embed, iframe, td^, th^, img, object, applet
href - a, area
hreflang - a
hspace - applet, img^, object^
ismap - img, input
label - option, optgroup
language - script^
longdesc - img, iframe
marginheight - iframe
marginwidth - iframe
maxlength - input
method - form
model* - embed
multiple - select
name - button, embed, textarea, applet^, select, form^, iframe^, img^, a^, input, object, map^, param
nohref - area
noshade - hr^
nowrap - td^, th^
object - applet
onblur - a, area, button, input, label, select, textarea
onchange - input, select, textarea
onfocus - a, area, button, input, label, select, textarea
onreset - form
onselect - input, textarea
onsubmit - form
pluginspage* - embed
pluginurl* - embed
prompt - isindex
readonly - textarea, input
rel - a
rev - a
rows - textarea
rowspan - td, th
rules - table
scope - td, th
scrolling - iframe
selected - option
shape - area, a
size - hr^, font, input, select
span - col, colgroup
src - embed, script, input, iframe, img
standby - object
start - ol^
summary - table
tabindex - a, area, button, input, object, select, textarea
target - a^, area, form
type - a, embed, object, param, script, input, li^, ol^, ul^, button
usemap - img, input, object
valign - col, colgroup, tbody, td, tfoot, th, thead, tr
value - input, option, param, button, li^
valuetype - param
vspace - applet, img^, object^
width - embed, hr^, iframe, img, object, table, td^, th^, applet, col, colgroup, pre^
xml:space - pre, script, style

These are allowed in all but the shown elements:

class - param, script
dir - applet, bdo, br, iframe, param, script
id - script
lang - applet, br, iframe, param, script
onclick - applet, bdo, br, font, iframe, isindex, param, script
ondblclick - applet, bdo, br, font, iframe, isindex, param, script
onkeydown - applet, bdo, br, font, iframe, isindex, param, script
onkeypress - applet, bdo, br, font, iframe, isindex, param, script
onkeyup - applet, bdo, br, font, iframe, isindex, param, script
onmousedown - applet, bdo, br, font, iframe, isindex, param, script
onmousemove - applet, bdo, br, font, iframe, isindex, param, script
onmouseout - applet, bdo, br, font, iframe, isindex, param, script
onmouseover - applet, bdo, br, font, iframe, isindex, param, script
onmouseup - applet, bdo, br, font, iframe, isindex, param, script
style - param, script
title - param, script
xml:lang - applet, br, iframe, param, script


-- 5.3  CSS (2.1) properties accepting URLs ------------------------o


background
background-image
content
cue-after
cue-before
cursor
list-style
list-style-image
play-during


-- 5.4  Microsoft character replacements ---------------------------o


Key: 'd' double, 'l' left, 'q' quote, 'r' right, 's.' single

Code point - hexadecimal value - replacement entity & character

127 - 7f - (removed)
128 - 80 - &#8364; - euro
129 - 81 - (removed)
130 - 82 - &#8218; - baseline s. q
131 - 83 - &#402; - florin
132 - 84 - &#8222; - baseline d q
133 - 85 - &#8230; - ellipsis
134 - 86 - &#8224; - dagger
135 - 87 - &#8225; - d dagger
136 - 88 - &#710; - circumflex accent
137 - 89 - &#8240; - per mille
138 - 8a - &#352; - S Hacek
139 - 8b - &#8249; - l s. guillemet
140 - 8c - &#338; - OE ligature
141 - 8d - (removed)
142 - 8e - &#381; - Z Hacek
143 - 8f - (removed)
144 - 90 - (removed)
145 - 91 - &#8216; - l s. q
146 - 92 - &#8217; - r s. q
147 - 93 - &#8220; - l d q
148 - 94 - &#8221; - r d q
149 - 95 - &#8226; - bullet
150 - 96 - &#8211; - en dash
151 - 97 - &#8212; - em dash
152 - 98 - &#732; - tilde accent
153 - 99 - &#8482; - trademark
154 - 9a - &#353; - s Hacek
155 - 9b - &#8250; - r s. guillemet
156 - 9c - &#339; - oe ligature
157 - 9d - (removed)
158 - 9e - &#382; - z Hacek
159 - 9f - &#376; - Y dieresis
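
A replacement of this kind can be sketched with a byte-to-entity map. The sketch below is an illustration only: 'clean_ms_chars()' is a hypothetical helper, only some rows of the table are included, and the input is assumed to be in a single-byte encoding in which these byte values stand for the Microsoft characters.

```php
<?php
// Sketch only: replace some Microsoft-specific bytes using the table above.
// clean_ms_chars() is a hypothetical helper; only part of the table is used.
function clean_ms_chars($text, $plain_quotes = false)
{
    $map = array(
        "\x80" => '&#8364;', "\x82" => '&#8218;', "\x84" => '&#8222;',
        "\x85" => '&#8230;', "\x91" => '&#8216;', "\x92" => '&#8217;',
        "\x93" => '&#8220;', "\x94" => '&#8221;', "\x96" => '&#8211;',
        "\x97" => '&#8212;', "\x81" => '', "\x8d" => '',
    );
    if ($plain_quotes) {
        // Like the '2' setting: ordinary quotes instead of entities.
        // With '+', entries of the left array take precedence.
        $map = array(
            "\x82" => "'", "\x91" => "'", "\x92" => "'",
            "\x84" => '"', "\x93" => '"', "\x94" => '"',
        ) + $map;
    }
    return strtr($text, $map);
}

echo clean_ms_chars("\x93Hi\x94 \x96 there\x85"), "\n";
```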


-- 5.5  URL format -------------------------------------------------o


An `absolute` URL has a 'protocol' or 'scheme', a 'network location' or 'hostname', and optional 'path', 'parameters', 'query' and 'fragment' segments. Thus, an absolute URL has this generic structure:

    (scheme) : (//network location) /(path) ;(parameters) ?(query) #(fragment)

A scheme can only contain letters, digits, '+', '.' and '-'. Hostname is the portion after the '//' and up to the first '/' (if any; else, up to the end) when ':' is followed by a '//' (e.g., 'abc.com' in 'ftp://abc.com/def'); otherwise, it consists of everything after the ':' (e.g., 'def@abc.com' in 'mailto:def@abc.com').

`Relative` URLs do not have explicit schemes and network locations; such values are inherited from a `base` URL.
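
The segments can be inspected with PHP's built-in 'parse_url()' function, as in the sketch below. This is only an illustration and not how htmLawed itself splits URLs. Note that 'parse_url()' does not separate the ';parameters' segment from the path, and that for 'mailto:def@abc.com' it reports 'def@abc.com' as the path rather than as a hostname.

```php
<?php
// Sketch only: look at URL segments using PHP's parse_url().
$abs  = parse_url('https://abc.com/x/y/a.gif?name=Tim#show');
$mail = parse_url('mailto:def@abc.com');
$rel  = parse_url('z/b.gif');   // relative URL: no scheme, no host

print_r($abs);
print_r($mail);
print_r($rel);
```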

___________________________________________________________________oo


@@description: htmLawed PHP software is a free, open-source, customizable HTML purifier and filter
@@encoding:utf-8
@@keywords: htmLawed, HTM, HTML, converter, filter, formatter, purifier, sanitizer, XSS, input, PHP, software, code, script, security
@@language:en
@@title: htmLawed documentation