5.2  Valid attribute-element combinations
  *  includes deprecated attributes (marked ^), attributes for microdata (marked *), some non-standard attributes for embed (marked **), and the non-standard bordercolor; can have multiple comma-separated values (marked %); can have multiple space-separated values (marked $)
  *  only non-frameset, HTML body elements
  *  name for a and map, and lang are invalid in XHTML 1.1
  *  xml:space is only for XHTML 1.1
  *  excludes data-* and author-specified, non-standard attributes of custom elements
  abbr - td, th
  accept - form, input
  accept-charset - form
  action - form
  align - applet, caption^, col, colgroup, div^, embed, h1^, h2^, h3^, h4^, h5^, h6^, hr^, iframe, img^, input^, legend^, object^, p^, table^, tbody, td, tfoot, th, thead, tr
  allowfullscreen - iframe
  alt - applet, area, img, input
  archive - applet, object
  async - script
  autocomplete - input
  autofocus - button, input, keygen, select, textarea
  autoplay - audio, video
  axis - td, th
  bgcolor - embed, table^, tbody^, td^, tfoot^, th^, thead^, tr^
  border - img, object^, table
  bordercolor - table, td, tr
  cellpadding - table
  cellspacing - table
  challenge - keygen
  char - col, colgroup, tbody, td, tfoot, th, thead, tr
  charoff - col, colgroup, tbody, td, tfoot, th, thead, tr
  charset - a, script
  checked - command, input
  cite - blockquote, del, ins, q
  classid - object
  clear - br^
  code - applet
  codebase - object, applet
  codetype - object
  color - font
  cols - textarea
  colspan - td, th
  compact - dir, dl^, menu, ol^, ul^
  content - meta
  controls - audio, video
  coords - area, a
  crossorigin - img
  data - object
  datetime - del, ins, time
  declare - object
  default - track
  defer - script
  dir - bdo
  dirname - input, textarea
  disabled - button, command, fieldset, input, keygen, optgroup, option, select, textarea
  download - a
  enctype - form
  face - font
  flashvars** - embed
  for - label, output
  form - button, fieldset, input, keygen, label, object, output, select, textarea
  formaction - button, input
  formenctype - button, input
  formmethod - button, input
  formnovalidate - button, input
  formtarget - button, input
  frame - table
  frameborder - iframe
  headers - td, th
  height - applet, canvas, embed, iframe, img, input, object, td^, th^, video
  high - meter
  href - a, area, link
  hreflang - a, area, link
  hspace - applet, embed, img^, object^
  icon - command
  ismap - img, input
  keytype - keygen
  keyparams - keygen
  kind - track
  label - command, menu, option, optgroup, track
  language - script^
  list - input
  longdesc - img, iframe
  loop - audio, video
  low - meter
  marginheight - iframe
  marginwidth - iframe
  max - input, meter, progress
  maxlength - input, textarea
  media - a, area, link, source, style
  mediagroup - audio, video
  method - form
  min - input, meter
  model** - embed
  multiple - input, select
  muted - audio, video
  name - a^, applet^, button, embed, fieldset, form^, iframe^, img^, input, keygen, map^, object, output, param, select, slot, textarea
  nohref - area
  noshade - hr^
  novalidate - form
  nowrap - td^, th^
  object - applet
  open - details, dialog
  optimum - meter
  pattern - input
  ping - a, area
  placeholder - input, textarea
  pluginspage** - embed
  pluginurl** - embed
  poster - video
  pqg - keygen
  preload - audio, video
  prompt - isindex
  pubdate - time
  radiogroup* - command
  readonly - input, textarea
  required - input, select, textarea
  rel$ - a, area, link
  rev - a
  reversed - ol
  rows - textarea
  rowspan - td, th
  rules - table
  sandbox - iframe
  scope - td, th
  scoped - style
  scrolling - iframe
  seamless - iframe
  selected - option
  shape - area, a
  size - font, hr^, input, select
  sizes - img, link, source
  span - col, colgroup
  src - audio, embed, iframe, img, input, script, source, track, video
  srcdoc~ - iframe
  srclang~ - track
  srcset~% - img, link, source
  standby - object
  start - ol
  step~ - input
  summary - table
  target - a, area, form
  type - a, area, button, command, embed, input, li, link, menu, object, ol, param, script, source, style, ul
  typemustmatch~ - object
  usemap - img, input, object
  valign - col, colgroup, tbody, td, tfoot, th, thead, tr
  value - button, data, input, li, meter, option, param, progress
  valuetype - param
  vspace - applet, embed, img^, object^
  width - applet, canvas, col, colgroup, embed, hr^, iframe, img, input, object, pre^, table, td^, th^, video
  wmode - embed
  wrap~ - textarea
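  As a rough illustration of how such a table can be consulted, the sketch below stores a small subset of the combinations listed above and checks one attribute-element pair. The array name and the helper function are hypothetical and are not part of htmLawed; global attributes are not considered.

```php
<?php
// Illustrative subset of the table above: attribute name => elements that accept it.
// $attributeElements and attributeAllowedOn() are hypothetical names, not htmLawed code.
$attributeElements = [
    'abbr'    => ['td', 'th'],
    'colspan' => ['td', 'th'],
    'poster'  => ['video'],
    'start'   => ['ol'],
    'target'  => ['a', 'area', 'form'],
];

/**
 * Report whether $attribute is listed for $element in the lookup table.
 * Both names are lower-cased before the lookup.
 */
function attributeAllowedOn(array $table, string $attribute, string $element): bool
{
    $attribute = strtolower($attribute);
    $element = strtolower($element);
    return isset($table[$attribute]) && in_array($element, $table[$attribute], true);
}

var_dump(attributeAllowedOn($attributeElements, 'colspan', 'td'));  // bool(true)
var_dump(attributeAllowedOn($attributeElements, 'poster', 'img'));  // bool(false)
```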
  The following attributes, including event-specific ones and attributes of ARIA and microdata specifications, are considered global and allowed in all elements:
  accesskey, autocapitalize, autofocus, aria-activedescendant, aria-atomic, aria-autocomplete, aria-braillelabel, aria-brailleroledescription, aria-busy, aria-checked, aria-colcount, aria-colindex, aria-colindextext, aria-colspan, aria-controls, aria-current, aria-describedby, aria-description, aria-details, aria-disabled, aria-dropeffect, aria-errormessage, aria-expanded, aria-flowto, aria-grabbed, aria-haspopup, aria-hidden, aria-invalid, aria-keyshortcuts, aria-label, aria-labelledby, aria-level, aria-live, aria-multiline, aria-multiselectable, aria-orientation, aria-owns, aria-placeholder, aria-posinset, aria-pressed, aria-readonly, aria-relevant, aria-required, aria-roledescription, aria-rowcount, aria-rowindex, aria-rowindextext, aria-rowspan, aria-selected, aria-setsize, aria-sort, aria-valuemax, aria-valuemin, aria-valuenow, aria-valuetext, class, contenteditable, contextmenu, dir, draggable, dropzone, enterkeyhint, hidden, id, inert, inputmode, is, itemid, itemprop, itemref, itemscope, itemtype, lang, nonce, onabort, onblur, oncanplay, oncanplaythrough, onchange, onclick, oncontextmenu, oncopy, oncuechange, oncut, ondblclick, ondrag, ondragend, ondragenter, ondragleave, ondragover, ondragstart, ondrop, ondurationchange, onemptied, onended, onerror, onfocus, onformchange, onforminput, oninput, oninvalid, onkeydown, onkeypress, onkeyup, onload, onloadeddata, onloadedmetadata, onloadend, onloadstart, onlostpointercapture, onmousedown, onmousemove, onmouseout, onmouseover, onmouseup, onmousewheel, onpaste, onpause, onplay, onplaying, onpointercancel, ongotpointercapture, onpointerdown, onpointerenter, onpointerleave, onpointermove, onpointerout, onpointerover, onpointerup, onprogress, onratechange, onreadystatechange, onreset, onsearch, onscroll, onseeked, onseeking, onselect, onshow, onstalled, onsubmit, onsuspend, ontimeupdate, ontoggle, ontouchcancel, ontouchend, ontouchmove, ontouchstart, onvolumechange, onwaiting, onwheel, onauxclick, oncancel, onclose, 
oncontextlost, oncontextrestored, onformdata, onmouseenter, onmouseleave, onresize, onsecuritypolicyviolation, onslotchange, role, slot, spellcheck, style, tabindex, title, translate, xmlns, xml:base, xml:lang, xml:space
  Custom data-* attributes, where the first three characters of the value of star (*) after lower-casing do not equal xml and the value of star does not have a colon (:), equal-to (=), newline, solidus (/), space, tab, or any A-Z character, are also considered global and allowed in all elements.
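  The rule for custom data-* attribute names can be expressed as a short check. The following is a minimal sketch, not htmLawed's own code; the function name is hypothetical, and rejecting an empty star part is an assumption that the text above does not state.

```php
<?php
/**
 * Check a candidate attribute name against the data-* rule described above.
 * Hypothetical helper for illustration; not taken from htmLawed.
 */
function isAllowedDataAttribute(string $name): bool
{
    if (strncmp($name, 'data-', 5) !== 0) {
        return false;                 // not a data-* name at all
    }
    $star = substr($name, 5);         // the part matched by "*"
    if ($star === '') {
        return false;                 // assumption: an empty "*" part is rejected
    }
    if (strtolower(substr($star, 0, 3)) === 'xml') {
        return false;                 // first three characters must not be "xml"
    }
    // Reject colon, equal-to, newline, solidus, space, tab, and any A-Z character.
    return preg_match('~[:=\n/ \tA-Z]~', $star) !== 1;
}

var_dump(isAllowedDataAttribute('data-user-id'));  // bool(true)
var_dump(isAllowedDataAttribute('data-xmlfoo'));   // bool(false)
```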
5.6  Brief on htmLawed code
  Much of the code's logic and reasoning can be understood from the documentation above.
  The output of htmLawed is a text string containing the processed input. There is no custom error tracking.
  
  Function arguments for htmLawed are:
  *  $in - first argument; a text string; the input text to be processed. Any extraneous slashes added by PHP when magic quotes are enabled should be removed beforehand using PHP's stripslashes() function.
  *  $config - second argument; an associative array; optional; named $C within htmLawed code. The array has keys with names like balance and keep_bad, and the values, which can be boolean, string, or array, depending on the key, are read to set the corresponding configurable parameters (indicated by the keys). Every configurable parameter receives a default value if the user does not specify one through $config. Finalized $config is thus a filtered and possibly larger array.
  *  $spec - third argument; a text string; optional. The string has rules, written in an htmLawed-designated format, specifying element-specific attribute and attribute value restrictions. Function hl_spec() is used to convert the string to an associative array, named $S within htmLawed code, for internal use. Finalized $spec is thus an array.
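  The idea of a finalized $config that is both filtered and possibly larger can be sketched as follows. This is a simplified illustration, not htmLawed's code: the function name is hypothetical, and the default values and validity checks shown for balance and keep_bad are assumptions made for the example.

```php
<?php
/**
 * Merge user-supplied settings over defaults, keeping only known keys and
 * falling back to the default when a value fails its validity check.
 * Defaults and checks are illustrative assumptions, not htmLawed's own.
 */
function finalizeConfig(array $userConfig): array
{
    // key => [default value, validity check]
    $known = [
        'balance'  => [1, fn($v) => in_array($v, [0, 1], true)],
        'keep_bad' => [6, fn($v) => is_int($v) && $v >= 0 && $v <= 6],
    ];
    $final = [];
    foreach ($known as $key => [$default, $isValid]) {
        $value = $userConfig[$key] ?? $default;             // missing key: use default
        $final[$key] = $isValid($value) ? $value : $default; // invalid value: use default
    }
    return $final;                                           // unknown keys are dropped
}

print_r(finalizeConfig(['keep_bad' => 2, 'unknown_key' => 'x']));
```

  Unknown keys are filtered out and missing keys are filled in, which is why the result can differ in size from what the caller supplied.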
  
  Finalized $config and $spec are made global variables while htmLawed is at work. Values of any pre-existing global variables with the same names are noted, and their values are restored after htmLawed finishes processing the input (to capture the finalized values, the show_settings parameter of $config should be used). Depending on $config, another global variable, hl_Ids, which tracks id attribute values for uniqueness, may be set. Unlike the other two variables, this one is not reset (or unset) post-processing.
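  The save-and-restore handling of global variables described above follows a common pattern, sketched below. The function name is hypothetical and the code is not taken from htmLawed; unsetting a global that did not exist beforehand is an assumption about the behaviour.

```php
<?php
/**
 * Run $work with $C and $S set as globals, then put back whatever the
 * caller had under those names. A sketch of the pattern only.
 */
function withTemporaryGlobals(array $config, array $spec, callable $work)
{
    $saved = [];
    foreach (['C', 'S'] as $name) {
        // Note pre-existing globals so that they can be restored later.
        if (array_key_exists($name, $GLOBALS)) {
            $saved[$name] = $GLOBALS[$name];
        }
    }
    $GLOBALS['C'] = $config;
    $GLOBALS['S'] = $spec;
    try {
        return $work();
    } finally {
        foreach (['C', 'S'] as $name) {
            if (array_key_exists($name, $saved)) {
                $GLOBALS[$name] = $saved[$name];   // restore the caller's value
            } else {
                unset($GLOBALS[$name]);            // assumption: remove if it did not exist
            }
        }
    }
}

$GLOBALS['C'] = 'caller value';
$balance = withTemporaryGlobals(['balance' => 1], [], function () {
    return $GLOBALS['C']['balance'];               // sees the temporary global
});
var_dump($balance, $GLOBALS['C']);
```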
  Except for the main htmLawed() function, htmLawed's functions are name-spaced using the hl_ prefix. The functions and their roles are:
  *  hl_attributeValue - check attribute values against $spec rules
  *  hl_balance - balance tags and ensure proper nesting
  *  hl_commentCdata - handle CDATA sections and HTML comments
  *  hl_deprecatedElement - transform element tags
  *  hl_entity - handle character entities
  *  hl_regex - check syntax of a regular expression
  *  hl_spec - convert $spec value to one used internally
  *  hl_tag - handle element tags and attributes
  *  hl_tidy - compact/beautify HTML
  *  hl_url - check URL-containing values
  *  hl_version - report htmLawed version
  *  htmLawed - main function
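  To illustrate what checking the syntax of a regular expression can involve in PHP, the sketch below tests whether a pattern compiles. It is one possible approach and an assumption for illustration; hl_regex() may work differently, and the function name used here is hypothetical.

```php
<?php
/**
 * Return true if $pattern (delimiters included) compiles as a PCRE pattern.
 * Sketch only; not the implementation of hl_regex().
 */
function isWellFormedRegex(string $pattern): bool
{
    if ($pattern === '') {
        return false;
    }
    // preg_match() returns false, with a warning, when the pattern cannot be
    // compiled; the @ operator silences that warning for this single call.
    return @preg_match($pattern, '') !== false;
}

var_dump(isWellFormedRegex('/^[a-z]+$/i'));  // bool(true)
var_dump(isWellFormedRegex('/[a-z/'));       // bool(false)
```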
  
  htmLawed() finalizes $spec (with the help of hl_spec()) and $config, and globalizes them. Finalization of $config involves setting default values if an inappropriate or invalid one is supplied. This includes calling hl_regex() to check well-formedness of regular expression patterns if such expressions are user-supplied through $config. htmLawed() then removes invalid characters like nulls and x01 and appropriately handles entities using hl_entity(). HTML comments and CDATA sections are identified and treated as per $config with the help of hl_commentCdata(). When retained, the < and > characters identifying them, and the <, > and & characters inside them, are replaced with control characters (code-points 1 to 5) until any tag balancing is completed.
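  The temporary replacement of markup characters with control characters can be pictured with the sketch below, which masks a retained HTML comment so that later tag-matching code does not see its <, > and & characters, and then restores them. Which of the code-points 1 to 5 stands for which character is an assumption made for this example, and the constant and function names are hypothetical.

```php
<?php
// Assumed mapping; the text above only states that code-points 1 to 5 are used.
const MASK_OPEN  = "\x01";  // stands in for the "<" that opens the comment
const MASK_CLOSE = "\x02";  // stands in for the ">" that closes the comment
const INNER_MASK = ['&' => "\x03", '<' => "\x04", '>' => "\x05"];

/** Mask one complete comment such as "<!-- a < b -->". */
function maskComment(string $comment): string
{
    $inner = substr($comment, 1, -1);  // drop the outer "<" and ">"
    return MASK_OPEN . strtr($inner, INNER_MASK) . MASK_CLOSE;
}

/** Restore the original characters once tag processing is done. */
function unmask(string $text): string
{
    return strtr($text, [
        MASK_OPEN => '<', MASK_CLOSE => '>',
        "\x03" => '&', "\x04" => '<', "\x05" => '>',
    ]);
}

$comment = '<!-- a < b && c > d -->';
$masked = maskComment($comment);
var_dump(unmask($masked) === $comment);  // bool(true)
```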
  After this initial processing, htmLawed() identifies tags using regex and processes them with the help of hl_tag() -- a large function that analyzes tag content, filtering it as per HTML standards, $config and $spec. Among other things, hl_tag() transforms deprecated elements using hl_deprecatedElement(), removes attributes from closing tags, checks attribute values as per $spec rules using hl_attributeValue(), and checks URL protocols using hl_url(). htmLawed() performs tag balancing and nesting checks with a call to hl_balance(), and optionally compacts/beautifies the output with proper white-spacing with a call to hl_tidy(). The latter temporarily replaces white-space, and <, > and & characters inside pre, script and textarea elements, and HTML comments and CDATA sections with control characters (code-points 1 to 5, and 7).
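  For readers unfamiliar with tag balancing, the general technique can be sketched with a stack of open elements. The code below is a deliberately simplified illustration and not hl_balance(): the function name is hypothetical, and void elements (such as br) and element nesting rules, which real balancing must handle, are ignored.

```php
<?php
/**
 * Close unclosed elements and drop stray closing tags using a stack.
 * Teaching sketch only: void elements and nesting rules are not handled.
 */
function balanceTags(string $html): string
{
    // Split into tags and text, keeping the tags in the result.
    $parts = preg_split('~(</?[a-z][a-z0-9]*[^>]*>)~i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $open = [];   // names of currently open elements, innermost last
    $out = '';
    foreach ($parts as $part) {
        if (!preg_match('~^<(/?)([a-z][a-z0-9]*)~i', $part, $m)) {
            $out .= $part;             // plain text
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '') {            // opening tag
            $open[] = $name;
            $out .= $part;
            continue;
        }
        if (!in_array($name, $open, true)) {
            continue;                  // stray closing tag: drop it
        }
        // Close elements opened after the matching one, then the match itself.
        do {
            $top = array_pop($open);
            $out .= "</$top>";
        } while ($top !== $name);
    }
    while ($open) {                    // close whatever is still open
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo balanceTags('<b><i>x</b>y'), "\n";  // <b><i>x</i></b>y
echo balanceTags('a</p>b<em>c'), "\n";   // ab<em>c</em>
```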
  htmLawed permits the use of custom code or hook functions at two stages. The first, called inside htmLawed(), allows the input text as well as the finalized $config and $spec values to be altered right after the initial processing (see section 3.7). The second is called by hl_tag() once the tag content is finalized (see section 3.4.9).
  The functionality of htmLawed is dictated by the external HTML standards. The code of htmLawed is thus written for a clear-cut aim, with not much concern for tweaking by other developers. The code is only minimally annotated with comments -- it is not meant to instruct. PHP developers familiar with the HTML specifications will see the logic, and others can always refer to the htmLawed documentation.
htmLawed 1.2.15
Copyright Santosh Patnaik
Dual licensed with LGPL 3 and GPL 2+
A PHP Labware internal utility - https://bioinformatics.org/phplabware/internal_utilities/htmLawed