1

Topic: Nested list parsing bug?

The following HTML chunk, which is valid as far as I can tell, does not make it through htmLawed correctly with the default settings on the test page.

<ul>
  <li>blah
    <ul>
      <li>bleh
    </ul>
  <li>zip
</ul>

Instead you get:

<ul>
  <li>blah
    <ul>
      <li>bleh
    </li></ul>
  zip
</li></ul>

The final bullet point ends up wrapped inside the same <li></li> as the first bullet, which is not the intent.

2

Re: Nested list parsing bug?

The test input you are putting in is actually not valid; not one of the three 'li' elements is closed properly.

When dealing with such inputs, an HTML parser and corrector algorithm, like htmLawed's, has to guess the correct nested structure. This can involve assuming that a closing tag was mistyped as an opening tag, that a closing tag was not put in at all, and so on.

Thus, the [ul]-[li]-[ul]-[li]-[/ul]-[li]-[/ul] flow can be 'corrected' in many different ways. htmLawed chooses the fastest one to give [ul]-[li]-[ul]-[li]-[/li]-[/ul]-[/li]-[/ul], which is valid, though not what the writer may have wanted.
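
To make the 'fastest correction' above concrete, here is a toy sketch in Python. It is not htmLawed's code (htmLawed is written in PHP); the function name balance_naive and the tiny ALLOWED_CHILDREN table are assumptions made up for this illustration. It drops an opening tag that is not a legal child of its parent, and adds missing closing tags only when an enclosing element is closed.

```python
# Toy model only; not htmLawed's actual code.
# Hypothetical table: which elements each parent may contain directly.
ALLOWED_CHILDREN = {"ul": {"li"}, "li": {"ul"}}


def balance_naive(tags):
    """Balance a flat tag flow such as ["ul", "li", "/ul"]."""
    out, stack = [], []
    for tag in tags:
        if tag.startswith("/"):
            name = tag[1:]
            if name not in stack:
                continue  # stray closing tag: drop it
            # Close every element still open inside the one being closed.
            while stack:
                top = stack.pop()
                out.append("/" + top)
                if top == name:
                    break
        else:
            parent = stack[-1] if stack else None
            if parent is not None and tag not in ALLOWED_CHILDREN.get(parent, set()):
                continue  # illegal child: drop the tag (its text would be kept)
            stack.append(tag)
            out.append(tag)
    # Close whatever is still open at the end of the input.
    while stack:
        out.append("/" + stack.pop())
    return out


flow = ["ul", "li", "ul", "li", "/ul", "li", "/ul"]
print("-".join(f"[{t}]" for t in balance_naive(flow)))
```

In this sketch the third [li] is dropped because the first 'li' is still open and 'li' cannot sit directly inside 'li', so its text ends up in the first list item.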

3

Re: Nested list parsing bug?

Hmmm....

According to the W3C, as far as I can tell, omitting the </li> is legal in HTML 4.01 Transitional. See both the example at the top of this page:
http://www.w3.org/TR/html401/struct/lists.html
As well as the line:
"...lists are made up of sequences of list items defined by the LI element (whose end tag may be omitted)."

I fully appreciate why this is a pain to deal with in htmLawed. But it's a legal construct, one that many other tools manage to cope with unambiguously, and in my experience a very common one. Certainly in the HTML that I've encountered, finding closed <li> tags is the exception, not the rule. Obviously in XML or HTML 5 things are different. But 4.01 is the world I'm obliged to play in.

In the particular application I'm writing at the moment, the HTML in question is being created directly by the designMode/contentEditable tools built into the browser, and at least in FF3.0 and Safari 3.1 the HTML generated by the browser has no </li> tags.

So for me getting 'better' HTML is not an option. Nor is outlawing nested lists for my users. Nor is having HTML that renders correctly render incorrectly after I "clean" it with htmLawed. I appreciate that it may not be a high priority, or even practical, to fix htmLawed to handle this construct. If so, just let me know and I'll start looking again at the alternatives which I originally discarded in favor of htmLawed.

4

Re: Nested list parsing bug?

You are right: HTML 4 Transitional does permit two dozen or so closing tags to be omitted (http://meiert.com/en/blog/20080601/optional-tags-in-html-4/).

It seems to me that this issue of correct-but-unwanted nesting correction by htmLawed may be an issue only for the 'li', 'dd' and 'dt' elements. Have you seen an issue with htmLawed for other elements when omissible closing tags are not there?

I may be able to add an option for htmLawed to handle such cases. The logic may involve costly backtracking, though, in which case I will just suggest a mod for you to put in your code. Otherwise, very likely HTML Tidy, and perhaps HTML Purifier, may suit your needs.

5

Re: Nested list parsing bug?

Wow! I didn't realize the list was that long. Certainly for my purposes li, dd and dt would do the trick. I'm only dealing with HTML snippets that will get dumped into an already valid template. And I'm restricting things to a small set of core tags (no <area> or <param> to worry about). Handling these common cases would seem to fit the spirit of htmLawed: that it handles the real-world cases efficiently, unlike some other tools which handle all the edge cases but are too slow to be useful (to me anyway).

6

Re: Nested list parsing bug?

If the input HTML is not human-written, why not bypass tag-balancing? You can still get htmLawed to do tag-filtering, etc.

Tag-balancing is optional in htmLawed, and it is during this step that nestings are checked and corrected.

7

Re: Nested list parsing bug?

Good point. I hadn't played with the options enough to realize that the filtering and the balancing were independent. I'll try just turning off the balancing and make sure that isn't going to cause me problems elsewhere (some of the machine-generated stuff out of IE is less than pretty...). I'll post again if that doesn't do the trick. Thanks for the idea.

8

Re: Nested list parsing bug?

Tag-balancing currently does these types of things:

1. For un-nested elements, put closing tags if absent.
2. For nested elements, remove/neutralize illegal nesting.
3. For nested elements with missing closing tags, add the closing tags.
4. For inline elements inside 'blockquote' and 'form', like 'input', that are not inside a block element, add the container block element 'div'.

None of these cases are likely to happen with machine-generated HTML.

htmLawed handles element/attribute filtering/normalization, removal of unnecessary/wrong closing tags, etc., before tag-balancing.

One drawback to turning off tag balancing is that one cannot convert HTML 4 with omitted tags to XHTML (XML), and this may cause problems when the content is viewed as non-HTML Transitional (newsfeed readers, web-pages served as XHTML, etc.).

9

Re: Nested list parsing bug?

The new release, 1.1.1, of htmLawed has a fix for the issue. Besides li, colgroup, dd, dt, option, td, tr, th, thead, tbody and tfoot are also covered.
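
For readers wondering what handling an omitted closing tag can look like, here is a toy sketch in Python. It is not taken from the htmLawed 1.1.1 source, and it covers only 'li', 'dd' and 'dt'; the names balance_implied, IMPLIES_CLOSE and SCOPE are assumptions made up for this illustration. The idea: when one of these tags opens, look up the stack of open elements for a sibling it should close, stopping at the enclosing list.

```python
# Toy model only; not the logic shipped in htmLawed 1.1.1.
# Hypothetical table: an opening tag (key) implicitly closes an open
# element whose name is in the value set.
IMPLIES_CLOSE = {"li": {"li"}, "dt": {"dt", "dd"}, "dd": {"dt", "dd"}}
# Enclosing list elements: the upward search never crosses these.
SCOPE = {"ul", "ol", "dl"}


def balance_implied(tags):
    """Balance a flat tag flow such as ["ul", "li", "li", "/ul"]."""
    out, stack = [], []
    for tag in tags:
        if tag.startswith("/"):
            name = tag[1:]
            if name not in stack:
                continue  # stray closing tag: drop it
            # Close every element still open inside the one being closed.
            while stack:
                top = stack.pop()
                out.append("/" + top)
                if top == name:
                    break
        else:
            closers = IMPLIES_CLOSE.get(tag, set())
            # Search upward for an open sibling, stopping at the list.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth] in SCOPE:
                    break
                if stack[depth] in closers:
                    # Close the sibling and anything open inside it.
                    while len(stack) > depth:
                        out.append("/" + stack.pop())
                    break
            stack.append(tag)
            out.append(tag)
    # Close whatever is still open at the end of the input.
    while stack:
        out.append("/" + stack.pop())
    return out


flow = ["ul", "li", "ul", "li", "/ul", "li", "/ul"]
print("-".join(f"[{t}]" for t in balance_implied(flow)))
```

With the flow from the first post, this sketch keeps the third [li] as its own list item, a sibling of the first, instead of folding its text into the first one.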