1 (edited by j8 2008-03-22 12:56:42)

Topic: kses compatibility question

Edit: nvm

I was asking whether I could set a whitelist of attributes for each element the same way it is done in kses. After testing, it seems to work the same way. I can't for the life of me find where you're parsing the attributes from the input array ('tag'=>('attribute'=>1, etc.)), but it works nonetheless.

excellent work..

By the way, about the Word thing: it does break in UTF-8 environments, but a simple:
        $html = html_entity_decode($html, ENT_COMPAT, "utf-8");

before you run the text through this lib fixes that up nicely, and there is no need to set the flag in htmLawed.
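
To show what that decode step does on its own, here is a minimal sketch. The sample string is made up; the assumption is that the pasted text arrives with named entities such as &ldquo; in it.

```php
<?php
// Made-up sample input: named entities as they might appear in pasted text.
$html = 'caf&eacute; &ldquo;quoted&rdquo; &amp; more';

// Decode the entities into real UTF-8 characters before handing the
// text to the filter.
$decoded = html_entity_decode($html, ENT_COMPAT, 'UTF-8');

echo $decoded, "\n";
```

Note that the same call also turns &lt; and &gt; into real angle brackets, so it belongs before the filter, never after it.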

thanks again.

2

Re: kses compatibility question

For compatibility with code that calls kses (such as that in WordPress), htmLawed simply checks (in function 'htmLawed') whether the third argument supplied to it, '$spec', is an array. If not, it converts '$spec' to an array (function 'hl_spec'). The net effect is the same, though element-specific rules can be defined more easily and more finely when '$spec' is not specified as an array.

E.g., both these will allow only 'href' and 'title':

$out = htmLawed($in, $cf, array('a'=>array('href'=>1, 'title'=>1)));
$out = htmLawed($in, $cf, 'a=-*, href, title');

The latter way is also simpler when one wants to deny just a single attribute, say 'id', or for stricter attribute value checks:

$out = htmLawed($in, $cf, 'a=-id');
$out = htmLawed($in, $cf, 'a=href(maxlen=50)');
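
To make the correspondence between the two forms concrete, here is a rough sketch that builds the string form from a kses-style array. The helper name kses_array_to_spec is hypothetical (it is not part of htmLawed or kses), and joining rules for different elements with semicolons is an assumption about the spec syntax.

```php
<?php
// Hypothetical helper, not part of htmLawed or kses: turns a kses-style
// array into a spec string of the form used in the examples above.
function kses_array_to_spec(array $allowed)
{
    $rules = array();
    foreach ($allowed as $element => $attributes) {
        // '-*' denies every attribute; the names after it are then permitted.
        $parts = array_merge(array('-*'), array_keys($attributes));
        $rules[] = $element . '=' . implode(', ', $parts);
    }
    // Assumption: rules for different elements are separated by semicolons.
    return implode('; ', $rules);
}

$spec = kses_array_to_spec(array('a' => array('href' => 1, 'title' => 1)));
echo $spec, "\n";
```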

Re: the Word/UTF-8 aspect, the entity-decoding code you suggest indeed works.

3 (edited by j8 2008-03-22 20:03:37)

Re: kses compatibility question

Cheers. I missed the bit where it called the kses_hook ;)

I have another question.

Let me first explain how I'm using this lib:

I have a BBCode parser and unparser. To make unparsing possible I used to embed comments. For example, during parsing [ u]text[ /u] would become <span class="underline">text<!--u--></span>. I would then pass this through kses and save the parsed text; during editing, I would unparse it back to [ u]text[ /u]. As you can probably see, the comments are there so the unparser knows which BBCode tag this span corresponds to.

This worked well with kses because it did no re-ordering of the elements (the re-ordering is the reason I want to move to this lib instead).

Now I've worked out a solution that allows re-ordering to take place: a two-stage parsing. Stage one would do: [ u]text[ /u] => <spanu>text</spanu>, with spanu defined in the lib as a valid inline element. Once hl was done with it, I would then convert <spanu>text</spanu> -> <span class="underline">text<!--u--></span>. This of course will only be done for tags that are completely defined by the parser and accept no extra attributes (for security reasons).
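
The two stages described above could look roughly like this. It is only a sketch: apply_tag_map and both maps are hypothetical names, and the filter call itself is left out, with a comment marking where it would run.

```php
<?php
// Hypothetical maps for a single BBCode tag that takes no attributes.
$bbcode_to_placeholder = array(
    '[u]'  => '<spanu>',
    '[/u]' => '</spanu>',
);
$placeholder_to_html = array(
    '<spanu>'  => '<span class="underline">',
    '</spanu>' => '<!--u--></span>',
);

// Hypothetical helper: plain string replacement driven by a map.
function apply_tag_map($text, array $map)
{
    return str_replace(array_keys($map), array_values($map), $text);
}

// Stage one: BBCode to placeholder elements.
$stage1 = apply_tag_map('[u]text[/u]', $bbcode_to_placeholder);

// The HTML filter would run here on $stage1, with 'spanu' registered
// as a valid inline element so that it can be re-ordered and balanced.

// Stage two: placeholders to the final markup with the unparse marker.
$stage2 = apply_tag_map($stage1, $placeholder_to_html);
echo $stage2, "\n";
```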

The question is: is there a smarter way to do this that avoids the two-part conversion (maybe some changes to hl)? I admit I did attempt that, but without much success. I reached the point where it worked correctly for some cases, but for others (malformed input, of course) it re-ordered things in such a way that, e.g., the input [ b]asdf[ u]asdf[ /b][ /u] would come back as [ b]asdf[ u]asdf<!--u-->[ /b]</span>. The reason the end didn't get unparsed back to [ /u] is that the unparser expects <!--u--></span>, which obviously wasn't there.

My second question:
There are a ton of deprecated elements, and elements that I will simply never use because my BBCode parser will never produce them (center, form, input, applet, button, fieldset, etc.). Is it safe to remove those in order to speed up the lib's processing time (smaller arrays, fewer things to look for)? At least for kses, the number of elements/attributes I provided really affected the processing time.

The way I call it now is the kses way, where I explicitly define the allowed elements and the attributes allowed for those elements. The array is generated dynamically during parsing, so if there is, e.g., no [b] tag, I won't include 'strong' => array() in the array of allowed_html elements at all. In kses that made a big difference in processing time; I haven't tested it with this lib since my code already did that.

thanks again.

4

Re: kses compatibility question

I assume your application, irrespective of the availability of a good HTML purifier/filter, *has* to keep using the BBCode system for at least one of these reasons:

-- you don't want to fiddle with the code to inactivate BBCode
-- users are not HTML-savvy
-- the BBCode is auto-generated by some JavaScript-based intermediary

Otherwise, one could just use an HTML filter and be done with BBCode parsing/unparsing.

---------------------------------------------------------

Can't the system just do the kses/htmLawed-filtering on the input and store it without BBCode parsing (which would be done during web page generation [but not for display in textareas during editing])? That way, there is no need for unparsing.

This does require htmLawed to work at two different stages. During input processing, its main work can be to remove unwanted HTML without bothering about balanced or properly-nested tags. During web page generation, it would need to just do tag balancing.

---------------------------

Even otherwise, for unparsing, won't a simple string replacement work? Like:

$show_in_textarea = str_replace(array('<strong>', '</em>',...), array('[b]', '[/i]',...), $parsed);

5 (edited by j8 2008-03-22 21:30:13)

Re: kses compatibility question

Your assumptions about why we need to keep BBCode are correct: mainly ease of use and backwards compatibility for the thousands of posts we have.

The reason the text is pre-parsed and stored in the DB in parsed form is simply performance. There are far more views than inserts, so there is little point in parsing from BBCode on every request. For 100 posts per page, that is 100 parses of possibly fairly large texts, and that is just for one member. With 3k online at any given time, you can see why it is beneficial to keep it parsed in the DB and just output it as is, with no further checks, even if the parsing stage takes a bit longer when data is entered. Overall it simply gives better efficiency.
--------
If I were to store it as BBCode, the first stage you mentioned wouldn't even be necessary. An htmlentities call would secure the input completely (as the BBCode isn't parsed yet, and BBCode itself uses [] instead of <>). Then on output I would have to parse it to HTML, and then check the HTML produced using this lib or kses. I think it is much better to pre-parse it and spend a bit more processing time during parsing/unparsing. Unparsing is actually very fast with the <!--x--> system I'm using: there are no security checks, since those are done during parsing, and it's just a simple str_replace call as mentioned below. Generally I believe in the idea that you secure while you insert and then just display as is.

A simpler example would be storing a 'name'. It would be much more efficient to run htmlentities on the name before you insert it in the DB, and then just display it as is, instead of securing it each time you display it.
---------
For strong and em, yeah. But what about more complex tags built from div and span that use classes to define the style?

eg:
[box]text[/box] => <div class="box">text</div>
[bbox]text[/bbox] => <div class="bbox">text</div>

With a simple str_replace there is no way of distinguishing the two </div> tags; you wouldn't know which one is box and which one is bbox. Of course my unparsing is simple str_replace calls, which is why I put the comments in there. So <div class="box">text<!--box--></div> can easily be translated with str_replace(array('<div class="box">','<!--box--></div>'), array('[box]','[/box]'), $parsed); etc.
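
A small runnable sketch of this marker-comment unparsing, covering both box and bbox, is below. The map and the unparse_bbcode name are made up for illustration; only str_replace itself is standard PHP.

```php
<?php
// Hypothetical map: each closing </div> is prefixed with a marker comment,
// so the closer can be matched to the right BBCode tag.
$html_to_bbcode = array(
    '<div class="box">'  => '[box]',
    '<!--box--></div>'   => '[/box]',
    '<div class="bbox">' => '[bbox]',
    '<!--bbox--></div>'  => '[/bbox]',
);

// Hypothetical helper: unparse stored HTML back to BBCode for editing.
function unparse_bbcode($parsed, array $map)
{
    return str_replace(array_keys($map), array_values($map), $parsed);
}

$parsed = '<div class="box">outer <div class="bbox">inner<!--bbox--></div><!--box--></div>';
$bbcode = unparse_bbcode($parsed, $html_to_bbcode);
echo $bbcode, "\n";
```

Without the markers both closers would be the same string, </div>, and nested boxes could not be told apart.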

You could argue that I could just use [/box] for all </div> closers and be done with this issue altogether, but that won't look nice for the end user :P

Plus, the box/bbox example wasn't a good one.

What if you have floaters, [left] and [right]? Both use divs to accomplish it, and you can float anything you want in them: images, quote boxes, text, URLs, etc. Having [left]asdf[/box] wouldn't make much sense ;)

Or a [ code] box, or [center] to center whatever you put inside it. A similar problem arises with span tags: size, color, underline, drop caps, strikethrough, etc. The BBCode I'm offering is quite rich. HTML as input would be very hard for the members: would they need to memorize class names and learn HTML in order to make a nice-looking post?


Of course you could use JS to create a nice editor that automates all of this, but then what about members who have JS disabled?

6

Re: kses compatibility question

I think I got the full picture now. So at this stage you want to figure out how to pass custom elements through htmLawed and to speed up the filter even more.

A significant portion of htmLawed's work, balancing/nesting tags, is done by function 'hl_bal'. This can be turned off ('balance'=>0) to hasten some phase of your application's workflow.

To add custom elements, edit the code that defines the $ec array inside function 'htmLawed' to add the new elements:

// config: elements
$ec = array('a'=>1, 'abbr'=>1, ...

You can also remove elements that will never be in the input (like 'form').

To further reduce array look-ups, you can also edit code defining the $aN and $aU arrays that define valid attributes in 'hl_tag'.

If there isn't going to be any tag balancing, this is all that is needed. Otherwise, the custom elements have to be specified in the 'hl_bal' function. You'd also want to remove the unnecessary standard elements from the various arrays defined in the beginning of the function's code.

7

Re: kses compatibility question

The custom elements way is exactly the way I had thought of as well ;)

thanks for the detailed explanations :D ..

8

Re: kses compatibility question

OK, it's working great so far. I've done all this and the processing speed is really good.

I'd like to make a suggestion for the code.

When you have block elements inside inline elements, it would be nice if, instead of either removing the block element or escaping it, the input were processed in such a way as to make the HTML valid. That could obviously be just an option, since it could increase processing time a bit.

Let me explain what I mean with an example.

input=> <strong>some text<div>some more text</div>aaaa</strong>
Right now this would either remove <div> and </div> altogether, or convert them to special chars and print them as is.

What I'm suggesting is that it does what the person entering the text intended (make everything bold),
e.g.:
<strong>some text</strong><div><strong>some more text</strong></div><strong>aaaa</strong>

a more complex example would be:
input-> <strong>some text<div>some more </strong> and <i><strong>text </div>aa</strong></i>
output => <strong>some text</strong><div><strong>some more </strong> and <i><strong>text</strong></i></div><i><strong>aa</strong></i>

I am not sure if it's possible, but I think it'd be a neat addition if it can be done.
Basically, when you have a block inside an inline element, you close the inline before the start of the block, then re-open it after the block element has been opened, and continue the process until you find a closer for the inline element (in which case we remove it from the correction_stack) or, if there is none, just add the closer at the end of the text. If you find another inline element on the way, and that one is improperly nested as well, you push it onto the correction_stack and do the same as you would with just one inline element. If the other inline element is properly closed and conforms to the rules, then we pop it from the stack and continue with just our faulty inline element. There can't be two inline elements of the same type in the stack with this algorithm, so that isn't a problem (if we find another opener for an element that is already in the stack, we just remove it).

I don't know if I'm making any sense (hopefully I am ;) ), but I think with the 'tree' you're creating this would be possible with an extra array that holds the notion of the 'stack' (and we only remove from the array when we find the matching closing element or when we reach the end).
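
A rough, self-contained sketch of the stack idea follows. It is not htmLawed code: the function names and element lists are made up, attributes are ignored (tags with attributes pass through as text), block tags are assumed to be balanced already, and a duplicate opener for an element already on the stack is simply dropped. It can also emit an empty pair such as <strong></strong> when an inline is re-opened right before its closer.

```php
<?php
// Hypothetical helper: render a list of tag names as openers ('') or closers ('/').
function render_tags(array $names, $prefix)
{
    $out = '';
    foreach ($names as $name) {
        $out .= '<' . $prefix . $name . '>';
    }
    return $out;
}

// Hypothetical sketch: split inline elements around block elements.
function split_inline_around_blocks($html)
{
    $inline = array('strong' => 1, 'em' => 1, 'i' => 1, 'b' => 1, 'u' => 1, 'span' => 1);
    $block  = array('div' => 1, 'p' => 1, 'blockquote' => 1);
    $stack  = array(); // open inline elements, outermost first
    $out    = '';

    // Split into attribute-less tags and everything else.
    $tokens = preg_split('~(</?[a-z][a-z0-9]*\s*>)~i', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        if (!preg_match('~^<(/?)([a-z][a-z0-9]*)\s*>$~i', $token, $m)) {
            $out .= $token; // plain text
            continue;
        }
        $closing = ($m[1] === '/');
        $name = strtolower($m[2]);

        if (isset($block[$name])) {
            // Close every open inline, emit the block tag, re-open the inlines.
            $out .= render_tags(array_reverse($stack), '/')
                  . '<' . $m[1] . $name . '>'
                  . render_tags($stack, '');
        } elseif (isset($inline[$name])) {
            $pos = array_search($name, $stack, true);
            if (!$closing && $pos === false) {
                $stack[] = $name;
                $out .= "<$name>";
            } elseif ($closing && $pos !== false) {
                // Close the inlines opened after this one, close it, re-open them.
                $above = array_slice($stack, $pos + 1);
                $out .= render_tags(array_reverse($above), '/')
                      . "</$name>"
                      . render_tags($above, '');
                array_splice($stack, $pos, 1);
            }
            // Duplicate openers and stray closers are dropped.
        } else {
            $out .= $token; // unknown elements pass through untouched
        }
    }
    // Close whatever is still open at the end of the text.
    return $out . render_tags(array_reverse($stack), '/');
}

echo split_inline_around_blocks('<strong>some text<div>some more text</div>aaaa</strong>'), "\n";
```

Run on the two inputs from this post, it gives the suggested outputs, except that the trailing space after "text" in the second one is kept.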

9

Re: kses compatibility question

Thanks j8 for the suggestion.

I too have been thinking of implementing a better correction algorithm. Having a stack array can also provide a way to check for admin-specified nesting (e.g., not allowing 'hr' inside 'td') and for better standard compliance (e.g., allowing 'tfoot' only before 'tbody', preventing empty 'table', etc.), though the speed will take some hit.

In your specific case, though, a preview system that allows users to see what the post would look like can reduce the need for such a better correction algorithm.

10 (edited by j8 2008-03-23 21:29:12)

Re: kses compatibility question

Yeah, I already provide a preview system (with Ajax when it's available, or just a simple $_POST on that same page).

Regardless (I'm not saying the current correction is not good), I think the auto-correction system could be improved greatly, maybe without much of a performance hit. In my case that would be neat because if someone wants to color the whole post blue (don't know why someone would want that, but I'm just saying :P ), the system would automatically figure that out and break up the span, so it provides the required functionality to the end user and is valid XHTML at the same time. Basically, it makes entering tags much simpler.

[ b]bla[div]bla[/div]bla[ /b]
is much simpler than
[ b]bla[ /b][div][ b]bla[ /b][/div][ b]bla[ /b]
at least in what you have to type.

If the conversion from one to the other happened automatically, though, it would be a non-issue. I could add a [bb] tag, for example, which creates a div instead, but that is just one more tag for someone to learn, and it also requires an understanding of what is a block and what is an inline element, which many users lack :(