1

Topic: Refining tag transformation?

Hi there!
I have a question regarding transforming tags. I have an XML file generated by a Flash CMS that in turn becomes the content both in Flash and in XHTML pages, which serve as alternate content for a full Flash site. Flash supports only a limited number of HTML tags, and not all of them are valid in XHTML 1.0 Strict, hence the use of HTML Purifier. This is an excerpt from the XML file:

<FONT FACE="Arial" SIZE="38" COLOR="#000000" LETTERSPACING="0" KERNING="0">This is the head.</FONT>

<FONT FACE="Matrix" SIZE="8" COLOR="#000000" LETTERSPACING="0" KERNING="0">This is the body copy.</FONT>

As you can see, both use the font tag but with different values for the face attribute. Is there a way to transform the font tag differently depending on the value of the face attribute?
My knowledge of PHP is quite limited... :-(
So I would like the font tag for the header to be transformed to a bold or a header tag, but not the one for the body copy...
If someone has the time to help me out with this, it would be great!!

I've actually tried using HTML Purifier but never really managed to set it up the way I wanted. Also, that library is just huge in comparison to htmLawed!! It seems to be such overkill for what I'm trying to achieve. I've played around with htmLawed and it feels so simple and fast in comparison!!

Niklas

2

Re: Refining tag transformation?

It is possible to do so through a custom function called through $config['hook_tag'] -- see the documentation to get an idea. I can help write such a function if you provide me with the desired functionality and possible scenarios [what font-size to use for which font, the range of font-sizes possible in the input, etc.].

3

Re: Refining tag transformation?

Hi there Patnaik!

First of all - thanks for the fast answer!!
I've looked in the documentation at hook_tag through the link that you provided.
My knowledge of PHP is good enough to follow the documentation for the different htmLawed settings, but creating a function for hook_tag seems a little over my head at the moment, so if you have the time to help me it would really be appreciated!!
Actually, the only thing that will determine the different tag transforms is the font used. So I guess the only thing that matters is the value of the face attribute? I will remove everything else, like size, and set that through a linked CSS file...
Niklas

4

Re: Refining tag transformation?

Tag transformation results in <font> getting converted to a <span> with a 'style' attribute within. With a custom function called through $config['hook_tag'], the 'style' value can be altered to use a different 'font-size' depending on the font. As you mention, one can alternately have the function replace 'style' with 'class'. If you want some coding help, let me know.
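
As a rough illustration of the 'class' alternative, here is a minimal sketch of a hook function that swaps the generated 'style' for a 'class' derived from the font-family. The function name my_font_to_class and the class-naming scheme are made up for this example. It assumes the two-argument calling convention (element name, attribute array) and is called directly here rather than through htmLawed.

```php
<?php
// Hypothetical hook function: replaces the 'style' on a <span> with a 'class'
// derived from its font-family, so the look can be set in a linked CSS file.
function my_font_to_class($element, $attribute_array){
  if($element == 'span' && isset($attribute_array['style'])){
    // Parse 'style' into property => value pairs
    $style = array();
    foreach(explode(';', $attribute_array['style']) as $v){
      if(($p = strpos($v, ':')) > 0){
        $style[strtolower(trim(substr($v, 0, $p)))] = trim(substr($v, $p+1));
      }
    }
    if(isset($style['font-family'])){
      // e.g. "FFF Executive Bold" becomes "font-fff-executive-bold"
      $class = 'font-'. trim(preg_replace('`[^a-z0-9]+`', '-', strtolower($style['font-family'])), '-');
      unset($attribute_array['style']);
      $attribute_array['class'] = $class;
    }
  }
  // Build the opening tag (empty elements like <br /> are not handled in this sketch)
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }
  return "<{$element}{$attributes}>";
}

echo my_font_to_class('span', array('style' => 'font-family: Arial; color: #000000;'));
```

Elements other than <span>, and <span> elements without a font-family, keep their attributes unchanged.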

5

Re: Refining tag transformation?

Hmm...I see.
What I'm actually looking for is this:
The font tag gets converted to a header tag, like h2 (header tags are not supported in Flash), when a specific font is being used. I know that this is kind of strange, but the reason is SEO. A certain font is being used for headers in the Flash content (a non-standard font, since you can embed fonts in Flash), and my goal in the XHTML content is to tell the search engines that these are headers. The span tag is not that important... I don't know if this is possible since it's kind of a special case :-)

Niklas

6

Re: Refining tag transformation?

Below are some code snippets you can fiddle with.

In this scenario, htmLawed is set to transform the <font> elements, converting them to <span>. htmLawed is also set to use a custom function, 'my_font_transform', so the newly generated <span> elements are re-filtered/-transformed.

As you can see, one has many ways to get the right transformations performed.

$config = array(... 'hook_tag' => 'my_font_transform' ...); // htmLawed config
$out = htmLawed($in, $config); // the htmLawed call

The 'my_font_transform' function receives two arguments, the element name, and the attribute name-value pairs (in an array). All elements go through it, but only the relevant <span> ones are altered.

function my_font_transform($element, $attribute_array){

  // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
  if($element == 'span' && isset($attribute_array['style'])){

    // Identify CSS properties and values, keyed by property name
    $css = explode(';', $attribute_array['style']);
    $style = array();
    foreach($css as $v){
      if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
        $css_property_name = trim(substr($v, 0, $p));
        $css_property_value = trim(substr($v, $p+1));
        $style[$css_property_name] = $css_property_value;
      }
    }
    
    // Alter the CSS property as required

    // Black Arial must be at a font-size of 24
    if(isset($style['font-family']) && $style['font-family'] == 'Arial' && isset($style['color']) && $style['color'] == '#000000'){
      $style['font-size'] = '24';
    }

    // And so on for other criteria
    // ...

    // Re-build 'style' from the property name-value pairs
    $pairs = array();
    foreach($style as $name=>$value){
      $pairs[] = "$name: $value";
    }
    $attribute_array['style'] = implode('; ', $pairs);
  }

  // Build the attributes string
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }

  // Return the opening tag with attributes
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}

To convert <span> with a certain style to, say, <h2>, one can use

...
    // Black Arial at a font-size of 38 should be an h2
    if(isset($style['font-family']) && $style['font-family'] == 'Arial' && isset($style['color']) && $style['color'] == '#000000' && isset($style['font-size']) && $style['font-size'] == '38'){
      return '<h2>';
    }
...

Note that the above code will temporarily 'unbalance' the input, with '... <font>... </font>...' becoming '... <span>... </span>...' and then '... <h2>... </span>...'. However, with tag balancing turned on, the text will eventually get corrected to '.... <h2>... </h2>'.

7

Re: Refining tag transformation?

Okay!
Thanks a lot for that!
I've tried different things here, but I don't really understand where the last snippet is supposed to go in the complete context. Are the middle snippet and the last one dependent on each other, or is the middle one a complete solution that can be altered with the last snippet?
Sorry that I'm such a newbie here :-( !
Niklas

8

Re: Refining tag transformation?

You're right; the last code snippet is an example of a variation of the middle one. The crucial functionality of the middle snippet is in the part:

    // Alter the CSS property as required

    // Black Arial must be at a font-size of 24
    if(isset($style['font-family']) && $style['font-family'] == 'Arial' && isset($style['color']) && $style['color'] == '#000000'){
      $style['font-size'] = '24';
    }

    // And so on for other criteria
    // ...

I put in some sample code (above) illustrating the use of conditional criteria for altering the font-size based on the font-family. The last snippet in my previous post was to convert the element to <h2> instead.

9

Re: Refining tag transformation?

Hi there - here I am again. I just can't make this work - I must be missing some vital part here.
Here is an excerpt of my code:

include("htmLawed.php");
    
    switch($_GET['swfaddress']) {
        case '/':            
            //Get xml file
            $xmlFileData = file_get_contents("xml/liveTEXT.xml");
            
            //Simple XML parser
            $xmlData = new SimpleXMLElement($xmlFileData);
            
            $xmlPara = $xmlData->page[0]->text;    
                    
            
            //"Purify" HTML to valid XHTML
            //$config ["valid_xhtml"] = 1;
            $config = array('hook_tag' => 'my_font_transform', 'balance' => 1); // htmLawed config
            $pure_html = htmLawed($xmlPara, $config);
            
                        
            function my_font_transform($element, $attribute_array){

                      // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
                      if($element == 'span' && isset($attribute_array['style'])){
                    
                        // Identify CSS properties and values
                        $css = explode(';', $attribute_array['style']);
                        $style = array();
                        foreach($css as $v){
                          if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
                            $css_property_name = trim(substr($v, 0, $p));
                            $css_property_value = trim(substr($v, $p+1));
                            $style[] = "$css_property_name: $css_property_value";
                          }
                        }
                        
                        // Alter the CSS property as required
                    
                         // Black Arial at a font-size of 38 should be an h2
                        if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#000000' && isset($style['font-size']) && $style['font-size'] == '8'){
                          return '<h2>';
                        }

                    
                        // And so on for other criteria
                        // ...
                    
                        // Re-build 'style'
                        $attribute_array['style'] = implode('; ', $style);
                      }
                    
                      // Build the attributes string
                      $attributes = '';
                      foreach($attribute_array as $k=>$v){
                        $attributes .= " {$k}=\"{$v}\"";
                      }
                    
                      // Return the opening tag with attributes
                      static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
                      return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
            }
                            
                        
            //The output    
            $final_html = "<table><tr><td>" . $pure_html . "</td></tr></table>";        
            echo($final_html);
            
            break;
            
                        
            
            
       case '/about/': 
            //Get xml file
            $xmlFileData = file_get_contents("xml/liveTEXT.xml");
            
            //Simple XML parser
            $xmlData = new SimpleXMLElement($xmlFileData);
            
            $xmlPara = $xmlData->page[1]->text;
            
            $pure_html = $purifier->purify($xmlPara);
            
            $final_html = "<table><tr><td>" . $pure_html . "</td></tr></table>";
                        
            //The output            
            echo($final_html);

The code for the last case is not correct for htmLawed at the moment, but I'm trying to make the first section work before moving on...
My XML looks like this:

<?xml version="1.0" encoding="utf-8"?><website><page><text id="greeting">&lt;P ALIGN=&quot;LEFT&quot;&gt;&lt;FONT FACE=&quot;FFF Executive Bold&quot; SIZE=&quot;8&quot; COLOR=&quot;#666666&quot; LETTERSPACING=&quot;0&quot; KERNING=&quot;0&quot;&gt;This text should be converted to h2 tag.&lt;/FONT&gt;&lt;/P&gt;</text></page></website>

The output I get - the $final_html is this:

<table><tr><td><p style="text-align: LEFT;"><span style="font-family: FFF; color: #666666;">This text should be converted to h2 tag.</span></p></td></tr></table>

The font becomes FFF. I've tried using both FFF and FFF Executive Bold in the vital part for the tag transform, but with no luck. :(

Niklas

10

Re: Refining tag transformation?

You have found a bug in htmLawed. The regular expression it uses to parse the 'face' attribute of <font> during tag transformation stops at the first space in the value and needs to be improved:

if(preg_match('`face\s*=\s*(\'|")?(.+?)(\\1|\s|$)`i', $a, $m)){ 

Thanks for the feedback. I should be able to correct this later today.

Updated: The new release, version 1.1.2, fixes this issue [incorrect parsing of <font> face when the value contains the space character].
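
To illustrate the problem, the short script below runs the pattern quoted in this post against the attribute string from post #9. The lazy (.+?) group is allowed to end at the first whitespace, so only "FFF" is captured. The second pattern is a hypothetical alternative for quoted values only, included for comparison; it is not necessarily the pattern used in version 1.1.2.

```php
<?php
// Attribute string as it appears in the XML from post #9
$a = 'FACE="FFF Executive Bold" SIZE="8" COLOR="#666666"';

// Pattern quoted in this post: the value may end at the first whitespace
preg_match('`face\s*=\s*(\'|")?(.+?)(\\1|\s|$)`i', $a, $m);
$old = $m[2];

// Hypothetical alternative (an assumption, quoted values only):
// the value runs up to the matching closing quote
preg_match('`face\s*=\s*(\'|")(.+?)\\1`i', $a, $m);
$new = $m[2];

echo "$old\n$new\n";
```

The first line printed is "FFF" and the second is "FFF Executive Bold".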

11

Re: Refining tag transformation?

Niklas,

Try the following for the my_font_transform function:

function my_font_transform($element, $attribute_array){
  // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
  if($element == 'span' && isset($attribute_array['style'])){
    // See if font properties appropriate to make an  h2
    $css = explode(';', $attribute_array['style']);
    $style = array();
    foreach($css as $v){
      if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
        $style[trim(substr($v, 0, $p))] = trim(substr($v, $p+1));
      }
    }
    if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#000000'){
      return '<h2>';
    }
  }
                    
  // Build the attributes string
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }
                    
  // Return the opening tag with attributes
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}

You'll note that I have removed the 'font-size' condition from the logic. The 'size' attribute for the <font> element is not supposed to have a value more than 7.

htmLawed is relatively tolerant of faulty HTML, but it cannot be too tolerant. For tag transformation, it ignores the 'size' attribute in <font> if the value exceeds 7. You can, however, modify htmLawed to not do so by replacing a couple of lines in the 'hl_tag2' function with:

 if(preg_match('`size\s*=\s*(\'|")?(.+?)(\\1|\s|$)`i', $a, $m)){
  $a2 .= ' font-size: '. (isset($fs[($m = trim($m[2]))]) ? $fs[$m] : $m). ';';
 }

12 (edited by Niklasinla 2009-01-23 08:44:02)

Re: Refining tag transformation?

Okay!
Thanks for the fast update!
I've used both code examples above and the Font name is correct and the size is also added.
But it still doesn't do the conversion to h2 tags.

I've set up the crucial part of the my_font_transform function to this:

if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#666666' && isset($style['font-size']) && $style['font-size'] == '8')
            {
              return '<h2>';
            }

But the output seems to be about the same as when using htmLawed in its simple configuration, except for the font size that is added by the modified hl_tag2. Also, it seems to have added an empty span tag. See below for my output:

<table><tr><td><p style="text-align: LEFT;"><span style="font-family: FFF Executive Bold; color: #666666; font-size: 8;">This text should be converted to h2 tag.<span></span></span></p></td></tr></table>

This is my htmLawed setup:

//"Purify" HTML to valid XHTML
            $config = array('hook_tag' => 'my_font_transform', 'balance' => 1); // htmLawed config
            $pure_html = htmLawed($xmlPara, $config);

I'm using the updated htmLawed version 1.1.2

Niklas

13

Re: Refining tag transformation?

I tried the following input

<p style="text-align: LEFT;">
  <font face="FFF Executive Bold" color="#666666" size="8">
This text should be converted to h2 tag.
  </font>
</p>

The output I get with the custom function is

<p style="text-align: LEFT;">
</p>
<h2>This text should be converted to h2 tag.
</h2>

So it is working, though now one is left with an empty <p>. This is because <h2> cannot be nested inside <p> (the correction occurs during tag balancing). Depending on your needs, you can leave the empty <p>, or remove it using a string replacement function on the htmLawed output, or through some code in the hook_tag function.
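
For the string replacement route, a minimal sketch could look like the following. The helper name remove_empty_paragraphs is made up for this example, and the pattern only targets <p> elements that contain nothing but whitespace.

```php
<?php
// Hypothetical helper: strips <p> elements that contain only whitespace
function remove_empty_paragraphs($html){
  // \b keeps the pattern from matching <pre>, <param>, etc.
  return preg_replace('`<p\b[^>]*>\s*</p>\s*`i', '', $html);
}

// The htmLawed output shown above
$out = "<p style=\"text-align: LEFT;\">\n</p>\n<h2>This text should be converted to h2 tag.\n</h2>";
echo remove_empty_paragraphs($out);
```

It would be applied to the string returned by htmLawed, before the surrounding table markup is added.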

I think the custom function code is the same as yours:

function my_font_transform($element, $attribute_array){
  // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
  if($element == 'span' && isset($attribute_array['style'])){
    // See if font properties appropriate to make an  h2
    $css = explode(';', $attribute_array['style']);
    $style = array();
    foreach($css as $v){
      if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
        $style[trim(substr($v, 0, $p))] = trim(substr($v, $p+1));
      }
    }
    if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#666666' && isset($style['font-size']) && $style['font-size'] == '8')
    {
      return '<h2>';
    }
  }
                    
  // Build the attributes string
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }
                    
  // Return the opening tag with attributes
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}

14

Re: Refining tag transformation?

Hi there!!
I'm starting to believe that there must be some collision with other PHP code in my file, because I can't make this work. I tried removing the code that handles the XML and used the content as HTML directly, but with the same result as above. I have a file called datasource.php which handles the content, and that content is then sent and "plugged" into an index.php file. I've checked the output directly from datasource.php, and the content is already behaving strangely there, before it is sent to index.php. See this link: http://widecircle.se/labs/webbmall/datasource.php?swfaddress=/ My conclusion is that the problem must be in datasource.php itself. The complete code of datasource.php is below (once again, the case for the about section is not set up correctly for htmLawed):

<?php

    header('Content-Type: text/xml;charset=utf-8');
    $base = strtolower(substr($_SERVER['SERVER_PROTOCOL'], 0, strrpos($_SERVER['SERVER_PROTOCOL'], '/'))) . '://' . $_SERVER['SERVER_NAME'] . substr($_SERVER['PHP_SELF'], 0, strrpos($_SERVER['PHP_SELF'], '/'));
    
    include("htmLawed.php");
    
    switch($_GET['swfaddress']) {
        case '/':            
            //Get xml file
            $xmlFileData = file_get_contents("xml/liveTEXT.xml");
            
            //Simple XML parser
            $xmlData = new SimpleXMLElement($xmlFileData);
            
            $xmlPara = $xmlData->page[0]->text;    
                    
            
            //"Purify" HTML to valid XHTML
            $config = array('balance' => 1, 'hook_tag' => 'my_font_transform'); // htmLawed config
            $pure_html = htmLawed($xmlPara, $config);
            
                        
            function my_font_transform($element, $attribute_array){
              // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
              if($element == 'span' && isset($attribute_array['style'])){
                // See if font properties appropriate to make an  h2
                $css = explode(';', $attribute_array['style']);
                $style = array();
                foreach($css as $v){
                  if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
                    $style[trim(substr($v, 0, $p))] = trim(substr($v, $p+1));
                  }
                }
                if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#666666' && isset($style['font-size']) && $style['font-size'] == '8')
                {
                  return '<h2>';
                }
              }
                                
              // Build the attributes string
              $attributes = '';
              foreach($attribute_array as $k=>$v){
                $attributes .= " {$k}=\"{$v}\"";
              }
                                
              // Return the opening tag with attributes
              static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
              return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
            }
    
                            
                        
            //The output    
            $final_html = "<table><tr><td>" . $pure_html . "</td></tr></table>";        
            echo($final_html);
            
            break;
            
                        
            
            
       case '/about/': 
            //Get xml file
            $xmlFileData = file_get_contents("xml/liveTEXT.xml");
            
            //Simple XML parser
            $xmlData = new SimpleXMLElement($xmlFileData);
            
            $xmlPara = $xmlData->page[1]->text;
            
            $pure_html = $purifier->purify($xmlPara);
            
            $final_html = "<table><tr><td>" . $pure_html . "</td></tr></table>";
                        
            //The output            
            echo($final_html);
            
            break;           
            
            
        
        default:
            echo('<p><!-- Status(404 Not Found) -->Page not found.</p>');
            break;
    }
?>

I might have to use something other than htmLawed for this, since this seems to be a special-case kind of thing. I'm also going to look into SimpleXML some more... not sure I'm handling that correctly...

15

Re: Refining tag transformation?

If I use the example XML string from post #9 above, the transformation _does_ work well. The test script code is below (htmLawed balances tags by default so "$config['balance'] = 1" is not needed).

// Error reporting
error_reporting(E_ALL | (defined('E_STRICT') ? E_STRICT : 0));
ini_set('display_errors', 1);

// Get <text> from the XML string
$xml = new SimpleXMLElement('<website><page><text id="greeting">&lt;P ALIGN=&quot;LEFT&quot;&gt;&lt;FONT FACE=&quot;FFF Executive Bold&quot; SIZE=&quot;8&quot; COLOR=&quot;#666666&quot; LETTERSPACING=&quot;0&quot; KERNING=&quot;0&quot;&gt;This text should be converted to h2 tag.&lt;/FONT&gt;&lt;/P&gt;</text></page></website>');

// htmLawed filtering
include 'htmLawed.php';
echo htmLawed($xml->page[0]->text, array('hook_tag' => 'my_font_transform'));

// The custom hook_tag function
function my_font_transform($element, $attribute_array){
 // Elements other than 'span' or 'span' without a 'style' attribute are returned unchanged
 if($element == 'span' && isset($attribute_array['style'])){
  // See if font properties appropriate to make an  h2
  $css = explode(';', $attribute_array['style']);
  $style = array();
  foreach($css as $v){
   if(($p = strpos($v, ':')) > 1 && $p < strlen($v)){
    $style[trim(substr($v, 0, $p))] = trim(substr($v, $p+1));
   }
  }
  if(isset($style['font-family']) && $style['font-family'] == 'FFF Executive Bold' && isset($style['color']) && $style['color'] == '#666666' && isset($style['font-size']) && $style['font-size'] == '8'){
   return '<h2>';
  }
 }
 // Build the attributes string
 $attributes = '';
 foreach($attribute_array as $k=>$v){
  $attributes .= " {$k}=\"{$v}\"";
 }
 // Return the opening tag with attributes
 static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
 return "<{$element}{$attributes}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}
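
One difference between this test script and the datasource.php in post #14 may be worth checking, though it is offered as a possibility rather than a confirmed diagnosis. In PHP, a function declared at the top level of a file is available from the start of the script, whereas a function declared inside a block (such as a switch case) is only defined once execution reaches the declaration. In datasource.php the htmLawed call comes before the declaration inside the case. If htmLawed checks that the hook function exists before using it (an assumption here), the hook would be silently skipped. The sketch below uses made-up function names to show the difference.

```php
<?php
// Declared at the top level (further down): already available here
var_dump(function_exists('top_level_hook'));

function top_level_hook($element, $attribute_array){
  return "<{$element}>";
}

switch('/'){
  case '/':
    // Declared inside a block: not defined until execution reaches the declaration
    $before = function_exists('nested_hook');
    function nested_hook($element, $attribute_array){
      return "<{$element}>";
    }
    $after = function_exists('nested_hook');
    var_dump($before, $after);
    break;
}
```

If this turns out to be the cause, moving the function declaration out of the switch, or above the htmLawed call, should be enough.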

16

Re: Refining tag transformation?

As of the new, 1.1.11, version of htmLawed, when a 'hook_tag' function has been declared, closing tag contents (and not just opening tag contents) are also passed to the function. If upgrading htmLawed, you may need to edit the 'hook_tag' function. See http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htmLawed_README.htm#s4.5.
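
As a rough sketch of what such an edit might look like, the function below also answers closing tags, which makes it possible to return a matching </h2> directly. It assumes that for a closing tag the function is called with the element name only, so the second argument falls back to its default; check the README for the actual convention. The name my_font_transform2 and the simplified font test are made up for this example.

```php
<?php
// ASSUMPTION: closing tags arrive with the element name only, so
// $attribute_array keeps its default value of 0 for them.
function my_font_transform2($element, $attribute_array = 0){
  static $open_spans = array(); // what each currently open <span> was turned into

  // Closing tag
  if(!is_array($attribute_array)){
    if($element == 'span' && count($open_spans)){
      $element = array_pop($open_spans);
    }
    return "</{$element}>";
  }

  // Opening tag
  if($element == 'span'){
    $style = isset($attribute_array['style']) ? $attribute_array['style'] : '';
    if(strpos($style, 'font-family: FFF Executive Bold') !== false){
      $open_spans[] = 'h2';
      return '<h2>';
    }
    $open_spans[] = 'span';
  }

  // Build the opening tag (empty elements like <br /> are not handled in this sketch)
  $attributes = '';
  foreach($attribute_array as $k=>$v){
    $attributes .= " {$k}=\"{$v}\"";
  }
  return "<{$element}{$attributes}>";
}

echo my_font_transform2('span', array('style' => 'font-family: FFF Executive Bold; color: #666666;'));
echo 'Heading';
echo my_font_transform2('span');
```

The stack relies on tags being passed to the function in document order, and on <span> tags being properly nested in the input.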