1

Topic: Bug? htmLawed cuts off text

Text copied and pasted from Microsoft Word gets cut off if it contains a ' or - in it.

Here are the settings I used.

$LawedConfigGold = array
(
    'comment' => 1,
    'cdata' => 1,
    'safe' => 1,
    'deny_attribute' => 'on*, style',
    'elements' => "a, b, br, p, pre, i, u, em, strong, span, div",
    'clean_ms_char' => 2,
    'hexdec_entity' => 0,
    'keep_bad' => 0
);

$LawedConfigNormal = array
(
    'comment' => 1,
    'cdata' => 1,
    'elements' => "none",
    'balance' => 1,
    'clean_ms_char' => 2,
    'hexdec_entity' => 0,
    'keep_bad' => 0,
    'safe' => 1,
    'deny_attribute' => 0
);

The test string I used was: This is a test with an ‘s and – so we need to see if it breaks.
I typed the test string in Microsoft Word 2003, copied and pasted it into the textarea, and ran it through the htmLawed filter.

The result is

This is a test with an

and, after removing the ‘ from the test string, the result is

This is a test with an s and

This happens when clean_ms_char is set to 1 or 2. I was using version 1.0.7; I replaced it with 1.0.8 and tested again, and the bug was still present.

2 (edited by patnaik 2008-05-30 17:05:37)

Re: Bug? htmLawed cuts off text

The issue is caused by 'clean_ms_char' alone, and not by the other config parameters, such as 'safe', that the author has in the array.

This is not a bug. It is a question of correct usage: the web-pages and forms involved need to be appropriately encoded and to have the right HTML code. Also, clean_ms_char should be used only when the input is known to use the Windows-1252/Cp-1252 character encoding (or a similar encoding, like Cp-1251).

Trying to test/troubleshoot this issue using a web interface is tricky because of the way browsers encode form data and the way they display data. However, using the form on the htmLawed demo web page, I do get the clean-up to work properly:

Using an IE 6 / IE 7 / FF 3 browser:

1. Copy-paste the string from MS Word into Input.
2. Have clean_ms_char set to 1 or 2 in Settings.
3. Have Encoding set to windows-1252.
4. Submit form.

Note
htmLawed's logic for 'clean_ms_char' is based on the Cp-1252 charset, and it may give undesirable results with other charsets such as Cp-1251 (used, e.g., by Word on Cyrillic Windows). For instance, htmLawed removes code-point '8f' because it is unused in Cp-1252, but in Cp-1251 that code-point is the Cyrillic capital dzhe. Admins expecting such charsets may not want to use the 'clean_ms_char' parameter, and may instead use other code, such as the one below, to pre-process input before passing it on to htmLawed:

// Array of code-points and the character entities to convert them to
$x = array(
  "\x8f"=>'&#1039;', // Change to HTML entity for Cyrillic capital dzhe
  "\x90"=>'&#1106;', // Change to HTML entity for Cyrillic small dje
  // ... And so on
);
// Preprocess input before passing to htmLawed
$t = strtr($t, $x);
$t = htmLawed($t);

3

Re: Bug? htmLawed cuts off text

The logic behind htmLawed's clean_ms_char parameter:

Characters such as a punctuation mark (like '?') or a digit (like '1') are represented by code-points; it is a code-point that is stored on a computer or transmitted over the internet. The character-code-point relation is dictated by character encodings or charsets. So the same sequence of code-points (string of bits/bytes) can be read as different character sequences by using different character encodings.

In (X)HTML (and XML in general), certain code-points are discouraged; i.e., they should not be present in such documents. However, Windows, using the Cp-1252 or Windows-1252 encoding (or a similar encoding, like Cp-1251), uses many such code-points to represent characters such as smart-quotes in applications such as Microsoft Word (I'm not sure which versions, if any, no longer have this problem; it certainly exists in Word 2003 on Windows XP). Also, those code-points actually refer to different characters in other character encodings (such as ISO-8859-1).
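To illustrate the point that the same code-point is read as different characters under different encodings, here is a minimal sketch. It assumes PHP's mbstring extension is available, and the variable names are only for illustration. It converts the single byte '97' (hexadecimal) to UTF-8 twice, once reading it as Windows-1252 and once as ISO-8859-1, and prints the resulting bytes.

```php
<?php
// Assumption: the mbstring extension is loaded.
$byte = "\x97"; // One byte, as it might arrive from a form

// Read the byte as Windows-1252: it is an em dash (U+2014)
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Read the same byte as ISO-8859-1: it is a control character (U+0097)
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n"; // e28094
echo bin2hex($as_latin1), "\n"; // c297
```

The two results differ even though the input byte is identical, which is why the encoding of the input has to be known before any clean-up is attempted.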

When one copies those characters from, say, an MS Word document, and pastes them in a form in a browser, the code-points are copied. How the browser interprets them or converts them (when transmitting to the server) for use with another encoding can vary from browser to browser, and system to system. It is also dictated by the 'accept-charset' attribute of the form, the 'charset' attribute in 'meta' elements of the web-page containing the form, etc.

What htmLawed does, when clean_ms_char is on, is to look for such discouraged code-points and convert them to different, but valid, code-points that represent the same or similar characters.

E.g., the code-point '151' ('97' in hexadecimal), used by MS Word for an 'em dash' character, is changed to the character sequence '& # 8 2 1 2 ;', which browsers read as a reference to an 'em dash' character. Thus, htmLawed effectively lets the character in the input through while converting the bad code-points to good ones. Note that in this example one code-point is replaced with a sequence of seven code-points. Because the replacement sequences use only the code-points for the '#', '&', ';' and digit characters of the character entities, the ultimate display of the characters on web-pages becomes character-encoding-independent (almost all encodings use the same code-points for those characters).
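The kind of replacement described above can be sketched as follows. This is only an illustration of the idea using strtr(); it is not htmLawed's actual code, and the small $map array is a hypothetical subset of the Cp-1252 code-points, chosen to cover the test string from the first post.

```php
<?php
// Hypothetical subset of a Cp-1252 replacement map (not htmLawed's own table)
$map = array(
    "\x91" => '&#8216;', // left single quotation mark
    "\x92" => '&#8217;', // right single quotation mark
    "\x96" => '&#8211;', // en dash
    "\x97" => '&#8212;', // em dash
);

// Input as Cp-1252 bytes, similar to text pasted from MS Word
$in = "a test with an \x91s and \x96 so";

// Each 'bad' byte is replaced by a numeric character reference
$out = strtr($in, $map);

echo $out, "\n";
```

Every byte in $out is plain ASCII, so the result displays the same way whatever encoding the page declares.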

----

Some noteworthy resources on this topic:

1. Shiflett: Convert smart quotes (http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php)
2. Korpela: Windows MS characters in HTML (http://www.cs.tut.fi/~jkorpela/www/windows-chars.html)
3. Jelliffe: Text encoding problems (http://www.oreillynet.com/xml/blog/2007/10/text_encodings_if_we_know_the.html)
4. Flavell: Form submission and I18N (http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html)

4

Re: Bug? htmLawed cuts off text

The htmLawed demo web page now also shows hexdumps (binaries; the code-points) of the input and the filtered output.
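For those who want to check the code-points in their own scripts, a similar hexdump can be produced with PHP's bin2hex(). This is a minimal sketch; the function name hexdump_bytes is made up for illustration and is not part of htmLawed.

```php
<?php
// Hypothetical helper: returns the bytes of $s as space-separated hex pairs
function hexdump_bytes($s) {
    return rtrim(chunk_split(bin2hex($s), 2, ' '));
}

// "\x91" is the Cp-1252 byte for a left single quotation mark
$dump = hexdump_bytes("an \x91s");

echo $dump, "\n"; // 61 6e 20 91 73
```

Running it on the raw form input before filtering shows whether the browser actually sent Cp-1252 bytes such as '91' or something else.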

5

Re: Bug? htmLawed cuts off text

Only a few encodings use the code-points from 128 to 159 (decimal) -- see, e.g., this set of charset tables (http://www.columbia.edu/kermit/csettables.html). One may thus be able to use code like this to auto-set the 'clean_ms_char' parameter:

// The $config array

$config = array(
 // ...
);

// Check for the 'bad' characters, and accordingly set 'clean_ms_char'

if(preg_match('`[\x80-\x9f]`', $in)){
 $config['clean_ms_char'] = 1;
}

// htmLawed process

$out = htmLawed($in, $config);
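One caveat with the check above: in UTF-8 input, the bytes '80' to '9f' also occur as parts of multi-byte characters, so the pattern can match text that has nothing to do with Cp-1252. A possible refinement, sketched below, is to skip the check when the input is valid UTF-8. The function name is hypothetical, and the assumption that valid UTF-8 input should be left alone is mine, not something htmLawed prescribes.

```php
<?php
// Hypothetical helper: true only if $in has bytes 0x80-0x9f AND is not valid UTF-8
function looks_like_cp1252_input($in) {
    // With the 'u' modifier, preg_match() returns 1 only for valid UTF-8
    if (preg_match('//u', $in) === 1) {
        return false;
    }
    return preg_match('/[\x80-\x9f]/', $in) === 1;
}

$config = array();
$in = "an \x91s and \x96 so"; // Cp-1252 bytes, as pasted from MS Word

if (looks_like_cp1252_input($in)) {
    $config['clean_ms_char'] = 1;
}
```

The same string sent as UTF-8 (where the left quote is the three bytes 'e2 80 98') would not set the parameter.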