The logic behind htmLawed's clean_ms_char parameter:
Characters such as a punctuation mark (like '?') or a digit (like '1') are represented by code-points; it is the code-point, not the character itself, that is stored on a computer or transmitted over the internet. The relation between characters and code-points is dictated by the character encoding (charset) in use. So the same sequence of code-points (string of bits/bytes) can be read as different character sequences under different character encodings.
In (X)HTML (and XML in general), certain code-points are discouraged; i.e., they should not be present in such documents. However, Windows applications such as Microsoft Word, using the Cp-1252 or Windows-1252 encoding (or a similar encoding, like Cp-1251), use many such code-points to represent characters such as smart-quotes. (I'm not sure which versions, if any, no longer have this problem; it certainly exists in Word 2003 on Windows XP.) Also, those code-points refer to different characters in other character encodings (such as ISO-8859-1).
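To see this concretely, here is a small sketch in Python (used only for illustration; htmLawed itself is written in PHP). It decodes the single byte 151 under two encodings, using Python's standard codec names 'cp1252' (Windows-1252) and 'latin-1' (ISO-8859-1):

```python
# The same stored byte, read under two different character encodings.
raw = bytes([151])  # 97 in hexadecimal

readings = {
    "cp1252": raw.decode("cp1252"),    # Windows-1252
    "latin-1": raw.decode("latin-1"),  # ISO-8859-1
}

for encoding, char in readings.items():
    # Print the Unicode code-point that each encoding assigns to the byte.
    print(f"{encoding}: U+{ord(char):04X}")
```

Under Windows-1252 the byte is read as an em dash (U+2014); under ISO-8859-1 it is read as a control character (U+0097), which is the kind of code-point that is discouraged in (X)HTML.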
When one copies those characters from, say, an MS Word document, and pastes them into a form in a browser, the code-points are copied. How the browser interprets them or converts them (when transmitting to the server) for use with another encoding can vary from browser to browser, and from system to system. It is also influenced by the 'accept-charset' attribute of the form, the 'charset' attribute in 'meta' elements of the web-page containing the form, etc.
What htmLawed does, when clean_ms_char is on, is to look for such discouraged code-points and convert them to different, but valid, code-points that represent the same or similar characters.
E.g., the code-point '151' (97 in hexadecimal), used by MS Word for an 'em dash' character, is changed to the character sequence '& # 8 2 1 2 ;' (without the spaces), which is read by browsers as a reference to an 'em dash' character. Thus, htmLawed effectively lets the character in the input through while converting the bad code-points to good ones. Note that in this example one code-point is replaced with a sequence of seven code-points. Because the replacement sequences consist only of the code-points for the '&', '#', ';' and digit characters that make up the character entities, the ultimate display of the characters on web-pages becomes character-encoding-independent (almost all encodings use the same code-points for those characters).
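The idea can be sketched in a few lines of Python. This is not htmLawed's own code (htmLawed is PHP, and its actual replacement table may differ); the function name clean_ms_bytes is made up for this illustration. The sketch assumes the input is single-byte text in which bytes 128 to 159 are meant as Windows-1252 characters, and it simply drops the few bytes in that range that Windows-1252 leaves unassigned:

```python
def clean_ms_bytes(raw: bytes) -> str:
    """Replace Windows-1252 bytes 128-159 with numeric character references.

    Hypothetical helper for illustration only; not htmLawed's implementation.
    """
    out = []
    for b in raw:
        if 0x80 <= b <= 0x9F:
            try:
                char = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte is unassigned in Windows-1252 (e.g. 0x81).
                # Assumption of this sketch: drop it.
                continue
            # ord() gives the Unicode code-point, e.g. 8212 for the em dash.
            out.append(f"&#{ord(char)};")
        else:
            # All other bytes pass through, read as ISO-8859-1.
            out.append(chr(b))
    return "".join(out)


print(clean_ms_bytes(b"2003\x972004"))  # 2003&#8212;2004
```

The replacement for byte 151 is the seven-character reference described above, built only from '&', '#', digits and ';', so it survives whatever encoding the page is finally served in.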
----
Some noteworthy resources on this topic:
1. Shiflett: Convert smart quotes (http://shiflett.org/blog/2005/oct/convert-smart-quotes-with-php)
2. Korpela: Windows MS characters in HTML (http://www.cs.tut.fi/~jkorpela/www/windows-chars.html)
3. Jelliffe: Text encoding problems (http://www.oreillynet.com/xml/blog/2007/10/text_encodings_if_we_know_the.html)
4. Flavell: Form submission and I18N (http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html)