1

Topic: A note on handling UTF-8 texts

Regarding the post with id 240:

It may be worth mentioning that if you have the config

clean_ms_char = 1

turned on, this will destroy Korean or Japanese characters, even though the text is properly UTF-8 encoded.

As soon as you turn this config off, everything works as expected.

2

Re: A note on handling UTF-8 texts

Thank you for bringing this to my attention. I will add the information to the htmLawed documentation, which does not state this clearly.

The $config["clean_ms_char"] parameter should not be used if authors do not copy-paste Microsoft-created text, or if the input text is not believed to use Windows-1252 (Cp-1252) or a similar encoding such as Cp-1251.
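
To illustrate why UTF-8 text can get damaged, here is a minimal Python sketch. It assumes the cleanup works byte by byte on the values that Cp-1252 uses for "smart" punctuation (0x91 to 0x97 in this example). The function naive_ms_clean and its replacement table are hypothetical and only mimic that kind of cleanup; they are not htmLawed's actual code. The point is that UTF-8 continuation bytes fall in the range 0x80 to 0xBF, so they can collide with those values.

```python
# Illustration only: a byte-wise Cp-1252 cleanup, NOT htmLawed's real implementation.
# The table and function name are assumptions made for this example.
CP1252_PUNCTUATION = {
    0x91: b"'",   # left single quotation mark in Cp-1252
    0x92: b"'",   # right single quotation mark
    0x93: b'"',   # left double quotation mark
    0x94: b'"',   # right double quotation mark
    0x95: b"*",   # bullet
    0x96: b"-",   # en dash
    0x97: b"-",   # em dash
}


def naive_ms_clean(data: bytes) -> bytes:
    """Replace Cp-1252 punctuation bytes one byte at a time."""
    return b"".join(CP1252_PUNCTUATION.get(b, bytes([b])) for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Cp-1252 input: the byte 0x92 really is a curly apostrophe, so cleaning helps.
cp1252_text = b"It\x92s fine"
cleaned_cp1252 = naive_ms_clean(cp1252_text)

# UTF-8 input: the Korean syllable U+D55C is encoded as ED 95 9C.
# The middle byte 0x95 is a continuation byte, but the cleanup treats it as a bullet.
korean = "\ud55c".encode("utf-8")
cleaned_korean = naive_ms_clean(korean)

print(cleaned_cp1252)                 # the apostrophe is now plain ASCII
print(is_valid_utf8(korean))          # the original bytes are valid UTF-8
print(is_valid_utf8(cleaned_korean))  # the cleaned bytes are no longer valid
```

In this sketch the Cp-1252 input comes out as intended, while the Korean character loses one of its three bytes and no longer decodes. That matches the behaviour reported above and is the reason to leave the option off for input that is already UTF-8.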