NYCPHP Meetup

Tue Nov 22 05:19:55 EST 2005

Allen Shaw wrote:
> I have an app that should be storing input only in Latin-1 characters, 
> but which will probably be used by English-speaking individuals in Asia, 
> Middle East, and other locales. I expect that some of those people will 
> sometimes type input in their own local character set, but I really do 
> not want to store information that I won't even be able to read later.
> 
> I've been reading around and have gotten more of an understanding of 
> this issue lately, but still don't understand enough to filter user 
> input to be sure it's in the Latin-1 character set. I tried using 
> "accept-charset" attribute in <form> tag, and I also thought that at 
> least MySQL would mangle input that's not Latin-1 on this database 
> having a charset value of "Latin-1", but darnit, mysql is too good for 
> me and stores and retrieves Korean and Japanese text without a hitch.

The problem is that you cannot reliably tell different 8-bit 
encodings apart. Latin-1 (iso-8859-1) and Latin-9 (iso-8859-15) 
text may contain identical byte sequences that still stand for 
different content (byte 0xA4 is the currency sign in Latin-1 but the 
euro sign in Latin-9), so you have no way to know which one the user 
intended.
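
To make the ambiguity concrete, here is a small sketch that decodes 
one and the same byte under both encodings. It assumes the mbstring 
extension is loaded; the variable names are only for illustration.

```php
<?php
// The same single byte, interpreted under two different 8-bit encodings.
$byte = "\xA4";

// Latin-1 maps 0xA4 to U+00A4 (currency sign).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
// Latin-9 maps 0xA4 to U+20AC (euro sign).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

Nothing in the byte itself says which of the two results is the one 
the user meant.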

Some 8-bit encodings have different *probabilities* for different 
byte sequences, so you could make an educated guess about which 
encoding the user agent really used. That would still be just a 
guess.

The way I do it is that I send the HTML with UTF-8 encoding and 
check that the user input is a valid UTF-8 byte sequence. (I also 
have <form accept-charset="UTF-8" ...> in case some user agent 
supports that; most user agents just use the same encoding as the 
page containing the form.) If the user input isn't a valid UTF-8 
sequence, I simply assume that the real encoding is iso-8859-1 and 
blindly convert from iso-8859-1 to UTF-8. I have yet to see a user 
agent that supports different encodings AND doesn't support UTF-8 
AND defaults to any charset other than iso-8859-1.
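
A minimal sketch of that validate-or-convert step, assuming the 
mbstring extension is available. The function name 
normalize_to_utf8() is made up for this example and is not taken 
from any real code base.

```php
<?php
// Hypothetical helper: return $input as UTF-8.
// Valid UTF-8 is passed through untouched; anything else is assumed
// (not detected!) to be iso-8859-1 and converted on that assumption.
function normalize_to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

// "café" sent as Latin-1: the lone 0xE9 byte is not valid UTF-8.
echo bin2hex(normalize_to_utf8("caf\xE9")), "\n";     // 636166c3a9
// "café" already sent as UTF-8 is returned unchanged.
echo bin2hex(normalize_to_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9
```

The fallback branch is a blind assumption: every byte sequence is 
valid iso-8859-1, so the conversion always succeeds, whether or not 
the guess was right.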

So far, this has worked perfectly. Or should I say, almost 
perfectly... I'm pretty sure that some Windows user agents send 
incorrect Unicode characters that are technically valid UTF-8. I 
believe those characters are meant to represent code points from the 
windows-1252 character set, and the user agent doesn't correctly 
translate from windows-1252 to UTF-8. Expect to see such input if a 
user, for example, copies text containing "smart quotes" from MS 
Word into a text area.
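
One plausible reading of this, and it is only an assumption, is that 
the stray characters are code points U+0080 to U+009F. Those are 
control characters in Unicode, while windows-1252 uses the same byte 
values (0x80 to 0x9F) for printable characters such as curly quotes. 
Under that assumption, a repair could look like the sketch below. It 
needs mbstring and PCRE with UTF-8 support; repair_c1_as_cp1252() is 
a hypothetical name.

```php
<?php
// Hypothetical helper: reinterpret U+0080..U+009F in a UTF-8 string
// as the windows-1252 characters with the same byte value.
function repair_c1_as_cp1252(string $utf8): string
{
    // Byte values that windows-1252 leaves undefined; keep those as-is.
    $undefined = ["\x81", "\x8D", "\x8F", "\x90", "\x9D"];

    $fixed = preg_replace_callback(
        '/[\x{80}-\x{9F}]/u',
        function (array $m) use ($undefined): string {
            // U+0080..U+009F is encoded as C2 80..C2 9F in UTF-8, so the
            // second byte equals the code point (and the cp1252 byte).
            $byte = $m[0][1];
            if (in_array($byte, $undefined, true)) {
                return $m[0];
            }
            return mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
        },
        $utf8
    );

    // preg_replace_callback() returns null on error, e.g. invalid UTF-8.
    return $fixed === null ? $utf8 : $fixed;
}

// U+0093 and U+0094 become the curly quotes U+201C and U+201D.
echo bin2hex(repair_c1_as_cp1252("\xC2\x93hi\xC2\x94")), "\n";
// e2809c6869e2809d
```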

> Naturally mangled input is not what I want, so I guess it's okay that 
> mysql didn't destroy the data. :o) What I really want is that if 
> $_POST['usertext'] == 大阪市浪速区のマンション [wow, how does that look 
> in your mail reader?] then the app will know it and tell the user 
> "please enter only characters in the Latin-1 set".

Those Japanese characters work just fine because your mail used 
UTF-8 encoding.
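
Once the input is known to be valid UTF-8, rejecting anything outside 
Latin-1 can be a simple range check, since iso-8859-1 covers exactly 
the code points U+0000 to U+00FF. A sketch, assuming PCRE with UTF-8 
support; is_latin1_only() is a made-up name.

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string is a
// code point in U+0000..U+00FF, i.e. representable in iso-8859-1.
// Invalid UTF-8 makes preg_match() return false, so it is rejected too.
function is_latin1_only(string $utf8): bool
{
    return preg_match('/^[\x{00}-\x{FF}]*$/u', $utf8) === 1;
}

var_dump(is_latin1_only("caf\xC3\xA9"));  // bool(true), "café"
var_dump(is_latin1_only("\xE5\xA4\xA7")); // bool(false), U+5927
```

When the check fails, the app can show the "please enter only 
characters in the Latin-1 set" message instead of storing the text. 
Note that this range also admits control characters; tighten it if 
those should be refused as well.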

-- 
Mikko

NYPHP.org

[nycphp-talk] enforcing Latin-1 input