[nycphp-talk] enforcing Latin-1 input
Mikko Rantalainen
mikko.rantalainen at peda.net
Tue Nov 22 05:19:55 EST 2005
Allen Shaw wrote:
> I have an app that should be storing input only in Latin-1 characters,
> but which will probably be used by English-speaking individuals in Asia,
> Middle East, and other locales. I expect that some of those people will
> sometimes type input in their own local character set, but I really do
> not want to store information that I won't even be able to read later.
>
> I've been reading around and have gotten more of an understanding of
> this issue lately, but still don't understand enough to filter user
> input to be sure it's in the Latin-1 character set. I tried using
> "accept-charset" attribute in <form> tag, and I also thought that at
> least MySQL would mangle input that's not Latin-1 on this database
> having a charset value of "Latin-1", but darnit, mysql is too good for
> me and stores and retrieves Korean and Japanese text without a hitch.
The problem is that you cannot reliably tell different 8-bit
encodings apart. Latin-1 (iso-8859-1) and Latin-9 (iso-8859-15)
text may consist of identical byte sequences and still represent
different content, so you have no way to know which one the user
intended.
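As a small illustration (a Python sketch, not taken from any app
discussed in this thread): the single byte 0xA4 is valid in both
encodings, but it decodes to the currency sign in iso-8859-1 and to
the euro sign in iso-8859-15.

```python
# The same byte decodes differently depending on the assumed encoding.
raw = b"\xa4"

as_latin1 = raw.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")  # U+20AC EURO SIGN

# Both decodes succeed, so the bytes alone cannot reveal the intent.
print(repr(as_latin1), repr(as_latin9))
```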
Some 8-bit encodings have different *probabilities* for different
byte sequences, so you could make an educated guess about which
encoding the user agent really used. That would still be just a guess.
The way I do it is that I send the HTML with UTF-8 encoding (I also
have <form accept-charset="UTF-8" ...> in case some user agent
supports that; most user agents just use the same encoding as the
page containing the form) and I check that the user input is a valid
UTF-8 byte sequence. If it isn't, I simply assume that the real
encoding is iso-8859-1 and blindly convert from iso-8859-1 to
UTF-8 -- I've yet to see a user agent which does support different
encodings AND doesn't support UTF-8 AND defaults to any charset
other than iso-8859-1.
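The validate-then-fall-back step described above could look roughly
like this. It is a minimal Python sketch, not my actual code; the
function name to_utf8_text is made up for illustration, and it
assumes the raw form value is available as bytes.

```python
def to_utf8_text(raw: bytes) -> str:
    """Decode form input as UTF-8, falling back to iso-8859-1."""
    try:
        # Strict decoding raises on any invalid UTF-8 byte sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume iso-8859-1, where every byte maps
        # to exactly one character, so this decode cannot fail.
        return raw.decode("iso-8859-1")


print(to_utf8_text("café".encode("utf-8")))  # valid UTF-8 input
print(to_utf8_text(b"caf\xe9"))              # iso-8859-1 input
```

Note that the fallback never fails, which is exactly why it is a
blind conversion: input in some other 8-bit encoding would be
accepted and silently misread.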
So far, this has worked perfectly. Or should I say, almost
correctly... I'm pretty sure that some Windows user agents send
incorrect Unicode characters that are technically correctly encoded
as UTF-8. I believe that those invalid characters are meant to
represent code points from the windows-1252 character set and that
the user agent doesn't correctly translate from windows-1252 to
UTF-8. Expect to see such input if a user, for example, copies text
containing "smart quotes" from MS Word into a text area.
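One possible repair, sketched in Python under a loud assumption: that
the incorrect characters arrive as code points U+0080..U+009F (C1
control characters), which is what you get when windows-1252 bytes
are treated as iso-8859-1 before being encoded as UTF-8. I have not
verified that this is what those user agents actually send, and the
name repair_c1_controls is hypothetical.

```python
def repair_c1_controls(text: str) -> str:
    """Reinterpret U+0080..U+009F as windows-1252 bytes (assumption)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is unassigned in windows-1252; keep as-is
        out.append(ch)
    return "".join(out)


# 0x93 and 0x94 are the curly double quotes in windows-1252.
print(repair_c1_controls("\x93smart quotes\x94"))
```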
> Naturally mangled input is not what I want, so I guess it's okay that
> mysql didn't destroy the data. :o) What I really want is that if
> $_POST['usertext'] == 大阪市浪速区のマンション [wow, how does that look
> in your mail reader?] then the app will know it and tell the user
> "please enter only characters in the Latin-1 set".
Those Japanese characters work just fine because your mail used
UTF-8 encoding.
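Once the input has been decoded to Unicode text, the check you are
asking for reduces to "can this text be encoded as iso-8859-1?". A
minimal Python sketch of that idea (the function name is_latin1 is
only illustrative):

```python
def is_latin1(text: str) -> bool:
    """Return True if every character fits in iso-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


print(is_latin1("café"))    # accented Latin letters are fine
print(is_latin1("大阪市"))  # Japanese text is not representable
```

When is_latin1() returns False, the app can show the "please enter
only characters in the Latin-1 set" message. Note that this check
only works after decoding; on raw bytes every byte value would pass.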
--
Mikko