NYCPHP Meetup

Thu Nov 24 04:56:57 EST 2005

Allen Shaw wrote:
> Mikko Rantalainen wrote:
> 
>>Allen Shaw wrote:
>>
>>>[snip] what you think of this half-baked idea: [snip...]
> 
>>You cannot trust that behavior. Specification only says (IIRC) that 
>>the user agent MUST not send characters outside iso-8859-1 on such a 
>>form. 
> 
> Okay, I'm in way over my head here.  I'd like to get my hands on that 
> spec -- would you have a link or some reasonably unique keywords to 
> google for (w3c, character encoding, specification, etc. don't seem to 
> be cutting it...)?  I should just dig in there and understand what I'm 
> doing before trying to implement anything, I think.

http://whatwg.org/specs/web-forms/current-work/#unacceptableCharacters
http://www.w3.org/TR/1999/REC-html401-19991224/interact/forms.html#adef-accept-charset
http://www.ietf.org/rfc/rfc2388 section "4.5 Charset of text in form 
data" and section "5.6 Interoperability with web applications"

Be warned that some older clients don't like to use the 
"multipart/form-data" encoding, which is the one that should carry 
a clear label stating the charset used.

>>I guess that what I'm trying to tell you is that to *force* 
>>iso-8859-1 input only, you're going to have to use UTF-8 for the 
>>form and you're going to have to use UTF-8 internally. That's the 
>>only way you can really get in iso-8859-1 encoding the same data the 
>>user really tried to input.
> 
> What I'm really trying to do is not encode their input into Latin-1, but 
> figure out if they _entered_ Latin-1 characters in the form and if so 
> accept it, or if not, reject it and tell them why.  If we just encode 

But the problem is that unless you're using UTF-8, you cannot always 
distinguish between iso-8859-1 and, say, windows-1255. The safest way 
I can think of is to require UTF-8 encoding and then check that the 
real data I'm getting only uses characters that can be represented 
in iso-8859-1. UTF-8 is nice because, unlike iso-8859-1 or 
windows-1255, it cannot contain arbitrary 8-bit bytes: every byte 
above 0x7F has to be part of a well-formed multi-byte sequence. It's 
pretty easy to identify an illegal UTF-8 sequence. It's next to 
impossible to identify a mislabeled 8-bit encoding, because almost 
any byte string is valid in any of them.
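
A minimal sketch of that check, written in Python purely for 
illustration (the function name check_latin1_input is made up, and 
this is one way to do it under the assumption that the form really 
is submitted as UTF-8, not a definitive implementation). Note that 
Python's "iso-8859-1" codec accepts everything from U+0000 to 
U+00FF, including control characters, so a real form would probably 
want a stricter whitelist on top of this:

```python
def check_latin1_input(raw: bytes) -> str:
    """Return the decoded text if raw is valid UTF-8 and every
    character in it can be represented in iso-8859-1.
    Raise ValueError otherwise. (Hypothetical helper name.)"""
    try:
        # Strict decoding: any illegal UTF-8 sequence raises.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-8") from exc
    try:
        # Encoding fails on the first character above U+00FF.
        text.encode("iso-8859-1")
    except UnicodeEncodeError as exc:
        bad = text[exc.start]
        raise ValueError(f"character {bad!r} is outside iso-8859-1") from exc
    return text
```

Valid UTF-8 such as "café" passes, Hebrew text sent as UTF-8 is 
rejected with a message naming the offending character, and a raw 
Latin-1 byte string such as b"caf\xe9" is rejected as invalid UTF-8.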

As I said, I always try to get the input as UTF-8, and if it 
doesn't look like valid UTF-8, I just assume that the input is 
iso-8859-1. It's not _really_ safe but, as I said, it's really hard 
to tell whether an *incorrectly* encoded message contains something 
valuable in some unknown encoding.
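
Roughly, that fallback looks like the following. This is a sketch in 
Python for illustration only; decode_form_value is a hypothetical 
name, and the iso-8859-1 fallback is a guess, not detection:

```python
from typing import Tuple


def decode_form_value(raw: bytes) -> Tuple[str, str]:
    """Decode raw as UTF-8 if it is valid UTF-8; otherwise assume
    iso-8859-1. Returns (text, encoding_used).
    (Hypothetical helper name.)"""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte string decodes as iso-8859-1, so this never
        # fails. That is exactly why it is only an assumption.
        return raw.decode("iso-8859-1"), "iso-8859-1"
```

The unsafe part is visible with the two bytes 0xC3 0xA9: a client 
that really meant the iso-8859-1 text "Ã©" sends those bytes, but 
they also happen to be valid UTF-8 for "é", so the function above 
returns "é".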

You could just use UTF-8 for the form and check whether the user 
agent sends valid UTF-8 strings. If not, tell the user that the 
user agent they are using is broken. No need to guess which 
character the user really meant.

It's a shame that PHP has such weak support for UTF-8.

-- 
Mikko


[nycphp-talk] enforcing Latin-1 input