[nycphp-talk] enforcing Latin-1 input (follow-up)
Allen Shaw
ashaw at polymerdb.org
Wed Nov 30 12:06:17 EST 2005
Thanks Mikko. This is very, very helpful. As of a couple of months ago
I'm half a country away from any NYPHP meeting, but maybe I should mail
you a beer from Texas. :o)
- Allen
Mikko Rantalainen wrote:
> Allen Shaw wrote:
>
>>With much thanks to Mikko for his help, I've finally figured this out
>>enough to come up with a way to test for valid Latin-1 input. If I am
>>right about this, then the following modification to Mikko's code will
>>report whether or not a particular string, assumed to be UTF-8 encoded,
>>is within the Latin-1 character set:
>>
>>-----------8<-----------
>>function isValidLatin1String($Str) {
>
>
> It seems that this function requires UTF-8 input and doesn't convert
> it to Latin1... in which case the name of the function is misleading
> at best.
>
> I think it should be called
> isValidUtf8PresentationOfLatin1()
> instead.
>
>
>> $latinHex = array ('20', //
>> '21', // !
>> '22', // "
>> // snip for brevity...
>> 'c3bf' // ÿ
>> );
>>
>> # While checking for valid UTF-8 stream, compile each character
>> # as hex codes and match with latinHex array;
>> # correct UTF-8 stream has every character starting with zero bit
>> # or first byte has <length of encoding> high bits set and all
>> # following bytes have highest bits set to 10.
>> for ($i=0; $i<strlen($Str); $i++)
>> {
>> if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
>> else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
>> else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
>> else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
>> else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
>> else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
>> else return false; # invalid byte
>> # verify that n bytes matching bit sequence 10bbbbbb
>> # follow where bbbbbb is not 000000
>> # failing this test means that input is "overlong UTF-8
>> # encoding", which is not allowed.
>> $char = bin2hex($Str[$i]);
>
>
> There's an error here, if I've understood the code correctly.
> Because you have multibyte sequences in the above $latinHex array,
> you must also compose the full UTF-8 byte sequence for the input
> character to check against that array.
>
>
>> for ($j=0; $j<$n; $j++) {
>> $chara .= bin2hex($Str[++$i]);
>
>
> Oh, I think that should be |$char .=| instead. In that case, this
> part of the loop does the thing I describe above.
>
>
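A minimal sketch of the point above about composing the full byte sequence before the lookup. The helper name utf8SequenceHex() is hypothetical and not from the thread; it assumes sequences of at most four bytes (the quoted code also accepts the older five- and six-byte forms) and it does not by itself reject overlong encodings.

```php
<?php
// Hypothetical helper: return the hex form of the UTF-8 sequence that
// starts at byte offset $i, or false if the sequence is malformed.
function utf8SequenceHex($str, $i) {
    $lead = ord($str[$i]);
    if ($lead < 0x80)                $n = 0; // 0bbbbbbb
    else if (($lead & 0xE0) == 0xC0) $n = 1; // 110bbbbb
    else if (($lead & 0xF0) == 0xE0) $n = 2; // 1110bbbb
    else if (($lead & 0xF8) == 0xF0) $n = 3; // 11110bbb
    else return false;                       // stray continuation or invalid byte

    // All $n continuation bytes must exist inside the string.
    if ($i + $n >= strlen($str)) return false;

    // Each continuation byte must match 10bbbbbb.
    for ($j = 1; $j <= $n; $j++) {
        if ((ord($str[$i + $j]) & 0xC0) != 0x80) return false;
    }
    // Lead byte plus continuation bytes, e.g. "c3bf" for U+00FF.
    return bin2hex(substr($str, $i, $n + 1));
}
```

The returned string has the same shape as the keys in $latinHex, so it can be passed straight to the array lookup.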
>> if (($i == strlen($Str))
>> || ((ord($Str[$i]) & 0xC0) != 0x80)) {
>> return false;
>> }
>> }
>> if (!in_array($char, $latinHex)) {
>> return false;
>> }
>> }
>> # couldn't find errors, it's probably valid Latin-1 data.
>
>
> Again, if there are no errors, then it's (and not just probably in
> this case!) a valid presentation of a Latin1 string in UTF-8 encoding.
>
>
>> return true;
>>}
>
>
> You could also implement an easy optimization combined with
> additional features in the following way:
>
> change $latinHex array to look like
>
> $latinHex = array(
> '21' => "!",
> '22' => "\"",
> // snip for brevity...
> 'c3bf' => "\xFF", // ÿ (U+00FF, Latin-1 byte 0xFF)
> );
>
> Notice that you shouldn't try to insert any 8 bit characters in that
> array verbatim.
>
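One way to avoid typing any 8-bit characters verbatim is to generate the table. This is only a sketch: buildLatin1Table() is a hypothetical name, and the accepted range (printable ASCII 0x20-0x7E plus 0xA0-0xFF) is an assumption, since the full $latinHex list is snipped in the thread.

```php
<?php
// Hypothetical builder: map the hex of each character's UTF-8 encoding
// to the corresponding single Latin-1 byte.
function buildLatin1Table() {
    $table = array();
    // Printable ASCII: one byte in both encodings.
    for ($cp = 0x20; $cp <= 0x7E; $cp++) {
        $table[bin2hex(chr($cp))] = chr($cp);
    }
    // U+00A0..U+00FF: two bytes in UTF-8 (110000bb 10bbbbbb), one in Latin-1.
    for ($cp = 0xA0; $cp <= 0xFF; $cp++) {
        $utf8 = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        $table[bin2hex($utf8)] = chr($cp);
    }
    return $table;
}
```

Because the source file then contains only ASCII, the table does not depend on the encoding the file itself is saved in.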
> Then you could replace the part
>
> if (!in_array($char, $latinHex)) {
> return false;
> }
>
> with
>
> if (!isset($latinHex[$char])) {
> return false;
> }
>
> which should result in much more efficient execution. In addition,
> it's now easy to collect the respective Latin1 string with a simple
> catenation of characters. Just replace the above with
>
> if (isset($latinHex[$char]))
> {
> $output .= $latinHex[$char];
> }
> else
> {
> return false; // invalid character
> }
>
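Putting the suggestions together, the whole check-and-convert loop might look like the sketch below. The name utf8ToLatin1() and the accepted character range are assumptions. Unlike the quoted code, ASCII bytes also go through the table here, so control characters such as newline are rejected; sequences longer than four bytes are treated as invalid.

```php
<?php
// Hypothetical sketch: return the Latin-1 form of a UTF-8 string, or
// false if the input is malformed or contains a character outside the
// assumed range (0x20-0x7E and 0xA0-0xFF).
function utf8ToLatin1($str) {
    static $table = null;
    if ($table === null) {
        // Keys: hex of the UTF-8 sequence. Values: the Latin-1 byte.
        $table = array();
        for ($cp = 0x20; $cp <= 0x7E; $cp++) {
            $table[bin2hex(chr($cp))] = chr($cp);
        }
        for ($cp = 0xA0; $cp <= 0xFF; $cp++) {
            $utf8 = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
            $table[bin2hex($utf8)] = chr($cp);
        }
    }

    $output = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80)                $n = 0; // 0bbbbbbb
        else if (($byte & 0xE0) == 0xC0) $n = 1; // 110bbbbb
        else if (($byte & 0xF0) == 0xE0) $n = 2; // 1110bbbb
        else if (($byte & 0xF8) == 0xF0) $n = 3; // 11110bbb
        else return false;                       // invalid lead byte

        // Collect the full sequence as hex, checking each 10bbbbbb byte.
        $sequence = bin2hex($str[$i]);
        for ($j = 0; $j < $n; $j++) {
            $i++;
            if ($i == $len || (ord($str[$i]) & 0xC0) != 0x80) return false;
            $sequence .= bin2hex($str[$i]);
        }

        // Only canonical encodings are keys, so overlong forms fail here too.
        if (!isset($table[$sequence])) return false;
        $output .= $table[$sequence];
    }
    return $output;
}
```

An empty input yields an empty string, so callers should compare the result with === false rather than testing it for truthiness.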
> You might want to rename $char to $sequence (or something like that)
> to better describe its meaning in the above function.
>
>
> Another implementation would be to use the isValidUTF8String()
> function I provided earlier and, if the input is a UTF-8 string, then
> you just do
>
> $latin1 = html_entity_decode(htmlentities($utf8, ENT_NOQUOTES,
> 'UTF-8'));
>
> and check that there are no HTML entities in the result. This can be
> done easily with a regexp (didn't try to run this):
>
> if (preg_match("@&(?!lt;|gt;|amp;)@",$latin1))
> return false; # UTF-8 input outside Latin1
> else
> return $latin1;
>
> The above regexp tries to match character "&" unless it's part of
> &lt; or &gt; or &amp;
>
> This implementation assumes that htmlentities() doesn't have bugs.
> I'm not sure if that's a safe assumption...
>
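The regexp from the message above can be tried on its own. In this sketch, hasLeftoverEntity() is a hypothetical wrapper; it only exercises the pattern and says nothing about how htmlentities() or html_entity_decode() behave on a given PHP version.

```php
<?php
// Hypothetical wrapper around the regexp quoted above.
function hasLeftoverEntity($text) {
    // Match "&" unless it begins &lt; &gt; or &amp;
    return preg_match('@&(?!lt;|gt;|amp;)@', $text) === 1;
}
```

For example, hasLeftoverEntity('a &lt; b &amp; c') is false, while hasLeftoverEntity('price: &euro;5') is true because &euro; has no Latin-1 equivalent to decode to.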
--
Allen Shaw
Polymer (http://polymerdb.org)