NYCPHP Meetup

Wed Nov 30 12:06:17 EST 2005

Thanks Mikko.  This is very, very helpful.  As of a couple of months ago 
I'm half a country away from any NYPHP meeting, but maybe I should mail 
you a beer from Texas.  :o)

- Allen

Mikko Rantalainen wrote:
> Allen Shaw wrote:
> 
>>With much thanks to Mikko for his help, I've finally figured this out 
>>enough to come up with a way to test for valid Latin-1 input.  If I am 
>>right about this, then the following modification to Mikko's code will 
>>report whether or not a particular string, assumed to be UTF-8 encoded, 
>>is within the Latin-1 character set:
>>
>>-----------8<-----------
>>function isValidLatin1String($Str) {
> 
> 
> It seems that this function requires UTF-8 input and doesn't convert 
> it to Latin1... in which case the name of the function is misleading 
> at best.
> 
> I think it should be called
> isValidUtf8PresentationOfLatin1()
> instead.
> 
> 
>>     $latinHex = array ('20', //
>>     '21', // !
>>     '22', // "
>>     // snip for brevity...
>>     'c3bf' // ÿ
>>     );
>>
>>     # While checking for valid UTF-8 stream, compile each character
>>     # as hex codes and match with latinHex array;
>>     # correct UTF-8 stream has every character starting with zero bit
>>     # or first byte has <length of encoding> high bits set and all
>>     # following bytes have highest bits set to 10.
>>     for ($i=0; $i<strlen($Str); $i++)
>>     {
>>         if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
>>         else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
>>         else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
>>         else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
>>         else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
>>         else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
>>         else return false; # invalid byte
>>         # verify that n bytes matching bit sequence 10bbbbbb
>>         # follow where bbbbbb is not 000000
>>         # failing this test means that input is "overlong UTF-8
>>         # encoding", which is not allowed.
>>         $char = bin2hex($Str[$i]);
> 
> 
> There's an error here, if I've understood the code correctly. 
> Because you have multibyte sequences in the above $latinHex array, 
> you must also compose the full UTF-8 byte sequence for the input 
> character to check against that array.
> 
> 
>>         for ($j=0; $j<$n; $j++) {
>>             $chara .= bin2hex($Str[++$i]);
> 
> 
> Oh, I think that should be |$char .=| instead. In that case, this 
> part of the loop does the thing I describe above.
> 
> 
>>             if (($i == strlen($Str))
>>                 || ((ord($Str[$i]) & 0xC0) != 0x80)) {
>>                 return false;
>>             }
>>         }
>>         if (!in_array($char, $latinHex)) {
>>             return false;
>>         }
>>     }
>>     # couldn't find errors, it's probably valid Latin-1 data.
> 
> 
> Again, if there are no errors, then it's (and not just probably in 
> this case!) a valid presentation of a Latin1 string in UTF-8 encoding.
> 
> 
>>     return true;
>>}
> 
> 
> You could also implement an easy optimization, combined with 
> additional features, in the following way:
> 
> change $latinHex array to look like
> 
> $latinHex = array(
> '21' => "!",
> '22' => "\"",
> // snip for brevity...
> 'c3bf' => "\xff", // ÿ (byte 0xFF in Latin1)
> );
> 
> Note that you shouldn't insert any 8-bit characters into that 
> array verbatim; use escape sequences instead.
> 
> Then you could replace the part
> 
>           if (!in_array($char, $latinHex)) {
>               return false;
>           }
> 
> with
> 
>           if (!isset($latinHex[$char])) {
>               return false;
>           }
> 
> which should result in much more efficient execution, because isset() 
> is a hash lookup on the key while in_array() scans every value. In 
> addition, it's now easy to collect the respective Latin1 string by 
> simple concatenation of characters. Just replace the above with
> 
>           if (isset($latinHex[$char]))
>           {
>               $output .= $latinHex[$char];
>           }
>           else
>           {
>               return false; // invalid character
>           }
> 
> You might want to rename $char to $sequence (or something like that) 
> to better describe its meaning in the above function.
> 
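For reference, the table lookup and output collection described above can be put together roughly as follows. This is only a sketch under assumptions: the names buildLatinHexTable() and utf8ToLatin1() are made up here, and since the original $latinHex list was snipped, the table is generated for U+0020-U+007E and U+00A0-U+00FF, which may not match the original list exactly. Unlike the quoted loop, ASCII bytes are looked up in the table too, so control characters such as newlines are rejected.

```php
<?php
// Hypothetical helper: builds the "UTF-8 bytes as hex" => "Latin-1 byte" table.
// Assumption: the allowed set is U+0020-U+007E plus U+00A0-U+00FF.
function buildLatinHexTable()
{
    $table = array();
    for ($cp = 0x20; $cp <= 0xFF; $cp++) {
        if ($cp >= 0x7F && $cp <= 0x9F) {
            continue; // skip DEL and the C1 control range
        }
        if ($cp < 0x80) {
            $utf8 = chr($cp); // 0bbbbbbb
        } else {
            // 110bbbbb 10bbbbbb
            $utf8 = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        }
        $table[bin2hex($utf8)] = chr($cp); // e.g. 'c3bf' => "\xFF"
    }
    return $table;
}

// Hypothetical name. Returns the Latin-1 string, or false if the input is
// not valid UTF-8 or contains a character missing from the table.
function utf8ToLatin1($str)
{
    static $latinHex = null;
    if ($latinHex === null) {
        $latinHex = buildLatinHexTable();
    }
    $output = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) $n = 0;                      // 0bbbbbbb
        elseif (($byte & 0xE0) == 0xC0) $n = 1;        // 110bbbbb
        elseif (($byte & 0xF0) == 0xE0) $n = 2;        // 1110bbbb
        elseif (($byte & 0xF8) == 0xF0) $n = 3;        // 11110bbb
        else return false;                             // invalid lead byte

        // Compose the full byte sequence as hex, checking bounds
        // before reading each continuation byte (10bbbbbb).
        $sequence = bin2hex($str[$i]);
        for ($j = 0; $j < $n; $j++) {
            $i++;
            if ($i >= $len || (ord($str[$i]) & 0xC0) != 0x80) {
                return false;
            }
            $sequence .= bin2hex($str[$i]);
        }
        if (!isset($latinHex[$sequence])) {
            return false; // valid UTF-8 perhaps, but not in the table
        }
        $output .= $latinHex[$sequence];
    }
    return $output;
}
```

Calling utf8ToLatin1("caf\xC3\xA9") should give back the four Latin-1 bytes "caf\xE9", while a euro sign ("\xE2\x82\xAC") or a truncated sequence gives false. Because the table only holds shortest-form encodings, overlong sequences such as "\xC0\xA0" fail the lookup as well.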
> 
> Another implementation would be to use the isValidUTF8String() 
> function I provided earlier and, if the input is a valid UTF-8 string, 
> then you just do
> 
> $latin1 = html_entity_decode(htmlentities($utf8, ENT_NOQUOTES, 
> 'UTF-8'));
> 
> and check that there are no HTML entities in the result. This can be 
> done easily with a regexp (I didn't try to run this):
> 
> if (preg_match("@&(?!lt;|gt;|amp;)@",$latin1))
> 	return false; # UTF-8 input outside Latin1
> else
> 	return $latin1;
> 
> The above regexp tries to match the character "&" unless it's part of 
> &lt; or &gt; or &amp;
> 
> This implementation assumes that htmlentities() doesn't have bugs. 
> I'm not sure if that's a safe assumption...
> 
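If trusting htmlentities() is a worry, one possible cross-check is a single PCRE match in UTF-8 mode. To be clear, this is a different technique from either of the approaches quoted above, and the function name isLatin1RepresentableUtf8() is made up for illustration. It assumes the same allowed set as before (U+0020-U+007E plus U+00A0-U+00FF) and relies on preg_match() returning false when the subject is not valid UTF-8 under the /u modifier.

```php
<?php
// Hypothetical helper, not from the thread: true only if $utf8 is valid
// UTF-8 and every character is in U+0020-U+007E or U+00A0-U+00FF.
function isLatin1RepresentableUtf8($utf8)
{
    // /u  : treat pattern and subject as UTF-8; invalid input makes
    //       preg_match() return false rather than 0 or 1.
    // \A, \z : anchor at the absolute start and end of the subject
    //       ("$" would also match just before a trailing newline).
    return preg_match('/\A[\x{20}-\x{7E}\x{A0}-\x{FF}]*\z/u', $utf8) === 1;
}
```

This only answers yes or no; it does not produce the Latin-1 bytes, so a conversion step is still needed afterwards.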

-- 
Allen Shaw
Polymer (http://polymerdb.org)

[nycphp-talk] enforcing Latin-1 input (follow-up)