NYCPHP Meetup

Wed Nov 30 05:20:47 EST 2005

Allen Shaw wrote:
> With much thanks to Mikko for his help, I've finally figured this out 
> enough to come up with a way to test for valid Latin-1 input.  If I am 
> right about this, then the following modification to Mikko's code will 
> report whether or not a particular string, assumed to be UTF-8 encoded, 
> is within the Latin-1 character set:
> 
> -----------8<-----------
> function isValidLatin1String($Str) {

It seems that this function requires UTF-8 input and doesn't convert 
it to Latin-1, in which case the name of the function is misleading 
at best.

I think it should be called
isValidUtf8PresentationOfLatin1()
instead.

>      $latinHex = array ('20', //
>      '21', // !
>      '22', // "
>      // snip for brevity...
>      'c3bf' // ÿ
>      );
> 
>      # While checking for valid UTF-8 stream, compile each character
>      # as hex codes and match with latinHex array;
>      # correct UTF-8 stream has every character starting with zero bit
>      # or first byte has <length of encoding> high bits set and all
>      # following bytes have highest bits set to 10.
>      for ($i=0; $i<strlen($Str); $i++)
>      {
>          if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
>          else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
>          else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
>          else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
>          else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
>          else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
>          else return false; # invalid byte
>          # verify that n bytes matching bit sequence 10bbbbbb
>          # follow where bbbbbb is not 000000
>          # failing this test means that input is "overlong UTF-8
>          # encoding", which is not allowed.
>          $char = bin2hex($Str[$i]);

There's an error here, if I've understood the code correctly. 
Because you have multibyte sequences in the above $latinHex array, 
you must also compose the full UTF-8 byte sequence for the input 
character to check against that array.

>          for ($j=0; $j<$n; $j++) {
>              $chara .= bin2hex($Str[++$i]);

Oh, I think that should be |$char .=| instead. In that case, this 
part of the loop does the thing I describe above.

>              if (($i == strlen($Str))
> 		|| ((ord($Str[$i]) & 0xC0) != 0x80)) {
>                  return false;
>              }
>          }
>          if (!in_array($char, $latinHex)) {
>              return false;
>          }
>      }
>      # couldn't find errors, it's probably valid Latin-1 data.

Again, if there are no errors, then it is (and not just probably, in 
this case!) a valid representation of a Latin-1 string in UTF-8 encoding.

>      return true;
> }
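
To make the two fixes above concrete (compose the full byte sequence 
in $char, and check the bounds before reading a continuation byte), 
here is a minimal sketch. It is not the original function: the name 
utf8SequencesAsHex() is hypothetical, it only checks the byte 
structure (it does not reject overlong encodings by itself), and it 
leaves out the 5- and 6-byte forms because RFC 3629 limits UTF-8 to 
4 bytes.

```php
<?php
// Hypothetical helper, not part of the original function: splits a string
// into UTF-8 sequences and returns each one as a hex string, or false if
// the byte structure is malformed.
function utf8SequencesAsHex(string $str)
{
    $sequences = [];
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            $n = 0;                       // 0bbbbbbb: single byte
        } elseif (($byte & 0xE0) === 0xC0) {
            $n = 1;                       // 110bbbbb: one continuation byte
        } elseif (($byte & 0xF0) === 0xE0) {
            $n = 2;                       // 1110bbbb: two continuation bytes
        } elseif (($byte & 0xF8) === 0xF0) {
            $n = 3;                       // 11110bbb: three continuation bytes
        } else {
            return false;                 // invalid lead byte
        }
        $char = bin2hex($str[$i]);        // start with the lead byte
        for ($j = 0; $j < $n; $j++) {
            $i++;
            // Check the bounds before reading the continuation byte.
            if ($i >= $len || (ord($str[$i]) & 0xC0) !== 0x80) {
                return false;
            }
            $char .= bin2hex($str[$i]);   // append to $char, not $chara
        }
        $sequences[] = $char;
    }
    return $sequences;
}
```

With this, utf8SequencesAsHex("a\xc3\xbf") gives array('61', 'c3bf'), 
and 'c3bf' is exactly the kind of key the $latinHex array expects.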

You could also implement an easy optimization, combined with an 
additional feature, in the following way:

Change the $latinHex array to look like this:

$latinHex = array(
'21' => "!",
'22' => "\"",
// snip for brevity...
'c3bf' => "\xff", // ÿ (U+00FF, Latin-1 byte 0xFF)
);

Note that you shouldn't insert any 8-bit characters into that array 
verbatim; write them as escape sequences so the values don't depend 
on the encoding of the source file.

Then you could replace the part

          if (!in_array($char, $latinHex)) {
              return false;
          }

with

          if (!isset($latinHex[$char])) {
              return false;
          }

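which should be much more efficient, since isset() is a single key 
lookup while in_array() scans the whole array. In addition, it's now 
easy to collect the respective Latin-1 string by simple concatenation 
of characters. Initialize $output to an empty string before the loop, 
and let the single-byte case fall through to this check (set $n=0 
instead of using continue), otherwise ASCII characters never reach 
$output. Then just replace the above with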

          if (isset($latinHex[$char]))
          {
              $output .= $latinHex[$char];
          }
          else
          {
              return false; // invalid character
          }
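
Putting the lookup and the output collection together, a minimal 
sketch could look like the following. The names buildLatinHexTable() 
and utf8ToLatin1() are hypothetical, and the range the table covers 
(0x20-0x7E and 0xA0-0xFF) is an assumption, since the original list 
is snipped. The table is generated in a loop only to keep the example 
short; a literal array works the same way.

```php
<?php
// Hypothetical helper: maps the hex form of each character's UTF-8 byte
// sequence to the matching Latin-1 byte. The covered range is an assumption.
function buildLatinHexTable(): array
{
    $table = [];
    for ($cp = 0x20; $cp <= 0xFF; $cp++) {
        if ($cp >= 0x7F && $cp <= 0x9F) {
            continue; // assumed excluded: DEL and the C1 control range
        }
        $utf8 = $cp < 0x80
            ? chr($cp)
            : chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        $table[bin2hex($utf8)] = chr($cp); // value is the Latin-1 byte
    }
    return $table;
}

// Returns the Latin-1 string, or false if $str is not a UTF-8 encoding of
// characters found in $table.
function utf8ToLatin1(string $str, array $table)
{
    $output = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            $n = 0;                      // single byte, still looked up below
        } elseif (($byte & 0xE0) === 0xC0) {
            $n = 1;                      // two-byte sequence
        } else {
            // Longer sequences encode code points above U+07FF, which this
            // table never contains; stray continuation bytes are invalid.
            return false;
        }
        $sequence = bin2hex($str[$i]);
        for ($j = 0; $j < $n; $j++) {
            $i++;
            if ($i >= $len || (ord($str[$i]) & 0xC0) !== 0x80) {
                return false;            // truncated or malformed sequence
            }
            $sequence .= bin2hex($str[$i]);
        }
        if (!isset($table[$sequence])) {
            return false;                // not in the table (or overlong)
        }
        $output .= $table[$sequence];
    }
    return $output;
}
```

Compare the result with === false, because an empty input 
legitimately returns an empty string. Overlong forms such as 
"\xc0\xa0" are rejected simply because they are not keys of the table.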

You might want to rename $char to $sequence (or something like that) 
to better describe its meaning in the above function.

Another implementation would be to use the isValidUTF8String() 
function I provided earlier and, if the input is a valid UTF-8 string, 
just do

$latin1 = html_entity_decode(htmlentities($utf8, ENT_NOQUOTES, 
'UTF-8'), ENT_NOQUOTES, 'ISO-8859-1');

and check that there are no HTML entities left in the result. This can 
be done easily with a regexp (I didn't try to run this):

if (preg_match("@&(?!lt;|gt;|amp;)@",$latin1))
	return false; # UTF-8 input outside Latin1
else
	return $latin1;

The above regexp matches the character "&" unless it is the start of 
&lt; or &gt; or &amp;

This implementation assumes that htmlentities() doesn't have bugs. 
I'm not sure if that's a safe assumption...
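
If you'd rather not depend on htmlentities() at all, one possible 
cross-check is sketched below. This is a different technique from the 
one above, not a variation of it: it relies on PCRE's /u modifier, the 
function name isUtf8WithinLatin1Range() is hypothetical, and it 
accepts every code point up to U+00FF (control characters included), 
which may be wider than the $latinHex list.

```php
<?php
// Hypothetical cross-check, independent of htmlentities(): true only if
// $utf8 is valid UTF-8 and every code point is at most U+00FF.
function isUtf8WithinLatin1Range(string $utf8): bool
{
    // With the /u modifier preg_match() returns false for subjects that
    // are not valid UTF-8, so anything other than 1 counts as a failure.
    return preg_match('/\A[\x{00}-\x{FF}]*\z/u', $utf8) === 1;
}
```

It only answers yes or no; the conversion itself still has to be done 
separately.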

-- 
Mikko

NYPHP.org

[nycphp-talk] enforcing Latin-1 input (follow-up)