[nycphp-talk] enforcing Latin-1 input (follow-up)
Mikko Rantalainen
mikko.rantalainen at peda.net
Wed Nov 30 05:20:47 EST 2005
Allen Shaw wrote:
> With much thanks to Mikko for his help, I've finally figured this out
> enough come up with a way to test for valid Latin-1 input. If I am
> right about this, then the following modification to Mikko's code will
> report whether or not a particular string, assumed to be UTF-8 encoded,
> is within the Latin-1 character set:
>
> -----------8<-----------
> function isValidLatin1String($Str) {
It seems that this function requires UTF-8 input and doesn't convert
it to Latin1, in which case the name of the function is misleading
at best.
I think it should be called
  isValidUtf8PresentationOfLatin1()
instead.
> $latinHex = array ('20', //
> '21', // !
> '22', // "
> // snip for brevity...
> 'c3bf' // ÿ
> );
>
> # While checking for valid UTF-8 stream, compile each character
> # as hex codes and match with latinHex array;
> # correct UTF-8 stream has every character starting with zero bit
> # or first byte has <length of encoding> high bits set and all
> # following bytes have highest bits set to 10.
> for ($i=0; $i<strlen($Str); $i++)
> {
> if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
> else if ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
> else if ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
> else if ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
> else if ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
> else if ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
> else return false; # invalid byte
> # verify that n bytes matching bit sequence 10bbbbbb
> # follow where bbbbbb is not 000000
> # failing this test means that input is "overlong UTF-8
> # encoding", which is not allowed.
> $char = bin2hex($Str[$i]);
There's an error here, if I've understood the code correctly.
Because you have multibyte sequences in the above $latinHex array,
you must also compose the full UTF-8 byte sequence for the input
character to check against that array.
> for ($j=0; $j<$n; $j++) {
> $chara .= bin2hex($Str[++$i]);
Oh, I think that should be |$char .=| instead. In that case, this
part of the loop does the thing I describe above.
> if (($i == strlen($Str))
> || ((ord($Str[$i]) & 0xC0) != 0x80)) {
> return false;
> }
> }
> if (!in_array($char, $latinHex)) {
> return false;
> }
> }
> # couldn't find errors, it's probably valid Latin-1 data.
Again, if there are no errors, then it is (and not just probably, in
this case!) a valid representation of a Latin1 string in UTF-8 encoding.
> return true;
> }
You could also implement an easy optimization, combined with
additional features, in the following way:
change the $latinHex array to look like
  $latinHex = array(
    '21' => "!",
    '22' => "\"",
    // snip for brevity...
    'c3bf' => "\xFF", // ÿ is the single byte 0xFF in Latin-1
  );
Notice that you shouldn't try to insert any 8-bit characters in that
array verbatim.
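Since the array is snipped above, here is a minimal sketch of how the
table could be generated instead of typed by hand. The function name
buildLatin1Map() is hypothetical, and the ranges 0x20-0x7E and 0xA0-0xFF
are my assumption about which characters the full list is meant to cover.
It relies only on the fact that a code point between 0x80 and 0xFF is
encoded in UTF-8 as the two bytes 110bbbbb 10bbbbbb:

```php
<?php
// Hypothetical helper: builds the map "hex of UTF-8 sequence" => Latin-1 byte.
// Assumption: only the printable ranges 0x20-0x7E and 0xA0-0xFF are wanted.
function buildLatin1Map()
{
    $map = array();
    // 0x20-0x7E: one byte, identical in UTF-8 and Latin-1.
    for ($cp = 0x20; $cp <= 0x7E; $cp++) {
        $map[bin2hex(chr($cp))] = chr($cp);
    }
    // 0xA0-0xFF: two-byte UTF-8 sequence 110bbbbb 10bbbbbb.
    for ($cp = 0xA0; $cp <= 0xFF; $cp++) {
        $utf8 = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
        $map[bin2hex($utf8)] = chr($cp);
    }
    return $map;
}

$latinHex = buildLatin1Map();
```

Building the values with chr() also keeps every 8-bit character out of
the source file.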
Then you could replace the part
  if (!in_array($char, $latinHex)) {
    return false;
  }
with
  if (!isset($latinHex[$char])) {
    return false;
  }
which should result in much more efficient execution, because isset()
is a single hash lookup whereas in_array() scans the whole array. In
addition, it's now easy to collect the respective Latin1 string with a
simple concatenation of characters. Just initialize $output to an empty
string before the loop and replace the above with
  if (isset($latinHex[$char]))
  {
    $output .= $latinHex[$char];
  }
  else
  {
    return false; // invalid character
  }
You might want to rename $char to $sequence (or something like that)
to better describe its meaning in the above function.
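Putting the suggestions together, a rough sketch could look like the
following. This is a simplification rather than the quoted function: the
name utf8ToLatin1() is hypothetical, the table is assumed to cover only
0xA0-0xFF, and because every such character is a two-byte sequence the
longer lead bytes are rejected right away instead of being parsed.

```php
<?php
// Sketch only: utf8ToLatin1() is a hypothetical name. Returns the Latin-1
// string, or false if the input contains anything outside the table.
function utf8ToLatin1($str)
{
    static $latinHex = null;
    if ($latinHex === null) {
        // Assumption: accept 0xA0-0xFF, each a sequence 110bbbbb 10bbbbbb.
        $latinHex = array();
        for ($cp = 0xA0; $cp <= 0xFF; $cp++) {
            $seq = chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
            $latinHex[bin2hex($seq)] = chr($cp);
        }
    }

    $output = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {            // 0bbbbbbb: ASCII, copy unchanged
            $output .= $str[$i];
            continue;
        }
        // Anything other than a two-byte lead, or a lead byte with no
        // byte after it, cannot be a Latin-1 character.
        if (($byte & 0xE0) != 0xC0 || $i + 1 >= $len) {
            return false;
        }
        $sequence = bin2hex($str[$i] . $str[$i + 1]);
        $i++;                          // skip the continuation byte
        if (!isset($latinHex[$sequence])) {
            return false;              // overlong, malformed or non-Latin-1
        }
        $output .= $latinHex[$sequence];
    }
    return $output;
}
```

Compare the result with === false, because an empty input gives back an
empty string, which also evaluates to false in a loose comparison.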
Another implementation would be to use the isValidUTF8String()
function I provided earlier, and if the input is a valid UTF-8 string,
you just do
  $latin1 = html_entity_decode(htmlentities($utf8, ENT_NOQUOTES,
                                            'UTF-8'));
and check that there are no HTML entities in the result. This can be
done easily with a regexp (I didn't try to run this):
  if (preg_match("@&(?!lt;|gt;|amp;)@",$latin1))
    return false; # UTF-8 input outside Latin1
  else
    return $latin1;
The above regexp tries to match the character "&" unless it's part of
&lt; or &gt; or &amp;
This implementation assumes that htmlentities() doesn't have bugs.
I'm not sure if that's a safe assumption...
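To make the behaviour of that regexp concrete, here is a small sketch
that wraps only the pattern itself. The name hasUnexpectedAmpersand() is
hypothetical, and nothing here exercises htmlentities() or
html_entity_decode():

```php
<?php
// Hypothetical wrapper around the pattern from the message above:
// true if the text contains "&" not followed by "lt;", "gt;" or "amp;".
function hasUnexpectedAmpersand($text)
{
    return preg_match("@&(?!lt;|gt;|amp;)@", $text) === 1;
}

$escapedOnly = hasUnexpectedAmpersand('a &lt; b &amp; c');
$otherEntity = hasUnexpectedAmpersand('price: &euro;5');
$bareAmp     = hasUnexpectedAmpersand('x & y');
```

Here $escapedOnly is false while $otherEntity and $bareAmp are both
true, so a bare "&" in the checked string is reported the same way as a
leftover entity.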
--
Mikko