[nycphp-talk] Any alternatives to mbstring for PHP+UTF-8?

Paul Houle paul at devonianfarm.com
Thu May 10 21:31:28 EDT 2007


Jakob Buchgraber wrote:
> Hey!
>
> I was wondering whether there are alternatives to mbstring for 
> handling UTF-8 encoded data with PHP?
> I am asking, because I'd like to play around with as many 
> "technologies" as possible before I actually start developing.
> I somehow also looked at the way Joomla! did it, but I don't really 
> like their solution.
>
    Sometimes you can process UTF-8 without doing anything special.  For 
instance,  if you want to pull some text out of a MySQL database and 
display it on a web page,  you can pass the UTF-8 text through without 
using mbstring in PHP:  the one thing you need to do is set the 
character encoding of the HTML document to UTF-8.
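
As a rough sketch of that pass-through (assuming PHP 7 or later; render_page is a made-up helper name, and $text stands in for whatever came back from the database):

```php
<?php
// Hypothetical helper: wrap UTF-8 text from the database in an HTML page.
// No mbstring calls; the multi-byte sequences pass through untouched.
function render_page(string $text): string
{
    // htmlspecialchars only rewrites ASCII characters (& < > " '), so with
    // the charset given as UTF-8 the non-ASCII bytes are left alone.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"UTF-8\"><title>Demo</title></head>\n"
         . "<body><p>" . $safe . "</p></body></html>\n";
}

// In a real script you would normally also send the HTTP header before
// any output:
// header('Content-Type: text/html; charset=UTF-8');
```

Declaring the encoding in both the HTTP header and the meta tag is a belt-and-braces choice,  not a requirement.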

    A big strength of UTF-8 is that it is compatible with US-ASCII:  
all US-ASCII characters are the same single bytes in UTF-8,  and those 
bytes never show up inside a multi-byte character.  This means that you 
can explode on ",",  "\t",  "\n" or a space just like you always do.
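
For instance,  something along these lines (the sample line is made up,  and the non-ASCII text is written as byte escapes so the example does not depend on the file's encoding):

```php
<?php
// A comma-separated line: "café", "naïve" and three Japanese characters,
// all spelled out as UTF-8 bytes.
$line = "caf\xC3\xA9,na\xC3\xAFve,\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Plain byte-oriented explode.  The comma is byte 0x2C, and every byte of
// a multi-byte UTF-8 sequence is 0x80 or above, so nothing is cut in half.
$parts = explode(",", $line);
```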

    Any regex on Unicode 'characters' can be translated to a regex that 
works on UTF-8 bytes.  This may be awkward sometimes,  but it can be an 
efficient way to do many operations,  including those that "get under 
the hood" of your language.
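
Here is one way that translation might look,  as a sketch only:  the regex "any one character" becomes an alternation over the byte patterns of 1-,  2-,  3- and 4-byte sequences.  utf8_count_chars is a made-up name,  and the pattern is deliberately loose (it does not reject overlong forms or surrogates).

```php
<?php
// Hypothetical helper: count characters with a byte-level regex.
// There is no /u modifier, so PCRE matches raw bytes.
function utf8_count_chars(string $s): int
{
    $pattern = '/[\x00-\x7F]'                 // 1 byte: US-ASCII
             . '|[\xC2-\xDF][\x80-\xBF]'      // 2-byte sequence
             . '|[\xE0-\xEF][\x80-\xBF]{2}'   // 3-byte sequence
             . '|[\xF0-\xF4][\x80-\xBF]{3}/'; // 4-byte sequence

    // preg_match_all returns the number of matches (false on error).
    return (int) preg_match_all($pattern, $s);
}
```

On well-formed input this gives the character count,  while strlen() on the same string gives the byte count.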

    Avoid unnecessary character conversions.  If you can take UTF-8 in,  
process it as UTF-8,  and output UTF-8,  that's really the best.  People 
who work with languages like Java,  that do character conversions for 
you,  often find they're not in control of their character conversions.  
Years ago I discovered that the contents of a PostgreSQL database were 
double-encoded...  The bytes that made up the first UTF-8 encoding were 
treated as ISO-8859-1 (Latin-1) characters,  and re-encoded as UTF-8...  If 
you're working with Unicode,  you'll probably need to deal with problems 
like this from time to time.
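
To make the double-encoding problem concrete,  here is a small sketch that reproduces it and then undoes one layer.  It assumes the iconv extension is available;  the function names are invented for the example,  and this is not a description of how that particular database was repaired.

```php
<?php
// Reproduce the bug: UTF-8 bytes are misread as ISO-8859-1 characters
// and encoded to UTF-8 a second time.
function double_encode($utf8)
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8);
}

// Undo one layer: map each character back to the single byte it came
// from.  iconv returns false if a character does not fit in ISO-8859-1,
// which is a hint the text was not double-encoded in the first place.
function undo_double_encode($mangled)
{
    return iconv('UTF-8', 'ISO-8859-1', $mangled);
}

$good    = "\xC3\xA9";                  // "é" in UTF-8 (2 bytes)
$mangled = double_encode($good);        // 4 bytes, displays as "Ã©"
$fixed   = undo_double_encode($mangled);
```

Only run a repair like this on data you know was double-encoded;  applied to good UTF-8 it will damage it.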

    The main weakness of UTF-8 is that it's a variable-length encoding.  
That means it's hard to pick out the N'th character of a string.  
mbstring has a function that lets you do this,  but be careful how you 
use it.  Getting the N'th character of a UTF-8 string is an O(N) 
operation,  because you have to scan from the start,  so looping over 
the whole string by index is O(N^2)...  Ouch.  
Efficient algorithms for UTF-8 tend to work sequentially -- and quite a 
few of them can be translated to string algorithms over the bytes.
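
A sequential walk might look something like this.  It is a sketch that assumes PHP 7 or later and well-formed input (it does no validation),  and utf8_chars is a made-up name.  The lead byte of each sequence tells you how many bytes to take,  so the whole string is covered in one O(N) pass.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters in one pass.
// Assumes $s is well-formed UTF-8; malformed input is not detected.
function utf8_chars(string $s): array
{
    $chars = [];
    $len   = strlen($s);    // length in bytes, not characters
    $i     = 0;

    while ($i < $len) {
        $lead = ord($s[$i]);

        // The lead byte determines the length of the sequence.
        if ($lead < 0x80) {
            $n = 1;         // 0xxxxxxx: US-ASCII
        } elseif ($lead < 0xE0) {
            $n = 2;         // 110xxxxx
        } elseif ($lead < 0xF0) {
            $n = 3;         // 1110xxxx
        } else {
            $n = 4;         // 11110xxx
        }

        $chars[] = substr($s, $i, $n);
        $i += $n;           // jump straight to the next lead byte
    }

    return $chars;
}
```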

    There's no substitute for understanding how Unicode and UTF-8 and 
related representations work -- if you work with it enough,  you'll see 
all kinds of malformed text and you'll need to be able to deal with it.
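
One cheap first check for malformed text,  without mbstring,  is to lean on PCRE's UTF-8 mode.  This is a sketch with a made-up function name;  it only tells you whether the bytes are well-formed,  not whether they mean what the author intended.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// With the /u modifier PCRE checks the subject string first; on malformed
// input preg_match() returns false instead of 1.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}
```

Note that double-encoded text passes this check,  since it is perfectly well-formed UTF-8 that just says the wrong thing.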



