[nycphp-talk] PCRE expression for tokenizing?
Michael B Allen
ioplex at gmail.com
Mon Jul 21 17:48:06 EDT 2008
I am trying to write a Wiki syntax tokenizer using preg_match. That is,
I want to match any token like '~', '**', '//', '=====', etc., but if
none of those tokens match I want to match any valid printable string.
The expression I have so far is the following:
@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@
The problem with this is that the [[:print:]] class matches the entire
input. Strangely if I use [a-zA-Z0-9 ]* instead it works (but of
course I want to support more than ASCII and a space).
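A small check that reproduces the difference, offered only as an illustration (the variable names are made up for this sketch). The likely reason is that '*' and '/' are themselves printable characters, so the greedy [[:print:]]* alternative runs straight through the markup, while the narrower class happens to exclude them:

```php
<?php
$input = 'The **fox** jumped //over// the fence';

// '*' and '/' are printable, so the greedy class consumes the whole input.
preg_match('@[[:print:]]*@', $input, $m);
var_dump($m[0] === $input); // bool(true)

// The narrower class excludes '*' and '/', so it stops at the first marker.
preg_match('@[a-zA-Z0-9 ]*@', $input, $m);
var_dump($m[0]); // string(4) "The "
```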
For example, given the input:
[The **fox** jumped //over// the fence]
I want each call to preg_match to return tokens (while advancing the
offset accordingly of course):
[The ]
[**]
[fox]
[**]
[ jumped ]
[//]
[over]
[//]
[ the fence]
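One possible way to get the token list above is sketched here under a few assumptions: the function name wiki_tokenize is hypothetical, the input is plain ASCII, and a lone '*' or '/' counts as ordinary text. The markers are tried first, and the text alternative uses a negative lookahead so that it stops just before any marker. \G anchors each match at the current offset, which is then advanced by the length of the token:

```php
<?php
// Hypothetical helper for illustration; not a definitive implementation.
function wiki_tokenize($input)
{
    // Markers first ('={1,5}' greedily takes the longest run of '=' up to 5),
    // then printable text consumed one character at a time, stopping
    // before anything that would start a marker.
    $re = '@\G(?:~|\*\*|//|={1,5}|(?:(?!~|\*\*|//|=)[[:print:]])+)@';

    $tokens = array();
    $offset = 0;
    $len = strlen($input);

    // Every alternative matches at least one character, so the offset
    // always advances; the loop ends on the first unmatchable byte.
    while ($offset < $len && preg_match($re, $input, $m, 0, $offset) === 1) {
        $tokens[] = $m[0];
        $offset += strlen($m[0]);
    }
    return $tokens;
}

print_r(wiki_tokenize('The **fox** jumped //over// the fence'));
```

Supporting more than ASCII would presumably need the u modifier and UTF-8 input; that is not covered by this sketch.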
Can someone recommend a good PCRE expression for tokenizing like this?
Mike
--
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/