[nycphp-talk] SQL Full text searching and storing
Paul Houle
paul at devonianfarm.com
Sun Apr 15 00:24:18 EDT 2007
Ben Sgro (ProjectSkyline) wrote:
>
> I'm curious what field types, indexes are best for this type of
> application.
>
> I'm also curious about benchmarks. Google can return a huge number of
> records in a fraction of a second.
> Is this a forked process, each doing small amounts of work, or one
> large beefy server doing the transaction?
>
> My DB experience is mostly MySQL and I would prefer to build using this.
>
Most popular relational databases have a full-text search
extension. This includes MySQL, Oracle, MS SQL and Postgres --
unfortunately, these implementations do not correspond to any
standard, so the details are different for different databases. MySQL
has a full-text index that works quite well, with the caveat that it
only works for MyISAM at the moment:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
This means you can have full-text indexes or transactions, but not both.
Assuming you have a MyISAM table, implementing full-text search is as
simple as creating a FULLTEXT index and running queries with the MATCH
operator. You could be benchmarking MySQL with your own data in less
than half an hour (unless it takes more than that to rebuild your
index!). The full-text index will be updated automatically when you
modify the table, so it's a real "set-it-and-forget-it" proposition.
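For example, a minimal sketch (the table and column names here are
invented for illustration):

    -- MyISAM table with a FULLTEXT index over two columns
    CREATE TABLE docs (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        body TEXT NOT NULL,
        FULLTEXT (title, body)
    ) ENGINE=MyISAM;

    -- natural-language search; MATCH in the SELECT list returns
    -- the relevance score MySQL computed for each row
    SELECT id, title,
           MATCH (title, body) AGAINST ('full text search') AS score
      FROM docs
     WHERE MATCH (title, body) AGAINST ('full text search')
     ORDER BY score DESC;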
PostgreSQL implements FT search with the "tsearch2" extension, which
involves a shared library and stored procedures. It feels a lot like
somebody's science project -- you'll need to write your own code to
maintain the index, and dumps/restores of your database may be an
adventure, but tsearch2 is flexible and lets you do really neat things.
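In rough outline, a tsearch2 setup looks something like this (a sketch
from memory, not tested against a live install; table and column names
are hypothetical):

    -- add a tsvector column to an existing table and index it
    ALTER TABLE docs ADD COLUMN fts tsvector;
    UPDATE docs SET fts = to_tsvector(body);
    CREATE INDEX docs_fts_idx ON docs USING gist(fts);

    -- tsearch2 ships a trigger function that keeps fts in sync
    -- when rows are inserted or updated
    CREATE TRIGGER docs_fts_update BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE tsearch2(fts, body);

    -- query with the @@ match operator
    SELECT id, title
      FROM docs
     WHERE fts @@ to_tsquery('database & search');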
Years ago I spent a weekend writing some Perl scripts that create an
inverted index in MySQL tables. It's an inefficient way to do full-text
searching, but it lets you do things that other search engines don't
do, such as similarity searches. I loaded about 200,000 short
documents into it on a cheap PC and found I could get interactive
responses (<10 seconds) doing some pretty fancy things. I've used this
to support a few little research projects.
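The core of that kind of inverted index is just a word table and a
postings table; here's a rough sketch of the idea (the table names are
invented for this example):

    -- one row per distinct word in the corpus
    CREATE TABLE words (
        word_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word    VARCHAR(64) NOT NULL,
        UNIQUE (word)
    );

    -- one row per (word, document) pair, with an occurrence count
    CREATE TABLE postings (
        word_id INT UNSIGNED NOT NULL,
        doc_id  INT UNSIGNED NOT NULL,
        hits    INT UNSIGNED NOT NULL DEFAULT 1,
        PRIMARY KEY (word_id, doc_id)
    );

    -- find documents containing ALL the query words, crudely ranked
    SELECT p.doc_id, SUM(p.hits) AS score
      FROM postings p
      JOIN words w ON w.word_id = p.word_id
     WHERE w.word IN ('inverted', 'index')
     GROUP BY p.doc_id
    HAVING COUNT(DISTINCT p.word_id) = 2
     ORDER BY score DESC;

Once the postings are in plain tables you can join them against
anything else in the database, which is where the flexibility
(similarity searches and so on) comes from.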
There are plenty of other specialized full-text engines, such as
Lucene and Xerces, that do a great job, but they would require you to
do the work of maintaining the index yourself.
As far as Google goes, here's what I can tell you:
* Google is almost certainly based on a distributed main-memory
database. Google keeps its index in the RAM of a large number of
computers... It's too big to fit in the RAM of one machine, so queries
get split between several machines.
* Google's most critical 'secret' isn't PageRank, but the use of
implicit phrase searching. Older methods of text retrieval don't
consider the ordering of words when scoring -- if you want to get
results like Google, you really do need to score words higher when they
are in proximity to each other, and you need an algorithm that blends
this well with other sources of information. (A small example of
explicit phrase matching in MySQL follows this list.)
* Google's other 'secret' is that it's got a huge amount of text to
work with. The real intelligence is in the data it searches, ~not~ in
its algorithms -- Google's algorithms just bring that intelligence
out. You have different problems when you do text retrieval at
different scales: if you've got 100 documents and expect 2 to be
relevant for a query, it's probably not acceptable to have a recall of
50%... For small numbers of documents, you have to work hard to
eliminate false negatives. If you've got 10 billion documents, and 1
million are relevant to a particular topic, you don't really care if
you lose 90% of them -- but you do care that a few really excellent
documents float to the top.
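To give a taste of the phrase-matching idea mentioned above, MySQL's
boolean mode supports explicit quoted phrases (reusing the hypothetical
docs table from the earlier sketch); Google's implicit proximity
scoring is far more subtle than this:

    -- matches only rows containing the exact phrase,
    -- not just rows containing all three words somewhere
    SELECT id, title
      FROM docs
     WHERE MATCH (title, body)
           AGAINST ('"main memory database"' IN BOOLEAN MODE);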
Academic researchers who tried applying algorithms like PageRank to
small data sets (1 million documents) couldn't produce evidence that
PageRank works -- because it doesn't for small data sets.