Contextics

USPTO App: Simplifying the Long Tail of Search Queries (0070100795)


« 5 October 2007 | No Replies »

United States Patent Application: 0070100795

This is the second (and perhaps the most valuable) of three major inventions I made and patented while at Yahoo! Research.

All three are based on a form of collaborative filtering of search result URLs, using modified search engine similarity metrics and the search engine's own inverted index techniques for performance. This one solves the tail query rewrite problem: mapping extremely rare “tail” search queries onto more common (and, more importantly, bid-on) search queries. The search engine then displays the ads attached to those keywords. Experiments showed very high levels of coverage.

A simple explanation: Most rewrite systems have used either orthographic (spell-checking) solutions or thesaurus-based solutions (e.g. auto = car = automobile), or in practice a statistical mix of the two. This is fairly effective, but it fails horribly on reasonably simple problems. For example, a user types “Marius De Vries”. Say what? (My very reaction, in fact.) However, if you look at the URL result set and compare it to the result sets of the top N million search phrases, it very quickly becomes obvious that this guy was the Musical Director of Moulin Rouge, and that the best search term rewrite is “Moulin Rouge” (suggesting ads for the CD or movie).
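To make the idea concrete, here is a minimal sketch in Python. It is a toy illustration, not the patented implementation: the URLs and head queries are invented, the function names (build_inverted_index, rewrite_candidates) are hypothetical, and plain Jaccard overlap stands in for the modified similarity metrics mentioned above.

```python
from collections import defaultdict

# Invented toy data: each common (bid-on) query with its top result URLs.
HEAD_QUERY_RESULTS = {
    "moulin rouge": [
        "imdb.example/moulin-rouge",
        "music.example/moulin-rouge-soundtrack",
        "wiki.example/Moulin_Rouge",
    ],
    "film soundtracks": [
        "music.example/soundtrack-composers",
        "music.example/top-soundtracks",
        "shop.example/soundtrack-cds",
    ],
    "used cars": [
        "cars.example/used",
        "autos.example/listings",
    ],
}

# Invented result URLs for the rare tail query "Marius De Vries".
TAIL_URLS = [
    "wiki.example/Marius_de_Vries",
    "imdb.example/moulin-rouge",
    "music.example/moulin-rouge-soundtrack",
    "music.example/soundtrack-composers",
]


def build_inverted_index(query_results):
    """Map each URL to the set of head queries whose results contain it."""
    index = defaultdict(set)
    for query, urls in query_results.items():
        for url in urls:
            index[url].add(query)
    return index


def rewrite_candidates(tail_urls, query_results, index):
    """Rank head queries by Jaccard overlap with the tail query's URLs.

    The inverted index means only head queries sharing at least one URL
    with the tail query are ever scored.
    """
    tail = set(tail_urls)
    shared_counts = defaultdict(int)
    for url in tail:
        for query in index.get(url, ()):
            shared_counts[query] += 1

    scored = []
    for query, shared in shared_counts.items():
        union = len(tail | set(query_results[query]))
        scored.append((shared / union, query))
    scored.sort(reverse=True)
    return [(query, score) for score, query in scored]


index = build_inverted_index(HEAD_QUERY_RESULTS)
candidates = rewrite_candidates(TAIL_URLS, HEAD_QUERY_RESULTS, index)
best_rewrite = candidates[0][0] if candidates else None
```

With this toy data, "moulin rouge" shares two URLs with the tail query and ranks first, "film soundtracks" shares one and ranks second, and "used cars" shares none and is never scored at all. That last part is the reason for the inverted index: at the scale of millions of head queries, you only want to touch the ones that have some overlap.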

A relatively recent paper from Google tries to prove that this approach has a theoretical flaw. However, the proof relies on the total number of concepts in the world approaching infinity, which does not appear to be a reasonable assumption. Certainly a large number of concepts exist, into the billions (e.g. each individual person), but from a practical point of view the amount of “discourse” in a search engine is relatively finite, and our technique provides a way to map tens of thousands of different ways of talking about a thing onto the small set of terms that are MOST used for talking about it.