Mining internal search engine data

We do some limited analysis of search terms on CTSI web properties, but this is a big gap, per user experience author Lou Rosenfeld in his new book Search Analytics for Your Site. Rosenfeld is the author of the seminal Information Architecture for the World Wide Web, so when he speaks, I tend to pay attention. An interview in O’Reilly Radar digs into the details of analyzing search data in internal search engines and systems:

“[Site search isn’t] necessarily overlooked by users, but definitely by site owners who assume it’s a simple application that gets set up and left alone. But the search engine is only one piece of a much larger puzzle that includes the design of the search interface and the results themselves, as well as content and tagging. So search requires ongoing testing and tuning to ensure that it will actually work.

“Does site search analytics (SSA) reveal user intent better than other forms of analytics?

“I think so, as the data is far more semantically rich. While you might learn something about users’ information needs by analyzing their navigational paths, you’d be guessing far less if you studied what they’d actually searched for. Again, site search data is the best example of users telling us what they want in their own words. Site search analytics is a great tool for closing this feedback loop. Without it, the dialog between our users and ourselves — via our sites — is broken.”
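To make that feedback loop concrete, here is a minimal sketch of the kind of analysis Rosenfeld is talking about, run over an exported site-search log. The log format and column names (a CSV with query, num_results, and clicked fields) are hypothetical; most search engines can export something equivalent.

```python
import csv
from collections import Counter

def analyze_search_log(path):
    """Summarize a site-search log: top queries, zero-result queries,
    and abandoned searches.

    Assumes a hypothetical CSV export with columns:
    query, num_results, clicked (1 if the user clicked a result).
    """
    queries = Counter()
    zero_results = Counter()   # likely content or synonym gaps
    abandoned = Counter()      # results shown, but nothing clicked

    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            q = row["query"].strip().lower()
            queries[q] += 1
            if int(row["num_results"]) == 0:
                zero_results[q] += 1
            elif row["clicked"] == "0":
                abandoned[q] += 1

    print("Top queries (users telling us what they want, in their own words):")
    for q, n in queries.most_common(10):
        print(f"  {n:5d}  {q}")

    print("\nZero-result queries (candidates for new content or synonyms):")
    for q, n in zero_results.most_common(10):
        print(f"  {n:5d}  {q}")

    print("\nAbandoned queries (results may be poorly ranked or labeled):")
    for q, n in abandoned.most_common(10):
        print(f"  {n:5d}  {q}")

if __name__ == "__main__":
    analyze_search_log("search_log.csv")
```

Even a report this crude supports the ongoing testing and tuning Rosenfeld calls for: zero-result and abandoned queries point directly at gaps in content, tagging, and the results interface.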

Read more:

“Search Needs a Shake-Up: From Simple Document Retrieval to Question Answering”

If you are thinking about making Internet trawling more efficient, take a look at a recent perspective published in Nature. Researcher Oren Etzioni (whose lab introduced open information extraction) “calls on researchers to think outside the keyword box…”

Open information extraction obviates topic-specific collections of example sentences, and instead relies on its general model of how information is expressed in English sentences to cover the broad, and unanticipated, universe of topics on the Internet.

The basic idea is remarkably simple: most sentences contain highly reliable syntactic clues to their meaning. For example, relationships are often expressed through verbs (such as invented, married or elected) or verbs followed by prepositions (such as invented by, married to or elected in). It is often quite straightforward for a computer to locate the verbs in a sentence, identify entities related by the verb, and use these to create statements of fact. Of course this doesn’t always go perfectly. Such a system might infer, for example, that ‘Kentucky Fried Chicken’ means that the state of Kentucky fried some chicken. But massive bodies of text such as the corpus of web pages are highly redundant: many assertions are expressed multiple times in different ways. When a system extracts the same assertion many times from distinct, independently authored sentences, the chance that the inferred meaning is sound goes up exponentially.
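The basic idea really is simple enough to sketch in a few lines. Below is a toy illustration of verb-based extraction with redundancy counting; it is emphatically not Etzioni’s system. The verb list, the crude capitalized-phrase stand-in for entity recognition, and the sample sentences are all assumptions for illustration, with a regex doing the work a real parser would do.

```python
import re
from collections import Counter

# Crude pattern: "<Entity> <verb>[ <preposition>] <Entity>", where an
# "entity" is just a run of capitalized words. A real system would use
# a syntactic parser and a proper named-entity recognizer.
PATTERN = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) "            # subject entity
    r"((?i:invented|married|elected|fried))"      # relation verb
    r"(?: (by|to|in))? "                          # optional preposition
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"             # object entity
)

def extract_triples(sentence):
    """Yield (subject, relation, object) assertions from one sentence."""
    for subj, verb, prep, obj in PATTERN.findall(sentence):
        relation = f"{verb.lower()} {prep}".strip()
        yield (subj, relation, obj)

# A tiny, invented "corpus". The same fact is asserted three times in
# independently worded sentences; the last sentence triggers exactly the
# kind of misfire the article describes.
corpus = [
    "Marie Curie married Pierre Curie in 1895.",
    "In the summer of 1895, Marie Curie married Pierre Curie.",
    "As every biography notes, Marie Curie married Pierre Curie.",
    "Kentucky Fried Chicken opened a new restaurant in Seattle.",
]

counts = Counter()
for sentence in corpus:
    for triple in extract_triples(sentence):
        counts[triple] += 1

# Redundancy as a confidence signal: assertions extracted repeatedly from
# distinct sentences are far more likely to be sound than singletons.
for (s, r, o), n in counts.most_common():
    flag = "" if n > 1 else "  <- low confidence, seen once"
    print(f"{n}x  ({s}, {r}, {o}){flag}")
```

Running this prints (Marie Curie, married, Pierre Curie) three times and the spurious (Kentucky, fried, Chicken) once, so even a simple frequency threshold filters out the misfire, which is the redundancy argument in miniature.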

Much more research has to be done to improve information-extraction systems — including our own. Their abilities need to be extended from being able to infer relations expressed by verbs to those expressed by nouns and adjectives. Information is often qualified by its source, intent and the context of previous sentences. The systems need to be able to detect those, and other, subtleties. Finally, automated methods have to be mapped to a broad set of languages, many of which pose their own idiosyncratic challenges.

One exceptional system — IBM’s Watson — utilizes a combination of information extracted from a corpus of text equivalent to more than 1 million books combined with databases of facts and massive computational power. Watson won a televised game of Jeopardy against two world-class human players in February this year. The multi-billion dollar question that IBM is now investigating is ‘can Watson be generalized beyond the game of Jeopardy?’