If you are thinking about making Internet trawling more efficient, take a look at the recent perspective published in Nature. Researcher Oren Etzioni (whose lab introduced open information extraction) “calls on researchers to think outside the keyword box…”
Open information extraction obviates the need for topic-specific collections of example sentences; instead, it relies on a general model of how information is expressed in English sentences to cover the broad, and unanticipated, universe of topics on the Internet.
The basic idea is remarkably simple: most sentences contain highly reliable syntactic clues to their meaning. For example, relationships are often expressed through verbs (such as invented, married or elected) or verbs followed by prepositions (such as invented by, married to or elected in). It is often quite straightforward for a computer to locate the verbs in a sentence, identify entities related by the verb, and use these to create statements of fact. Of course this doesn’t always go perfectly. Such a system might infer, for example, that ‘Kentucky Fried Chicken’ means that the state of Kentucky fried some chicken. But massive bodies of text such as the corpus of web pages are highly redundant: many assertions are expressed multiple times in different ways. When a system extracts the same assertion many times from distinct, independently authored sentences, the chance that the inferred meaning is sound goes up exponentially.
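Both ideas in the paragraph above can be sketched in a few lines. The following is my own toy illustration, not the actual system from Etzioni's lab: a tiny hand-written verb lexicon and two regular expressions stand in for the general syntactic model a real extractor learns from English at large, and a counter stands in for redundancy-based confidence. It even reproduces the spurious 'Kentucky Fried Chicken' reading.

```python
import re
from collections import Counter

# Toy relation lexicon; a real open-extraction system learns such
# patterns from general English rather than hard-coding them.
VERBS = r"(?P<verb>invented|married|elected|fried)"

# "X was invented by Y" -> (Y, invented, X)
PASSIVE = re.compile(rf"^(?P<obj>.+?) (?:was|were) {VERBS} by (?P<subj>.+?)\.?$",
                     re.IGNORECASE)
# "Y invented X"        -> (Y, invented, X)
ACTIVE = re.compile(rf"^(?P<subj>.+?) {VERBS} (?P<obj>.+?)\.?$", re.IGNORECASE)

def extract(sentence):
    """Return one normalized (subject, verb, object) triple, or None."""
    m = PASSIVE.match(sentence) or ACTIVE.match(sentence)
    if m is None:
        return None
    # Lowercasing lets independently phrased sentences yield the same triple.
    return tuple(m.group(g).lower() for g in ("subj", "verb", "obj"))

# Independently authored sentences expressing the same assertion in
# different ways, plus the spurious reading mentioned in the text.
corpus = [
    "Alexander Graham Bell invented the telephone.",
    "The telephone was invented by Alexander Graham Bell.",
    "Kentucky Fried Chicken.",
]

# Redundancy as a confidence signal: the more often a triple recurs
# across distinct sentences, the more likely its meaning is sound.
facts = Counter(t for t in map(extract, corpus) if t is not None)
for triple, count in facts.most_common():
    print(count, triple)
```

Running this prints the Bell triple with count 2 and the spurious Kentucky triple with count 1; a system trawling the whole web would see the genuine assertion repeated thousands of times and the spurious one almost never.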
Much more research has to be done to improve information-extraction systems, including our own. Their abilities need to be extended from inferring relations expressed by verbs to those expressed by nouns and adjectives. Information is often qualified by its source, its intent and the context of previous sentences; the systems need to be able to detect these and other subtleties. Finally, automated methods have to be extended to a broad range of languages, many of which pose their own idiosyncratic challenges.
One exceptional system, IBM’s Watson, combines information extracted from a corpus of text equivalent to more than 1 million books with databases of facts and massive computational power. Watson won a televised game of Jeopardy! against two world-class human players in February this year. The multi-billion-dollar question that IBM is now investigating is ‘can Watson be generalized beyond the game of Jeopardy!?’