Information Extraction: Ready for Prime Time?

Oren Etzioni of the University of Washington gave a talk at Adobe in March with a rundown of the current state of the art in IE.  We’ll get to that in a minute, but what is IE?  Information Extraction is the science of making sense of unstructured human text.  The challenge is that human language can be imprecise.  Structured data is so named because it is systematically organized into tables in a way that makes it easy to analyze.  Unstructured data, such as human speech or writing, does not lend itself to tables or any other fixed structure.  To analyze it, it is common to employ natural language processing to build a system that derives useful information from what people say and write.  This may not be possible in politics, but perhaps in business it could work.

Why is this useful?  Today’s technology allows us to ask “What is the best Mexican restaurant in San Jose?” and get a ranking by the star ratings that users have entered.  IE allows us to ask “Where can I get the best margarita in San Jose?” and get a ranking based on what reviewers say about margaritas.  To rank on attributes that weren’t defined in advance, a query needs a deeper understanding of what is being said in the reviews, not just the star ratings.

How do you analyze unstructured text?  The key to answering attribute-based questions is context.  Information extraction relies on machine learning: algorithms attempt to determine what is relevant.  Algorithms combined with scalability are the keys to generating useful results.  The IE model identifies a tuple and the probability that the relationship it expresses holds.  An example might be trying to find out who invented the light bulb, and getting a result such as invented(Edison, light bulb), 0.99, indicating a strong link between Edison and inventing the light bulb.
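To make the tuple-plus-probability idea concrete, here is a minimal Python sketch.  The Extraction class, the answer function and the 0.5 cutoff are hypothetical names and values chosen for illustration, not anything shown in the talk; only the invented(Edison, light bulb), 0.99 example comes from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One extracted tuple, relation(arg1, arg2), with a confidence score."""
    relation: str
    arg1: str
    arg2: str
    confidence: float


def answer(extractions, relation, arg2, min_confidence=0.5):
    """Return (arg1, confidence) pairs for relation(?, arg2), best first."""
    hits = [
        e for e in extractions
        if e.relation == relation
        and e.arg2 == arg2
        and e.confidence >= min_confidence
    ]
    hits.sort(key=lambda e: e.confidence, reverse=True)
    return [(e.arg1, e.confidence) for e in hits]


extractions = [
    # The example from the text above.
    Extraction("invented", "Edison", "light bulb", 0.99),
    # A made-up low-confidence tuple standing in for extraction noise.
    Extraction("invented", "Placeholder Name", "light bulb", 0.12),
]

print(answer(extractions, "invented", "light bulb"))
```

The point of keeping the score with each tuple is that the question "who invented the light bulb?" becomes a filter and a sort, and weak extractions can be dropped instead of being shown as answers.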

If one is looking for examples of some attribute, they often occur in context with other terms, which we might consider clues.  One can then use those clues to find more instances of the attribute.  This is how we pick apart context in more detail.
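The clue idea can be sketched in a few lines of Python.  This is a toy under stated assumptions: the sentences and seed cities are invented for illustration, a "clue" is simply the two words that come right before a known instance, and learn_clues and apply_clues are hypothetical helper names.

```python
import re
from collections import Counter


def learn_clues(sentences, seeds, top_n=1):
    """Count the two-word phrases that appear right before known instances."""
    clues = Counter()
    for sentence in sentences:
        for seed in seeds:
            match = re.search(r"\b(\w+ \w+) " + re.escape(seed) + r"\b", sentence)
            if match:
                clues[match.group(1)] += 1
    return [clue for clue, _ in clues.most_common(top_n)]


def apply_clues(sentences, clues):
    """Collect every capitalized word that directly follows a learned clue."""
    found = set()
    for sentence in sentences:
        for clue in clues:
            pattern = r"\b" + re.escape(clue) + r" ([A-Z]\w+)"
            found.update(re.findall(pattern, sentence))
    return found


# Invented example sentences; the seeds are instances we already know about.
sentences = [
    "We flew to cities such as Paris last year.",
    "Tourists love cities such as Rome in the spring.",
    "They visited cities such as Lisbon on the trip.",
]
seeds = {"Paris", "Rome"}

clues = learn_clues(sentences, seeds)
new_instances = apply_clues(sentences, clues) - seeds
print(clues, new_instances)
```

Starting from two known instances, the clue learned from their context turns up a third instance that was never listed in advance.  A real system would need far more text and some way to score clues so that one bad clue does not pull in junk.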

The challenge is extracting the information we’re looking for and ignoring the rest.  One of the more interesting applications is a shopping tool, decide.com, that checks different websites for rumors about new product introductions.  It goes further and estimates when a new product might come along based on the company’s previous history, what will happen to the pricing, and what kind of features are being talked about.  It compiles the rumors from multiple sources into a summary, which saves time, and the results can be displayed on a mobile device.

Dr. Etzioni’s pet project is Open IE.  His premise is that word relations have a canonical structure.  By looking at this structure you can extract the relationships for analysis.  You generally don’t pre-identify the concepts, because you want to be able to find interesting stuff you didn’t know to look for.  His extractors find these relationships.  You can play with his model on the web at openie.cs.washington.edu or download the Open IE extractors without license fees.
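As a rough illustration of what "canonical structure" could mean, the Python sketch below pulls an (argument, relation, argument) triple out of a simple sentence without a predefined list of relations.  It is emphatically not how the Open IE extractors work; the regular expression, the extract function and the capitalization heuristic are assumptions made up for this example.

```python
import re

# Toy structure: capitalized words, then lowercase words, then a phrase that
# starts with "the", "a", "an" or a capital letter. An assumption for
# illustration only, not the pattern the real extractors use.
PATTERN = re.compile(
    r"((?:[A-Z]\w*\s)+)"               # first argument, e.g. "Edison "
    r"((?:[a-z]+\s)+?)"                # relation phrase, e.g. "invented "
    r"((?:(?:the|a|an)\s|[A-Z]).+)"    # second argument, e.g. "the light bulb"
)


def extract(sentence):
    """Return (arg1, relation, arg2) for a simple sentence, or None."""
    text = sentence.strip().rstrip(".")
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    arg1, relation, arg2 = (group.strip() for group in match.groups())
    return (arg1, relation, arg2)


print(extract("Edison invented the light bulb."))
print(extract("Seattle is located in Washington."))
print(extract("it rained all day"))
```

Nothing in the pattern names "invented" or "is located in" ahead of time, which is the spirit of not pre-identifying the concepts.  Real text is far messier than these sentences, so anything practical would lean on part-of-speech tagging and learned confidence scores instead of a single regular expression.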

There are some early IE efforts out there today.  Google Knowledge Graph and Facebook Graph Search are a couple.  Oren is part of a startup, Decide.com, that is also in the space.  All of them are at a relatively early stage; as the algorithms improve, the usefulness of the data will improve.  People want an answer, not a bunch of results to sort through.  This becomes increasingly important as we look for these results more and more on our mobile devices.  A small screen forces a more succinct response, like asking a person and getting an answer.  Oren did mention that Siri, which can respond to a query with an answer, is very limited.  He wants to use all available documents, tweets, reviews, posts, blogs and everything else he can get his hands on to formulate an answer.

Check out his extractor on the website mentioned above for more information, and a free test drive.

In answer to the question, IE is almost ready for prime time.  There are promising signs for this technology, but I wouldn’t bet my house on it yet.

 


About Big Data Perspectives

Erik Ottem has over 25 years of technology experience with IBM, Seagate, Gadzoox Networks and Agilent. Observations and comments about Big Data are presented for your review.