A Simple Semi-Automatic Text Summarizer


This one goes out to all the data geeks in the crowd…

In other posts, I’ve mentioned a text summarizer I’ve used to help me glean the salient points from a large amount of text.  The problem this tool addresses is simple: let’s say you’ve got several articles, or a book, or some other large chunk of text, and you want to discern its larger themes and semantic trends.  Solving this algorithmically in a robust way for arbitrary text, with enough quality for a public-facing product, is very hard.  But a simple, time-honored approach borrowed from classical Information Retrieval can be good enough for occasional, personal use.  I’ve been using a simple script to do this, and I recently posted it on the web – give it a whirl:  http://luvogt.com/summ.html

At this point, you may be asking, “What the heck is he talking about?”  Okay, how about some concrete examples. As I mentioned in a previous post, I wanted to quickly discern the most important themes from a series of 41 posts on CNN Money that highlighted the best advice some successful business leaders had ever received.  So I plugged them into my summarizer, and this is what popped out (these are only the top terms):

Let me outline the lay of the land here: these are the words and phrases that appeared in the text of the 41 articles, sorted by a measure of “importance”.  You’ll note that the words have been mangled a bit – this is a process called stemming, which maps different forms of the same word to a semantically similar root (and also does something a little strange – converts a letter “y” at the end of a word to an “i”).  The point is that “drinking coffees” is treated as semantically similar to “drinks coffee”, and so on.  You’ll also note that some non-meaningful words like “the” and “of” are missing.  The summarizer skips these unimportant “stop words” to get to terms that express the core meaning.  The second column of the table is how many times each term occurred across all of the documents, and the third column is how many distinct documents the term occurred in.  The last column is a magic number that combines the first two columns with other signals – the length of the word, how many syllables it has, and so on – to guess the importance of that term.  If you are at all familiar with “word clouds”, you could basically render these terms in decreasing font size to get a word cloud.
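To make the mechanics concrete, here’s a minimal sketch of that pipeline in Python – a toy stand-in, not the actual tool.  The `stem` function, the tiny stop-word list, and the scoring formula (frequency × document count × term length) are all my illustrative assumptions; a real stemmer like Porter’s has far more rules, and the tool’s “magic number” clearly weighs more signals (syllable counts, etc.):

```python
from collections import Counter
import re

# A few stop words; a real tool would use a much larger list.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "it", "that"}

def stem(word):
    # Toy Porter-style stemmer: strip a common suffix, then map a
    # trailing "y" to "i" (hence the slightly mangled output terms).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

def summarize(documents):
    total = Counter()      # column 2: occurrences across all documents
    doc_freq = Counter()   # column 3: number of documents containing the term
    for doc in documents:
        terms = [stem(w) for w in re.findall(r"[a-z']+", doc.lower())
                 if w not in STOP_WORDS]
        total.update(terms)
        doc_freq.update(set(terms))
    # Hypothetical "magic number": weight frequency and spread by term length.
    score = {t: total[t] * doc_freq[t] * len(t) for t in total}
    return sorted(score, key=score.get, reverse=True)
```

Feeding it a handful of documents returns the stemmed terms in decreasing order of score – the first column of the table above.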

Interpreting the output takes a little practice, but can help lead to a better understanding of the underlying text, especially for larger chunks of text.  In this case, for example, I scan down the list and ignore terms that appear in most of the articles (like “advice”), or which don’t carry a lot of meaning (like “because” or “really”).  If I do that, I identify a handful of possibly interesting terms, including: “people”, “company”, “interest”, “start”, “person”, “listen” and “think”.  If I go back and search through the original source documents for the context these words were used in, I quickly discover that one common theme is around establishing and building relationships with others through careful listening.  It’s interesting to note that when I summarized these articles manually, without the help of my tool, I identified “Listening and Respect” as the most important theme.
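That manual filtering step can be sketched the same way – given each term’s document count, drop the ones that appear in nearly every article, plus anything on a personal ignore list.  The function name, the 80% threshold, and the sample numbers here are my own made-up illustration, not values from the tool:

```python
def interesting_terms(term_doc_counts, n_docs, max_ratio=0.8, ignore=()):
    # Skip terms that appear in almost every document (like "advice")
    # and terms on a personal ignore list (like "because" or "really").
    return [t for t, df in term_doc_counts
            if df / n_docs <= max_ratio and t not in ignore]
```

For example, with 41 documents, a term appearing in 40 of them gets dropped as too common, while one appearing in 12 survives.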

Here’s another example: when I pointed the tool at this blog, here’s what it said were the most important concepts that I’ve been writing about – I’d say it’s right on:

To be clear, the tool has limitations (not the least of which is its limited ability to parse HTML properly and to ignore scripts and boilerplate/sidebar text on pages – I may work on that).  Also, I call the tool semi-automatic because it does require some interpretation.  Regardless, I’ve had a lot of fun trying different source texts (like the Bible, the Constitution, or the lyrics of various “concept albums”), and would love to hear what you discover playing around with it.

It’s also important to understand that the source text should be “of a piece”.  For example, I’ve tried using the tool to summarize the headlines on major news sites, and what comes out is a mish-mash of terms from across the (unrelated) news articles.  In order for it to reveal an underlying theme, there has to be an underlying theme.  Perhaps it could also be valuable for just that – revealing a lack of cohesiveness.

I post this under the “leadership” heading because leaders often need to be able to quickly summarize large amounts of text – as a leader, I can imagine using this kind of tool in a wide variety of ways, from summarizing a stream of tweets or a blog to summarizing a person’s resume (as an example, see mine below).  The possibilities are tantalizing… Here’s that link again:  http://luvogt.com/summ.html