A Simple Semi-Automatic Text Summarizer

This one goes out to all the data geeks in the crowd….

In other posts, I’ve mentioned a text summarizer I’ve used to help me glean the salient points from a large amount of text.  The problem this tool addresses is simple: let’s say you’ve got several articles, or a book, or some other large chunk of text, and you want to discern the larger themes and semantic trends of the text.  Solving this problem algorithmically, in a robust way, for arbitrary text, and with enough quality for a public-facing product, is very hard.  But a simple, time-honored approach borrowed from classical Information Retrieval can be good enough for occasional, personal use.  I’ve been using a simple script to do this, and I recently posted it on the web – give it a whirl:  http://luvogt.com/summ.html

At this point, you may be asking, “What the heck is he talking about?”  Okay, how about some concrete examples. As I mentioned in a previous post, I wanted to quickly discern the most important themes from a series of 41 posts on CNN Money that highlighted the best advice some successful business leaders had ever received.  So I plugged them into my summarizer, and this is what popped out (these are only the top terms):

Let me outline the lay of the land here: these are the words and phrases that appeared in the text of the 41 articles, sorted by a measure of “importance”.  You’ll note that the words have been mangled a bit – this is a process called stemming, which maps different forms of the same word to a semantically similar root (and also does something a little strange – converts the letter “y” to an “i” when at the end of a word).  The point here is that “drinking coffees” is semantically similar to “drinks coffee”, etc.  You’ll also note that some non-meaningful words like “the” and “of” are missing.  The summarizer is just trying to bypass these unimportant words and get to terms that express the core meaning.  The second column of the table is how many times each term has occurred across all of the documents, and the third column is how many unique documents this term has occurred in.  The last column is a magic number that takes the first two columns as well as other things like how long the word is, how many syllables it has, etc., into account to try to guess the importance of that term.  If you are at all familiar with “word clouds”, you could basically render these terms in decreasing font size to get a word cloud.
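For the data geeks who want to see the moving parts, here is a minimal Python sketch of the kind of table described above. It is not the script behind the tool: the stop-word list, the `crude_stem` helper, and especially the `score` formula are stand-ins I made up for illustration (the real “magic number” also looks at things like syllables), and it handles single words only, not phrases.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; the real tool's list is not given in this post.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "it", "that"}


def crude_stem(word):
    """Rough stand-in for a real stemmer such as Porter's."""
    if word.endswith("ies") and len(word) > 4:
        word = word[:-3] + "y"
    elif word.endswith("ing") and len(word) > 5:
        word = word[:-3]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    # Mimic the quirk mentioned above: a trailing "y" becomes "i".
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word


def score(term, total, docs):
    """Purely illustrative weighting, NOT the tool's actual formula."""
    return (total + docs) * math.log(1 + len(term))


def summarize(documents):
    """Return rows of (term, total count, document count, score), best first."""
    total = Counter()      # occurrences across all documents
    doc_count = Counter()  # number of distinct documents containing the term
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        terms = [crude_stem(w) for w in words if w not in STOP_WORDS]
        total.update(terms)
        doc_count.update(set(terms))
    rows = [(t, total[t], doc_count[t], score(t, total[t], doc_count[t]))
            for t in total]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

Calling `summarize` on a list of article texts gives back rows in the same four-column shape as the table: term, total occurrences, number of documents, and a score to sort by.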

Interpreting the output takes a little practice, but can help lead to a better understanding of the underlying text, especially for larger chunks of text.  In this case, for example, I scan down the list and ignore terms that appear in most of the articles (like “advice”), or which don’t have a lot of meaning (like “because” or “really”).  If I do that, I identify a handful of possibly interesting terms, including: “people”, “company”, “interest”, “start”, “person”, “listen” and “think”.  If I go back and search through the original source documents for the context these words were used in, I quickly discover that one common theme is around establishing and building relationships with others through careful listening.  It’s interesting to note that when I summarized these articles manually, without the help of my tool, I identified “Listening and Respect” as the most important theme.
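The scan-and-ignore step I do by eye could be roughed out in a few lines of Python. This is only a sketch under my own assumptions: the row layout (term, total count, document count, score), the 80% cutoff for “most of the articles”, and the hand-picked `ignore` set are all hypothetical, and the tool itself does none of this for you.

```python
def interesting_terms(rows, n_docs, max_doc_fraction=0.8, ignore=()):
    """Keep terms that are neither near-universal nor hand-flagged as filler.

    rows: iterable of (term, total_count, doc_count, score) tuples (assumed layout).
    n_docs: number of source documents.
    max_doc_fraction: assumed cutoff for "appears in most of the articles".
    ignore: terms judged by a human to carry little meaning.
    """
    ignored = set(ignore)
    kept = []
    for term, _total, doc_count, _score in rows:
        if term in ignored:
            continue  # filler words such as "because" or "really"
        if doc_count / n_docs > max_doc_fraction:
            continue  # shows up nearly everywhere, so it says little
        kept.append(term)
    return kept
```

The judgment about which words are filler still comes from a person, which is part of why I call the tool semi-automatic.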

Here’s another example: when I pointed the tool at this blog, here’s what it said were the most important concepts that I’ve been writing about – I’d say it’s right on:

To be clear, the tool has limitations (not the least of which is its limited ability to parse HTML properly and ignore scripts and boilerplate/sidebar text on pages – I may work on that).  Also, I call the tool semi-automatic because it does require some interpretation.  Regardless, I’ve had a lot of fun trying different source texts (like the Bible, or the Constitution, or the lyrics of various “concept albums”), and would love to hear from you what you discover playing around with it.

It’s also important to understand that the source text should be “of a piece”.  For example, I’ve tried using the tool to summarize the headlines on major news sites, and what comes out is a mish-mash of terms from across the (unrelated) news articles.  In order for it to reveal an underlying theme, there has to be an underlying theme.  Perhaps it may also be valuable in doing just that – revealing a lack of cohesiveness.

I post this under the “leadership” heading because leaders often need to be able to quickly summarize large amounts of text – as a leader, I can imagine using this kind of tool in a wide variety of ways – from summarizing a stream of tweets or blog posts, to summarizing a person’s resume (as an example, see mine below).  The possibilities are tantalizing… Here’s that link again:  http://luvogt.com/summ.html

Netizen, Choose Your Ecosystem!

Let’s face it, the time has finally come.  The “cloud” is here and for the most part it works wonderfully, but there are strings attached, and unfortunately you have to pick sides.  Okay, you don’t have to pick sides.  You could just take whatever comes your way and make decisions as you go, but that would unnecessarily complicate your life. Technology is designed to do exactly the opposite.  However, all technology, from the earliest stone tools to the latest gadget, requires some thought and training to be used properly.*  With the ubiquity of cloud services and mobile devices (currently in the U.S., roughly half the population has smartphones, and tablets are selling like hotcakes), a plethora of options has surfaced that were not there before, run by big guys and upstarts alike.

The cloud means a lot of things to a lot of people, so let’s clarify what I’m talking about here.  By “the cloud” I mean that set of services that a “large” portion of device users engage with on a “regular” basis.  In other words, the minimum set of services that a provider must offer that work well across devices, and which are well-integrated both with each other and the devices.  Sound vague?  Well, let’s make it concrete.  Here are some “must-have” services that I think most users would agree they would like or need:

  • email and messaging [Ap, F, G, M, Y]
  • calendar and address book [Ap, F, G, M, Y]
  • web search [G, M, Y]
  • product search [Am, G, M, Y]
  • news [G, M, Y]
  • voice and video calling [Ap, G, M, Y]
  • social networking [F, G]
  • maps, directions, and local search [Ap, G, M, Y]
  • photo/video/file sharing [F, G, Y]
  • buy/rent and consume media (movies, books, music) [Am, Ap, G]
  • document, spreadsheet & preso editing and sharing [G, M]
  • backing up your files [Ap, G, M]

Those funny colored letters in brackets are shorthand for the “big six” technology companies – the heavy hitters in the world of offering internet-based services to the public: Amazon (Am), Apple (Ap), Facebook (F), Google (G), Microsoft (M), and Yahoo (Y).**  I’ve noted which of these companies currently has a strong offering for each of these services.  While we could debate my definition of “strong offering,” it is still instructive to scan down the list to see which companies have the most comprehensive portfolio. The order looks something like: Google, Microsoft, Yahoo, Apple, Facebook, Amazon.  This could change with time, but that’s how things stand now.***
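If you’d rather not count brackets by eye, the tally is easy to reproduce. Here is a small Python sketch that transcribes the service list above and counts how many services each company covers; the dictionary and the `rank_providers` name are just my own scaffolding for illustration.

```python
from collections import Counter

# Transcribed from the "must-have" services list above.
SERVICES = {
    "email and messaging": ["Ap", "F", "G", "M", "Y"],
    "calendar and address book": ["Ap", "F", "G", "M", "Y"],
    "web search": ["G", "M", "Y"],
    "product search": ["Am", "G", "M", "Y"],
    "news": ["G", "M", "Y"],
    "voice and video calling": ["Ap", "G", "M", "Y"],
    "social networking": ["F", "G"],
    "maps, directions, and local search": ["Ap", "G", "M", "Y"],
    "photo/video/file sharing": ["F", "G", "Y"],
    "buy/rent and consume media": ["Am", "Ap", "G"],
    "document, spreadsheet & preso editing": ["G", "M"],
    "backing up your files": ["Ap", "G", "M"],
}


def rank_providers(services):
    """Count the services each provider covers, most comprehensive first."""
    counts = Counter(p for providers in services.values() for p in providers)
    return counts.most_common()


for provider, covered in rank_providers(SERVICES):
    print(f"{provider}: {covered} of {len(SERVICES)}")
```

The counts come out to 12 for Google, 9 for Microsoft, 8 for Yahoo, 6 for Apple, 4 for Facebook, and 2 for Amazon, which is the ordering given above. Note that this treats every service as equally important, which you may well disagree with.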

There are other important factors aside from feature sets that revolve more around the company itself.  Some I can think of include:

  • someone you can trust. [Am, Ap, G, M, Y]
  • someone who lets you own your data. [G, M?, Y?]
  • someone who takes security and privacy seriously. [Am, Ap, F?, G, M, Y]
  • someone who’ll be around for a long time. [Am, Ap, F, G, M, Y?]
  • someone who has their own OS and devices. [Am, Ap, G, M]
  • someone who has a good track record of things “just working.” [Am, Ap, F, G]

Personally, taking all of the above factors into consideration, I’ve decided to “go Google,” to use the marketing phrase that basically means moving all of your usage of online service applications (like word processing, email, social networking, etc.) to Google’s cloud-based systems.  In full disclosure, last summer I took a job at Google (and so far it’s been great), so I’ve already gone to Google, but now I’m “going Google.” Although the former did accelerate the latter, I was already leaning that way to begin with, so it likely would have happened anyway.  In fact, it’s more likely that I chose to work at Google because it is the best choice in public cloud services. For me, it’s pretty much a no-brainer – Google has all of the services and all of the right company characteristics.  I’m not the only one coming to these two conclusions – technology writer John Battelle has sided with Google in what he calls the “cloud commit conundrum,” and renowned inventor Ray Kurzweil has recently joined Google.

Like this angry guy, you may resist the necessity to choose an ecosystem, but the reality is that as a netizen, living in the cloud and trusting someone with your data will only become more and more inevitable over time.  Just ask Bruce Sterling.  As John Battelle points out, by just buying a device, you are already implicitly partially committing to one of the players. You could spread your data across providers to hedge your bets, i.e., not put all your data in one “basket.”  But I would argue that you will probably end up paying for it in the long run with painful migrations or complete loss of data.  Perhaps that is a price you are willing to pay to prevent one company from knowing “too much” about you.

Keep in mind you’ll also be paying another cost – lost opportunity for truly integrated and personalized services.  Amazing things like swapping out devices, and having everything just work – all of your stuff, your preferences, your user model, your personal assistant, will just be there and work. As an example, I’ve drafted this article over the course of several months, seamlessly using six different devices (2 Macs, a PC, iPad, iPhone, and Nexus tablet) and Google Drive apps.  If I had just been limited to using one of those devices, I’d never have finished.  As a second example, after recently getting a Nexus 7 tablet, I’ve been able to experience a great new product called Google Now – a service that automatically makes relevant suggestions for you personally.  When I first brought it up, it already had a suggestion on how to navigate to a restaurant I had searched for earlier that day from my iPhone – how cool is that?  If I had been using a different search engine, that never would have happened.

Does this mean I’ll stop using the other 5 “big guys?”  No, of course not.  Not only do they provide some services I can’t get at Google (yet!), but as I’ve explained in the past, considering the business I’m in, I can’t afford to ignore the competition.  But it does mean that I’ll limit my use of their services where they overlap with Google’s, which is in a lot of places.

It’s time to choose your ecosystem – what factors will you consider?

* I visualize a prehistoric parent carefully demonstrating how you never cut towards yourself with a stone implement.

** You could throw a few others into the mix here, like AOL, eBay, LinkedIn, and Twitter, but their service portfolios are not at the same level as the others (yet!).  As an aside, I am amazed that when other people make similar lists, they exclude Yahoo.  Its offerings are far too broad and its user base far too large to ignore, regardless of its mediocre track record.  And with a new CEO at the helm, things could easily turn around for them.

*** In fact, you could look at the places where companies have gaps in this list and the next to see where they might be headed.

Don’t Fear the Filter

“reaper (rē’pər) n.  1. One that reaps, especially a machine for harvesting grain”

“grain (grān) n. 13b. An essential quality or characteristic.” – answers.com

There’s been a lot of  FUD spreading around the net about a “filter bubble.”  In my year-end review of TED videos, I even noted that Eli Pariser gave a talk on the topic in that highly regarded forum.  As someone deeply involved with designing and implementing one instance of such a filter, I’d like to send a simple message – don’t fear the filter.  After all, you’ve been living in a filtered media world since you were born – it was just filtered by humans, not algorithms.  Neither humans, nor the algorithms they design, are infallible, and the assumption that humans will necessarily provide better curation of content is not only an untested hypothesis, but completely disregards the impracticality of an only-human approach to the information glut we now face.  While there are legitimate reasons to be concerned, the important thing to realize is that the designers of the algorithms are already acutely aware of these potential problems, and attempt to address them when solving the very hard problem of algorithmic content curation.

For example, Pariser expresses deep concern about the “invisible editing” done “without my permission” that results in “showing us what it thinks we want to see instead of what we need to see”.  Of course, he’s talking about “the internet” (or, more precisely, the major players in content on the internet), but those same statements could equally apply to the editorial staff in any traditionally run media outlet.  Yes, humans are typically a better judge of quality and it’s true that algorithms don’t yet have the embedded ethics that editors have.  But imagine if we lived in a world where every time you needed to find something on the internet, you had to rely on a human to find it for you instead of a search engine.  In reality, the internet has been in a “filter bubble” ever since the first web search engines came online over 15 years ago.  And while it’s true that more and more content is being run through filters, that’s because the naive approach to using humans doesn’t scale, and the amount of information is certainly growing.

Unfortunately, the typical characterization of personalization technology, i.e. that the algorithms are mainly looking at what you click on first, is a gross simplification that doesn’t do justice to the complex algorithms used in the field of content recommendation.  Yes, clicks are a tremendously important input signal to the algorithms, but they are by no means the only signal and most importantly, they are not sufficient in and of themselves to build a system that could compete with the likes of a human editor.  And yes, it is often the case that what the algorithm is trying to optimize is clicks.  But not always, and not only.  More importantly, any decent algorithm has to take into account the “wisdom of the crowds” – in other words, it’s not just what you click on, but what others click on, especially others that are similar to you in some way (your friends, followers, or other cohorts, however that’s defined).  So, in reality, there are humans in the mix.

At the core, the concern seems to be about what Pariser calls the “self-looping and fragmenting effects” that can result from learning too well what a person likes.  But any Machine Learning practitioner and filter builder is so acutely aware of this effect (variously known as “overfitting” or “local optima” or “explore vs. exploit” or by a number of other technical terms in a myriad of flavors depending on what particular aspect of the problem you are looking at), that it hardly even bears mentioning.  It’s kind of like asking a truck driver to make sure he doesn’t run out of gas on his cross-country trip.  Those of us who work on this are always trying to avoid getting “stuck in a rut” by making sure we throw in enough variety and diversity to promote discovery.
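To make “explore vs. exploit” a bit more concrete, here is a minimal Python sketch of one textbook way to handle it, usually called epsilon-greedy. To be clear, this is an illustration of the general idea and not a description of what any particular company’s system does; the `pick_item` name, the `click_rates` input, and the 10% default are all assumptions of mine.

```python
import random


def pick_item(click_rates, epsilon=0.1, rng=random):
    """Choose one item to show, using a simple epsilon-greedy rule.

    click_rates: dict mapping item -> estimated click-through rate.
    epsilon: fraction of the time we deliberately try something at random.
    rng: source of randomness (the random module or a random.Random instance).
    """
    if rng.random() < epsilon:
        # Explore: show a randomly chosen item, whatever its past performance.
        return rng.choice(sorted(click_rates))
    # Exploit: show the item with the best estimated click-through rate.
    return max(click_rates, key=click_rates.get)
```

With `epsilon=0` the filter only ever shows what it already believes you like, which is exactly the rut Pariser worries about; nudging `epsilon` above zero is the simplest way to throw in some variety.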

At Yahoo (where I currently work), we do take a hybrid approach, at many levels, of incorporating humans into the algorithm.  But we are also always looking for ways to take humans out of the loop when that makes sense, which is often.  We can’t possibly serve up relevant content to hundreds of millions of people across the globe without some big data science and heavy lifting done by machines.

To their credit, the filter-fear-mongers do make a few points that I particularly like – for example, the suggestion that the algorithms need to be transparent enough and that people need control over how they work.  This part is particularly challenging for those of us doing Machine Learning for recommendation, primarily because the techniques we use are often not readily amenable to transparency and explicit control.  Regardless, I certainly agree – although the vast majority of people will be more than happy with the default settings of the algorithm, we always need, like in Star Trek, a manual override.  Presenting the user with the thousands of words, phrases, and other features we use to model their preferences just won’t work (especially after we’ve projected their “features” into a low-dimensional subspace in an effort to divine their latent semantics).

I’m mostly not worried about filters because their designers will naturally be kept on their toes by the people using the filters.  If your filter only suggests a narrow range of content, then people will stop using it or at least complain.  Pariser used the example of having his conservative friends filtered from his Facebook stream.  Well, he wasn’t the only one to complain, and Facebook re-instituted the option to get an unfiltered time-sorted newsfeed.

In the end, you are your own best filter.  As information gets easier and easier to publish, and as it gets more central to all of our lives and careers, the volume becomes unmanageable, and it behooves us to become active in our quest for relevant information, and not just swallow whatever the “big guys” decide to publish.  Filters will be one of the indispensable tools for helping us do just that.

The Search Tools They are A-Changing

This evening, Yahoo is announcing some cool enhancements to their Search functionality. You should check them out, and I’m not just saying that because I work for Yahoo Search, but because after several years of posturing by the big search engines, the paradigms for search on the web are finally actually shifting.

The big three engines (Google, Yahoo, and Bing) have finally started to diverge, much like the auto industry did in its early years. At first there was a plethora of small companies, mostly making custom vehicles, with varying levels of quality. Similarly, in the early days of the web, there arose a plethora of search engines (Lycos, HotBot, Alta Vista, Infoseek, etc. ad infinitum). Eventually, when Ford showed the superiority of the assembly line, consolidation began, and a single accepted definition of what a car was emerged. And likewise, Google led the way in ushering in high levels of relevance, comprehensiveness, and speed – with Yahoo (and later Bing) following in tow, and at times even surpassing the leader. But those features are just table stakes now. It’s the natural progression in a product development life cycle in a competitive market: first match your competition, then differentiate.

I’ve purposely used the term “Search Tools” in the title of this article instead of “Search Engines” because now that we’ve finally got the underlying engines (or, platforms, to use the technical term) in place for search, the fun is just now starting! Google, Yahoo, and Bing are all starting to really innovate and offer different ways of searching. Just like cars, each has its own personality. (And, just like cars, sometimes they share some of the core elements, as Yahoo and Bing are doing for their “algorithmic web” and advertising content.)

To continue the auto metaphor, if you had an SUV, an all-electric sedan, and a sports car sitting in your driveway, which would you use? The answer is: it depends. Going on a ski trip? Heading out to pick up the kids? Need to get out and blow off some steam? It depends. The same will soon be true of your search tool – and since we all have all of these tools at our disposal, why not use them all?

Google recently launched “Google Instant”, which shows you search results as you type. It is really slick and really fast. Yahoo doesn’t have it, and neither does Bing. It is a fundamentally different way of interacting with a search tool. Some people love it, some hate it. And you can choose whether you want to use it. Personally, I find it too distracting – I don’t always want to live my life like I’m hopped up on caffeine. But I am not you, and you may find it a godsend. Give it a whirl and see.

Likewise, Yahoo just launched some pretty cool features around entertainment searches (like searching for movies, actors, musicians, TV shows, famous people, etc.) as well as searches about newsworthy topics. We’re able to recognize these real-world entities and give you all of the most relevant information and the ability to get things done right there at the top of the search results page. As an example, try searching for “The Social Network” on Yahoo. Not only do you immediately see ratings, showtimes in your area, and a link to a trailer, but you can also buy tickets and, if you are a Netflix user, you can immediately add it to your queue (and if available, you can Instant Watch it – I especially like this last feature – so cool!). And, as you flip through the “accordion” tabs of Stories and Twitter, you get to see the most recent and relevant content – at your own pace. We’ve collected all of the “good stuff” in one place for you to browse at your leisure. If you do another search like “Lady Gaga”, the accordion changes accordingly to include News, Events (Shows), Albums, Videos, and Twitter. And get this: if you search for “Lady Gaga albums”, we’re smart enough to take you directly to the right tab. Collecting all the good, trusted stuff in one place – isn’t that what Yahoo was always known for?

Just like you wouldn’t drive your sports car on a camping trip, you can’t expect to always get the best results from your trusty old search engine anymore. So I invite you to explore. If you’re a Google user, try Yahoo for a week. If you’re a die-hard Yahoo fan, give Bing or Google a try. And keep exploring, because this is just the beginning of some exciting stuff in search.