Thursday, June 19, 2008

Google News Search Leaps Ahead

Google has dramatically enhanced its news search service, serving up a portal of real-time news drawn from more than 4,000 sources worldwide.

Until recently, Google's news search had been competent, but less useful than other news-aggregating services such as AllTheWeb's News Search and Yahoo's Full Coverage. The new enhancements establish Google as one of the premier news finding and filtering destinations on the web.

Like Yahoo's Full Coverage, Google News Search now looks like a portal, with links to the top headlines organized into categories such as Top Stories, World, Business, Sports and so on. Each category has its own area on the News Search home page, with headlines, descriptions and links for the top two or three stories.

"The page looks very different than the average Google page," said Marissa Mayer, Google product manager. That's because it's packed with headlines, descriptions, thumbnail photos and dozens of links to the sources of the articles online.

Unlike Yahoo Full Coverage, however, Google News Search isn't assembled by human editors who select and format the news. Google's process is fully automated. News stories are chosen and the page is updated without human intervention. Google crawls news sources constantly, and uses real-time ranking algorithms to determine which stories are the most important at the moment -- in theory highlighting the sources with the "best" coverage of news events.

Each top story is presented with a headline linked directly to the source. Beneath the headline are a short description, the name of the source, and the time since the article was last crawled, ranging from a few minutes to several hours ago.

Beneath the main headline and description are two full headlines from other sources, followed by four or five links to stories with only the name of the publication indicated. Finally, there are links to "related" stories from other sources.

This design makes it easy to quickly scan the headlines while having the option of reading multiple accounts of a story from different news sources -- from literally thousands of sources, for some stories.

Each major category has a link at the top of its respective section that allows you to scan news just within the category. Tabs on the upper left of each page also allow you to focus on the Top Stories, World, U.S., Business, Sci/Tech, Sports, Entertainment and Health categories.

Unlike many news aggregators that simply "scrape" headlines and links from news sites, Google's news crawler indexes the full text of articles. This approach offers several unique benefits.

For example, full text indexing allows true searching, rather than just browsing of headlines. Creating a full text index of news also allows Google to cluster related news stories around what Mayer calls a "centroid" of keywords. "A cluster is defined by a centroid of keywords, and all the articles have some of those key words in them," she said.
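To make the "centroid of keywords" idea concrete, here is a minimal sketch in Python. It is not Google's method, which the article does not describe in detail. It assumes a simple greedy scheme: each article is reduced to a set of keywords, and it joins the existing cluster whose accumulated keywords overlap most with its own, provided the overlap passes a threshold. The function names, the stopword list and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "on", "to", "and", "for", "is"})


def keywords(text):
    """Reduce text to a set of lowercase words, minus punctuation and stopwords."""
    words = (w.strip(".,!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def cluster_articles(articles, threshold=0.3):
    """Greedily group articles whose keywords overlap with a cluster's centroid.

    Each cluster keeps a Counter of every keyword seen in its members (the
    "centroid" in this sketch). Overlap is measured with Jaccard similarity
    between the article's keywords and the centroid's keywords.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [article index, ...]}
    for index, text in enumerate(articles):
        kws = keywords(text)
        best, best_score = None, 0.0
        for cluster in clusters:
            centroid_terms = set(cluster["centroid"])
            union = kws | centroid_terms
            score = len(kws & centroid_terms) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["centroid"].update(kws)   # fold the article into the centroid
            best["members"].append(index)
        else:
            clusters.append({"centroid": Counter(kws), "members": [index]})
    return clusters
```

Given two headlines about interest rates and one about a basketball game, this sketch puts the first two in one cluster and the third in its own. A production system would weight keywords rather than treat them all equally, which is presumably where the "relative importance" Mayer mentions comes in.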

The process uses artificial intelligence in addition to traditional information retrieval techniques to match keywords with stories. Mayer says this approach to identifying related articles means that the relative importance of each article is "baked in," which is how the top sources for each story are selected.

Other factors used in calculating the relevance of top and related stories include how recently articles were published, and the reputation of the source. When you actually do a search, these factors are also applied in addition to keyword analysis to determine how closely particular stories match your query.
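As a rough illustration of how keyword match, recency and source reputation might be blended into one relevance score, consider the sketch below. The article names the factors but not the formula, so everything here is an assumption: the weighted sum, the weights themselves, the six-hour half-life for recency, and the convention that keyword match and reputation are already scaled to the range 0 to 1.

```python
def story_score(keyword_match, age_hours, source_reputation,
                half_life_hours=6.0, weights=(0.5, 0.3, 0.2)):
    """Blend three factors into a single relevance score (illustrative only).

    keyword_match      -- how well the story matches the query, 0.0 to 1.0
    age_hours          -- hours since the article was published
    source_reputation  -- hypothetical reputation value for the source, 0.0 to 1.0
    half_life_hours    -- assumed: recency halves every this many hours
    weights            -- assumed weights for (match, recency, reputation)
    """
    # Exponential decay: 1.0 for a brand-new story, 0.5 after one half-life.
    recency = 0.5 ** (age_hours / half_life_hours)
    w_match, w_recency, w_reputation = weights
    return (w_match * keyword_match
            + w_recency * recency
            + w_reputation * source_reputation)
```

With these assumed weights, two stories that match a query equally well and come from equally reputable sources are separated only by age, so the fresher one ranks higher.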

On search results pages, a link allows you to override the default ranking by relevance and order results by date -- a feature that's particularly helpful for monitoring breaking news.

Google's decision to index the full text of news sources rather than simply scraping headlines posed a major challenge for implementing the new service. The vast diversity and typically cluttered design of most online news formats make them more difficult to crawl and index than many other types of web sites. "Article extraction has proven to be one of the most difficult aspects of the project," said Mayer.

Google crawls its 4,000 sources of news continuously and in real time. According to Mayer, the crawler continuously computes what's likely to change on each news source, and when the change is likely to occur. To expedite the discovery of new stories, the crawler tends to hit hub or major section pages frequently, to see what new links are there.
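One simple way to picture this kind of scheduling is a priority queue keyed on when each page is next expected to change. The sketch below is an assumption-laden illustration, not a description of Google's crawler: the class name, the fixed per-page intervals and the page names are all hypothetical, and a real crawler would re-estimate the intervals from observed changes.

```python
import heapq


class CrawlScheduler:
    """Visit pages in order of their next expected change (illustrative only)."""

    def __init__(self):
        self._heap = []       # entries are (next_due_time, url)
        self._interval = {}   # url -> assumed minutes between changes

    def add(self, url, expected_change_interval, now=0.0):
        """Register a page and schedule its first visit."""
        self._interval[url] = expected_change_interval
        heapq.heappush(self._heap, (now + expected_change_interval, url))

    def next_due(self):
        """Pop the page due soonest, reschedule it, and return (due_time, url)."""
        due, url = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (due + self._interval[url], url))
        return due, url
```

If a hub page is registered with a short interval and an individual article with a long one, successive calls to `next_due()` return the hub page several times for every visit to the article, which mirrors the behavior Mayer describes.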

While the news sources are crawled constantly and individual news stories are updated continuously, the entire set of displayed stories is "auto generated" every 15 minutes. A message in the upper right corner of the main news page indicates when it was last generated.

Google's updated news search is an exceptionally powerful tool for web users. It's still in beta, so there are a few rough edges, but all told it's one of the best news browsing and search portals currently operating on the web.

Google News Search
http://news.google.com

News Search Engines
http://searchenginewatch.com/links/news.html
