IBM Many Aspects Document Summarization Tool

A tool that highlights various angles of a document's content.

Date Posted: July 1, 2008

What is IBM Many Aspects Document Summarization Tool?

IBM® Many Aspects Document Summarization Tool is a document summarization system that ingests a text document and automatically highlights a set of sentences expected to cover the different aspects of the document's content. The user chooses the number of sentences to include in the summary. The sentences are picked using two criteria: coverage (together, the selected sentences should span as much of the document's content as possible) and orthogonality (each selected sentence should capture a different aspect from the others).

The demand for summaries with high coverage and high orthogonality is amplified by today's Web 2.0 applications. For example, in the online comments and discussions that follow blogs, videos, and news articles, it is desirable to have a summary that highlights the different angles of these comments, because each comment often has a different focus. With IBM Many Aspects Document Summarization Tool, you can get a concise yet comprehensive overview of a document without spending a lot of time drilling down into the details.

For comparative analysis and exploratory flexibility, the system also includes other off-the-shelf text summarization methods, such as k-median clustering and singular value decomposition (SVD). Thus the system allows you to explore the content of the input document in many different ways.

How does it work?

The core algorithm is based on a novel combinatorial formulation of the document summarization problem, combined with a greedy search strategy; the formulation captures both the coverage and the orthogonality requirements. You simply load a plain-text document and specify the length of the summary; the system then automatically identifies and highlights the most important sentences.
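The idea of a greedy search that balances coverage and orthogonality can be illustrated with a minimal sketch. This is not the tool's actual algorithm, whose exact objective is not given here; it assumes a simple stand-in objective in which each step picks the sentence that adds the most words not yet covered by the summary. The function names (`split_sentences`, `words`, `greedy_summary`) and the naive regex-based sentence splitter are hypothetical and chosen only for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    """Return the set of lowercased alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def greedy_summary(text, k):
    """Pick up to k sentences, each adding the most not-yet-covered words.

    Stand-in objective (assumption, not the published formulation):
    maximizing newly covered words rewards coverage, and because words
    already in the summary earn nothing, overlapping sentences are
    penalized, which approximates orthogonality.
    """
    sentences = split_sentences(text)
    word_sets = [words(s) for s in sentences]
    covered = set()   # words covered by the summary so far
    chosen = []       # indices of selected sentences

    for _ in range(min(k, len(sentences))):
        best, best_gain = None, 0
        for i, ws in enumerate(word_sets):
            if i in chosen:
                continue
            gain = len(ws - covered)  # marginal coverage of sentence i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:              # no sentence adds anything new
            break
        chosen.append(best)
        covered |= word_sets[best]

    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

For the input "Cats purr. Cats purr loudly. Dogs bark at night." with k = 2, this sketch selects "Cats purr loudly." and "Dogs bark at night." and skips "Cats purr.", whose words are already covered. Each of the k rounds scans all m sentences and does set work bounded by the n distinct words, which matches the shape of an O(k*m*n) bound.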

The entire system consists of the following three modules:

Theoretically, for a document with m sentences and n unique words, and a summary of k sentences, the total running time of the algorithm is O(k*m*n), which is linear in the size of the input. By contrast, the complexity of SVD, a very popular approach to text summarization, is O(m^2*n + m*n^2 + k*m*n).

The algorithm takes less than a couple of seconds to compute a ten-sentence summary of a 57 KB document (approximately 566 sentences), whereas SVD takes approximately four minutes for the same task.
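To see roughly why the gap is so large, the two complexity bounds can be compared directly. The worked ratio below assumes that k denotes the number of sentences in the summary and treats the big-O expressions as operation counts, ignoring constant factors, so it is only an indication and not a prediction of actual running times.

```latex
\[
\frac{m^{2}n + mn^{2} + kmn}{kmn} \;=\; \frac{m}{k} + \frac{n}{k} + 1
\]
% With the figures quoted above, m = 566 sentences and k = 10:
\[
\frac{m}{k} + \frac{n}{k} + 1 \;=\; 56.6 + \frac{n}{10} + 1 \;=\; 57.6 + \frac{n}{10}
\]
```

Under these assumptions the SVD bound exceeds the greedy bound by a factor of at least 57.6, and the factor grows further with the number of unique words n, which is not stated for this document.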

For further information, please see "ManyAspects: A System for Highlighting Diverse Concepts in Documents," by Kun Liu, Evimaria Terzi, and Tyrone Grandison, published at the International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, in August 2008.

About the technology author(s)

Kun Liu, Ph.D., joined the IBM Almaden Research Center as a postdoctoral researcher in January 2007. He is working with the Intelligent Information Systems team on healthcare informatics and privacy-preserving social network analysis. Dr. Liu earned his Ph.D. in computer science from the University of Maryland, Baltimore County.

Evimaria Terzi, Ph.D., has been a researcher at IBM Almaden since June 2007. She obtained her Ph.D. from the University of Helsinki, Finland, in January 2007; her M.Sc. from Purdue University in 2002; and her B.Sc. from the University of Thessaloniki, Greece, in 2000. Her research interests are in the area of algorithmic data mining with applications to social-network analysis, information retrieval, and databases.

Tyrone Grandison, Ph.D., leads the Intelligent Information Systems in the Healthcare Informatics group at the IBM Almaden Research Center. Dr. Grandison received his B.Sc. and M.Sc. from the University of the West Indies, Jamaica, and his Ph.D. from the Imperial College of Science, Technology and Medicine at the University of London, United Kingdom. He is a senior member of both the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).
