Centroid-based Text Summarization in Python
June 2, 2020, 2:29 p.m. | Text Mining
Centroid-based summarization is an extractive technique: it extracts important sentences from one or more documents to form a summary. It works by identifying the most central sentences in a set of documents, that is, the sentences that give the necessary and sufficient amount of information about the main theme. A common way of identifying the central sentences is to represent the sentences in a vector space.
The centrality of a sentence is defined in terms of the centrality of its words. TF X IDF scores are used to measure the centrality of words. The words whose TF X IDF scores are above a predefined threshold form the centroid of the cluster.
In centroid-based summarization, the sentences that contain more words from the centroid of the cluster are considered central sentences. Those sentences are then output as the summary of the documents.
Centroid-based summarization was introduced by Radev, Blair-Goldensohn, and Zhang.
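To make the idea of a centroid concrete, here is a minimal sketch of how the centroid words of a cluster might be picked. It is not the implementation discussed in this post: the names `tokenize` and `centroid_words` are hypothetical, TF is assumed to be the total count of a word across the cluster, and IDF is assumed to be computed from the cluster itself with the smoothed form ln((1 + N) / (1 + df)) + 1 rather than from a large background corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def centroid_words(documents, threshold):
    """Return {word: TF X IDF score} for words scoring above `threshold`.

    Assumptions (illustrative only):
    - TF is the total count of the word across all documents in the cluster.
    - IDF is the smoothed form ln((1 + N) / (1 + df)) + 1, where N is the
      number of documents and df the number of documents containing the word.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Term frequency over the whole cluster.
    tf = Counter(word for tokens in tokenized for word in tokens)
    # Document frequency: count each word once per document.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    scores = {
        word: tf[word] * (math.log((1 + n_docs) / (1 + df[word])) + 1)
        for word in tf
    }
    # Keep only the words whose score exceeds the predefined threshold.
    return {word: score for word, score in scores.items() if score > threshold}


docs = [
    "storm hits coast",
    "storm floods coast town",
    "election results announced",
]
print(sorted(centroid_words(docs, threshold=2.0)))
```

The threshold is a tuning parameter: a higher value keeps fewer, more distinctive words in the centroid.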
Implementation of Text Summarization in Python
I have implemented centroid-based text summarization in Python. The above-mentioned code is purely for text summarization: it contains methods for calculating the IDF score of every word and the centroid score of every sentence. For Python basics, refer to Python Programming Examples.
The centroid score of every sentence is calculated from the TF X IDF scores of the words in the sentence. To keep the algorithm of centroid-based text summarization easy to follow, preprocessing techniques such as stop word removal and stemming are not included in the above implementation.
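As a rough illustration of the sentence-scoring step, the sketch below sums the centroid values of the words in each sentence and keeps the highest-scoring sentences. The function names, the naive sentence splitter, and the hand-written `centroid` dictionary are all assumptions made for this example. Like the description above, it does no stop word removal or stemming.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def centroid_score(sentence, centroid):
    """Sum the centroid value of every word occurrence in the sentence.

    Words that are not in the centroid contribute 0.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(centroid.get(word, 0.0) for word in words)


def summarize(text, centroid, n_sentences=2):
    """Return the `n_sentences` best-scoring sentences in original order."""
    sentences = split_sentences(text)
    # Rank sentence indices by centroid score, highest first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: centroid_score(sentences[i], centroid),
        reverse=True,
    )
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


# Hypothetical centroid: word -> TF X IDF score above the threshold.
centroid = {"storm": 2.5, "coast": 2.5}
text = (
    "The storm hit the coast. Voters went to the polls. "
    "The coast was flooded."
)
print(summarize(text, centroid, n_sentences=2))
```

Returning the chosen sentences in their original order is a design choice for readability; ranking alone would put the highest-scoring sentence first regardless of where it appeared.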