Centroid-Based Text Summarization in Python

2018-08-08

Introduction

Centroid-based summarization is an extractive summarization technique: it builds a summary by selecting important sentences from the document(s) rather than writing new text. It works by identifying the most central sentences across multiple documents, that is, the sentences that give the necessary and sufficient amount of information related to the main theme of the document(s). A common way of identifying the central sentences is to represent the sentences in a vector space.
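As a rough illustration of the vector-space idea, the sketch below turns each sentence into a word-count vector over a shared vocabulary. The function name `to_vectors` and the sample sentences are assumptions made for this example; they are not part of the implementation given later.

```python
from collections import Counter

def to_vectors(sentences):
    """Map each sentence to a word-count vector over a shared, sorted vocabulary."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One dimension per vocabulary word, holding that word's count.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = to_vectors(["the cat sat", "the dog sat down"])
print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Raw counts are the simplest possible weighting; the TF X IDF scores discussed next replace them with weights that favour distinctive words.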

The centrality of a sentence is defined in terms of the centrality of its words. TF X IDF scores are used to measure the centrality of words. Words whose TF X IDF score is above a predefined threshold form the centroid of the cluster.
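To make the thresholding step concrete, here is a minimal sketch that builds a centroid from a few toy documents. It rests on several assumptions that are not taken from the original method: the name `build_centroid`, the sample documents, the threshold of 0.5, using the cluster itself as the corpus for IDF, and taking TF as the total count of a word across the cluster.

```python
import math
from collections import Counter

def build_centroid(documents, threshold):
    """Return {word: tf * idf} for words whose score exceeds the threshold."""
    tokenized = [doc.lower().split() for doc in documents]
    # Term frequency: total occurrences of each word across the cluster.
    tf = Counter(word for words in tokenized for word in words)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for words in tokenized for word in set(words))
    n_docs = len(documents)
    centroid = {}
    for word, count in tf.items():
        score = count * math.log10(n_docs / df[word])
        if score > threshold:
            centroid[word] = score
    return centroid

docs = ["apple banana apple", "banana cherry", "banana apple date"]
centroid = build_centroid(docs, threshold=0.5)
print(sorted(centroid))  # ['apple']
```

In this toy cluster "banana" occurs in every document, so its IDF is zero, while "cherry" and "date" are rare but occur only once each; only "apple" is both frequent and distinctive enough to clear the threshold.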

In centroid-based summarization, the sentences that contain more words from the centroid of the cluster are considered central sentences. Finally, those sentences are output as the summary of the documents.
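The scoring step can be sketched as follows. The centroid values, the sample sentences, and the helper names `score_sentences` and `top_sentences` are all hypothetical and chosen only for illustration. Picking the top `n` sentences is one possible selection rule; a fixed score cutoff is another.

```python
def score_sentences(sentences, centroid):
    """Score each sentence as the sum of the centroid values of its words."""
    return [
        sum(centroid.get(word, 0.0) for word in sentence.lower().split())
        for sentence in sentences
    ]

def top_sentences(sentences, centroid, n):
    """Return the n highest-scoring sentences, kept in their original order."""
    scores = score_sentences(sentences, centroid)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n])
    return [sentences[i] for i in keep]

# Hypothetical centroid values, not computed from real documents.
centroid = {"summarization": 1.2, "centroid": 0.9, "sentences": 0.5}
sentences = [
    "centroid summarization ranks sentences",
    "the weather was pleasant",
    "summarization selects central sentences",
]
print(top_sentences(sentences, centroid, n=1))
# ['centroid summarization ranks sentences']
```

Keeping the selected sentences in their original order tends to make the resulting summary easier to read than listing them by score.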

Centroid-based summarization was introduced by Radev, Blair-Goldensohn, and Zhang.

Implementation of Text Summarization in Python

from os import listdir
import string
import math
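
# Calculate the inverse document frequency (IDF) score of a word.
def calculate_idf(word):
    doc_dir = r"E:\Works\Products\doc"  # Directory where the documents are located
    files = listdir(doc_dir)
    # Both counts start above zero, which keeps the ratio defined even
    # when the word appears in no document.
    count, wcount = 2, 1
    for file1 in files:
        # The with-block closes each file after it has been read.
        with open(doc_dir + "\\" + file1, 'r') as file:
            page = file.read()
        if word in page:  # Substring match, so "art" also matches "party"
            wcount += 1
        count += 1
    idf = count / wcount  # True division; assumes Python 3

    return math.log(idf, 10)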

# Calculate the centroid score of each sentence.
def calculate_centroid(sentences):

    # Compute the TF X IDF score of each word. Adding a word's IDF once
    # per occurrence gives (term frequency) X IDF.
    tfidf = dict()
    idf = dict()  # Cache so that each word's IDF is computed only once
    for sentence in sentences:
        for word in sentence.split():
            if word not in idf:
                idf[word] = calculate_idf(word)
            tfidf[word] = tfidf.get(word, 0) + idf[word]

    # Construct the centroid of the cluster: keep the scores of words
    # above the threshold and set the rest to zero.
    centroid = dict()
    threshold = 0.7
    for word in tfidf:
        if tfidf[word] > threshold:
            centroid[word] = tfidf[word]
        else:
            centroid[word] = 0

    # Score each sentence as the sum of the centroid values of its words.
    sentence_score = list()
    for sentence in sentences:
        sentence_score.append(sum(centroid[word] for word in sentence.split()))
    return sentence_score


# Read all documents and split the combined text into sentences.
doc_dir = r"E:\Works\Products\doc"  # Directory where the documents are located
page = ""
for file1 in listdir(doc_dir):
    with open(doc_dir + "\\" + file1, 'r') as file:
        page += file.read()
sentences = page.split(".")
sentence_score = calculate_centroid(sentences)


# Print the sentences whose centroid score exceeds the cutoff.
for i in range(len(sentences)):
    if sentence_score[i] > 15:
        print(sentences[i])

Implementation Details

I have implemented centroid-based text summarization in Python. The code above covers only the summarization itself: it contains one function that calculates the IDF score of a word and another that calculates the centroid score of every sentence. To understand Python basics, refer to Python Programming Examples.

The centroid score of every sentence is calculated from the TF X IDF scores of the words in the sentence. To keep the algorithm of centroid-based text summarization easy to understand, preprocessing techniques like stop word removal and stemming are not included in the above implementation.
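For readers who want to add that preprocessing, the sketch below shows roughly where it would fit. Everything in it is an assumption made for illustration: the stop-word list is a tiny sample, `naive_stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter's, and the function names are made up.

```python
import string

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def naive_stem(word):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [naive_stem(word) for word in words if word not in STOP_WORDS]

print(preprocess("The cats are jumping in the gardens."))
# ['cat', 'jump', 'garden']
```

One way to use it would be to replace each `sentence.split()` call with `preprocess(sentence)`, so that scores are computed over stems instead of raw tokens. The IDF lookup would then need to match stems against the documents in the same way.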