In this article, we explain how our summarization algorithm creates a summary of a text or an audio file.
Automatic summarization is the process of shortening a set of data computationally to create a subset that represents the most important or relevant information within the original content. In addition to text, images and videos can also be summarized.
The summarization algorithm relies on an NLP (Natural Language Processing) module, and the steps below show how to implement it with spaCy.
The following ideas are helpful for building a summarization algorithm:
- Text preprocessing (remove stop words and punctuation)
- Frequency table of words (word frequency distribution): how many times each word appears in the document
- Score each sentence depending on the words it contains and the frequency table
- Build the summary by joining every sentence above a certain score limit
1) Import the spaCy package.
2) Import the text-processing helpers (STOP_WORDS, punctuation).
3) Get the list of stop words from spaCy.
4) Store the original text in a variable.
5) Load spaCy's English model to work on English-language data, and create an NLP object.
6) Split the sentences of the paragraph into individual words to create tokens.
7) Iterate through the tokens and, for each word that is not in the stop-word list, increase its frequency by 1. The result is a table that maps each word to its frequency.
8) With the frequency of each word computed, normalize the frequencies: find the maximum of all frequencies from the previous step and divide the frequency of each word by that maximum.
9) Next, each sentence needs a score so that the best sentences can be picked for the summary. To do that, first tokenize the data into sentences.
10) With a list of sentences and a table of word frequencies available, generate a score for each sentence by adding the weighted frequencies of the words that occur in it. Since very long sentences are not wanted in a summary, score only those sentences with fewer than 30 words.
11) The previous step produces a dictionary of sentences with their scores. Now select the top N sentences by passing the required N, the sentence scores dictionary, and its values as the sort key to the nlargest function from Python's heapq module. It returns a list of the N sentences with the highest scores.
12) Convert the selected sentences to strings and join them to produce the final summary.
13) Final Output
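The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact implementation behind this article. It assumes spaCy v3 and, so that it runs without downloading a model, uses a blank English pipeline with the rule-based sentencizer rather than a trained model such as en_core_web_sm. The function name summarize and its parameters n_sentences and max_words are hypothetical, and nlargest is taken from Python's standard heapq module.

```python
from heapq import nlargest
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Assumption: a blank English pipeline plus the rule-based sentence splitter.
# A trained model could be loaded with spacy.load(...) instead.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def summarize(text, n_sentences=2, max_words=30):
    """Frequency-based extractive summary (illustrative sketch)."""
    doc = nlp(text)

    # Steps 6-7: count every token that is not a stop word or punctuation.
    word_freq = {}
    for token in doc:
        word = token.text.lower()
        if token.is_space or word in STOP_WORDS or word in punctuation:
            continue
        word_freq[word] = word_freq.get(word, 0) + 1
    if not word_freq:
        return ""

    # Step 8: normalize each frequency by the maximum frequency.
    max_freq = max(word_freq.values())
    for word in word_freq:
        word_freq[word] /= max_freq

    # Steps 9-10: score each sentence by summing its weighted word
    # frequencies. Sentence length is approximated by the token count.
    # Sentences are stored as strings, so step 12 reduces to a join.
    sentence_scores = {}
    for sent in doc.sents:
        if len(sent) >= max_words:
            continue
        score = sum(word_freq.get(t.text.lower(), 0.0) for t in sent)
        if score > 0:
            sentence_scores[sent.text.strip()] = score

    # Step 11: pick the N highest-scoring sentences.
    top = nlargest(n_sentences, sentence_scores, key=sentence_scores.get)

    # Step 12: join them into the final summary.
    return " ".join(top)


sample = "Python is great. Python makes Python code readable. Bananas are yellow."
print(summarize(sample, n_sentences=1))
```

The sentences come back in score order, as in the steps above. Sorting the selected sentences by their position in the text before joining them is a common variation when the summary should read in the original order.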
Summarization of audio
The recorded audio file first needs to be converted into text. This is done with the Python module SpeechRecognition. In this setup the module requires a working internet connection, because it uses Google's web-based speech-to-text engine. The engine accepts at most 1 minute of audio as input, so an audio file longer than 1 minute has to be divided into chunks of shorter duration.
To divide the audio file into chunks of processable duration, we first estimate the size of the audio file from its frame rate, its original duration, and its bit rate, and then divide that size by the file split size. This gives the total number of chunks, which are numbered and saved in WAV format. The Python module used to read the audio file parameters is pydub.
The formulae used are:
wav_file_size = (sample_rate * bit_rate * channel_count * duration_in_sec) / 8    (1)
file_split_size = 1000    (2)
total_chunks = wav_file_size // file_split_size    (3)
chunk_length_in_sec = math.ceil((duration_in_sec * 10000000) / wav_file_size)    (4)
chunk_length_ms = chunk_length_in_sec * 1000    (5)
chunks = make_chunks(myaudio, chunk_length_ms)    (6)
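As a worked example, the arithmetic in these formulae can be checked with plain Python before any audio is loaded. This is only a sketch: the function name chunk_plan and the parameter name target_chunk_bytes (for the constant 10000000) are hypothetical, and bit_rate is assumed to mean bits per sample, which is what makes the first formula give a size in bytes. In practice the audio parameters would be read from the file with pydub, and the resulting chunk_length_ms would be passed to make_chunks.

```python
import math


def chunk_plan(sample_rate, bit_rate, channel_count, duration_in_sec,
               file_split_size=1000, target_chunk_bytes=10_000_000):
    """Evaluate the chunking formulae for one audio file (sketch)."""
    # Estimated size of the uncompressed WAV data in bytes.
    wav_file_size = (sample_rate * bit_rate * channel_count * duration_in_sec) / 8

    # Number of pieces of file_split_size bytes that fit in the file.
    total_chunks = wav_file_size // file_split_size

    # Duration whose audio occupies roughly target_chunk_bytes, rounded up.
    chunk_length_in_sec = math.ceil(
        (duration_in_sec * target_chunk_bytes) / wav_file_size
    )

    # make_chunks expects the chunk length in milliseconds.
    chunk_length_ms = chunk_length_in_sec * 1000

    return {
        "wav_file_size": wav_file_size,
        "total_chunks": total_chunks,
        "chunk_length_in_sec": chunk_length_in_sec,
        "chunk_length_ms": chunk_length_ms,
    }


# Example: 10 minutes of 16 kHz, 16-bit, mono audio.
plan = chunk_plan(sample_rate=16000, bit_rate=16, channel_count=1,
                  duration_in_sec=600)
print(plan)
```

For this input the estimated size is 19,200,000 bytes and the chunk length comes out at 313 seconds, which is longer than the 1-minute limit mentioned earlier, so the constants would need to be lowered for audio in this format.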
The individual chunks are then processed to obtain the final text output of the original audio file. This text is then passed to the summarization algorithm described above to produce the summary.