How to Summarize Text using Machine Learning Models

In this article, we introduce a few ways in which we can use recent advances in natural language processing and deep learning to summarize text. The techniques shown here have wide applications, from automatically extracting meaning from user reviews and analyzing legal contracts, to optimizing SEO strategies, supporting financial analysis, and pulling important information out of electronic medical records.
By Boris Delovski • Sep 29, 2021

Why we care about summarization

Automatic text summarization comprises a set of techniques that use algorithms to condense a large body of text, while at the same time preserving the important information included in the text. It is an area of computer automation that has seen steady development and improvement, although it does not get as much press as other machine learning achievements.

This is not to say that text summarization is of little importance; quite the contrary. A large amount of the information we create and exchange is in written form. Therefore, systems that can extract the core ideas from text while preserving the overall meaning stand to revolutionize entire industries, from health care to law to finance, by allowing us to share information faster and more efficiently.

Automatic summarization as a field is not limited to text. In fact, we can 'summarize' images and videos as well as text. Wikipedia defines automatic summarization as "the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content". The end goal, whether we summarize text, images or videos, is to reduce the amount of resources required to transmit and process data.

It's important to note that, here, by 'resources' we mean both the computer resources involved in data processing and the human cognitive resources required to parse and understand text. Humans, like machines, can only process a finite amount of data in a given unit of time, and while we cannot make our brains work faster (yet), we can certainly condense the information, effectively achieving a higher throughput of processed data.

 

Automatic text summarization

In this article, we will focus on the most common type of automatic summarization: automatic text summarization.

In recent years, this area has become a particular point of interest due to the explosion of written content available online. Everything from tweets to news articles to blog posts is text. This text may contain vital information for businesses, brands, financial asset traders and others, but the amount of text generated far outpaces our ability to process it. Unless, of course, we can summarize it intelligently. This is where automatic text summarization comes in.

The positive side of the explosion of written content available online is that we now have more training data we can use to build advanced summarization models. In fact, while early summarization algorithms rarely produced convincing results, newer models based on deep learning techniques and trained on vast amounts of data can produce impressive summaries.

 

The main types of text summarization

If we exclude human-assisted solutions, there are generally two main approaches to automatic text summarization:

  • Extractive summarization
  • Abstractive summarization

Let's focus on each one of these approaches and see some example code we can use.

 

Extractive summarization

Extractive summarization algorithms perform a seemingly very simple task: they take in the original text document and extract parts of it that they deem important. This means that they do not create new data (new sentences). Instead, these models simply select parts of the original data which are most important and combine them to form a summary.

This is in contrast to how most humans summarize text. Instead of simply copying over the most important sentences, a well-written summary authored by a human will contain new sentences that capture just the most important points of the original text. Fortunately, we do have models that work in a similar fashion, and we will cover them when we talk about abstractive summarization.

Now, back to extractive summarization. The key part of extractive summarization is determining which parts of a document are the most important. There are several ways we can approach this task, but they roughly fall into one of the following two categories:

  • techniques that use text topics
  • techniques that use sentence features

 

Topic-based summarization

Topic-based approaches create summaries by including only the sentences that are the most important for the topics covered by a document. To determine whether a sentence is important for a topic or not, topic-based summarization algorithms consider the average importance of the words inside the sentence.

Typically, we use a method like TF-IDF or similar to assign a weight to each word in each sentence. These weights are low for words unrelated to the document's topics, and higher for words closely tied to its important topics. The importance of each sentence can then easily be calculated as the average importance of the words it contains. Once we have the average importance score of a sentence, we compare it to some previously chosen threshold. If the importance is higher than the threshold, we include the sentence in the summary; otherwise, we leave it out.
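
To make this more concrete, here is a minimal sketch of topic-based sentence scoring built on scikit-learn's TfidfVectorizer. The helper name and the threshold value are ours, chosen purely for illustration; a production system would add proper sentence splitting, stop-word handling and a tuned threshold.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(sentences, threshold=0.35):
    # Treat each sentence as a "document" and compute TF-IDF weights
    vectorizer = TfidfVectorizer(stop_words="english")
    weights = vectorizer.fit_transform(sentences).toarray()

    # Score each sentence as the average weight of the words it contains
    scores = []
    for row in weights:
        nonzero = row[row > 0]
        scores.append(nonzero.mean() if nonzero.size else 0.0)

    # Keep only sentences whose score clears the threshold,
    # preserving their original order
    return [s for s, score in zip(sentences, scores) if score > threshold]

sentences = [
    "Deep learning is part of a broader family of machine learning methods.",
    "Machine learning methods are widely used across many industries.",
    "By the way, the weather was lovely yesterday.",
]
print(tfidf_summary(sentences))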

Summarization based on sentence features

Approaches based on sentence features are a bit more complex. This is also where machine learning starts showing its potential. To a certain degree, we can treat text summarization as a binary classification problem, where we need to categorize sentences into those that should go into the final summary and those that shouldn't.
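
As a rough illustration of this framing (and nothing more), the sketch below represents each sentence with a few simple hand-crafted features and trains a binary classifier to predict whether a sentence belongs in the summary. The features and the tiny labeled dataset are invented purely for demonstration; a real system would need far more data and far richer features.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_features(sentence, position, doc_length):
    words = sentence.split()
    return [
        len(words),                          # sentence length
        position / max(doc_length - 1, 1),   # relative position in the document
        sum(w[0].isupper() for w in words),  # crude count of capitalized words
    ]

# Hypothetical training data: (sentence, position, doc_length, belongs_in_summary)
labeled_sentences = [
    ("Deep learning is a family of machine learning methods.", 0, 4, 1),
    ("It has many applications.", 1, 4, 0),
    ("Neural networks were inspired by biological brains.", 2, 4, 1),
    ("That is all for today.", 3, 4, 0),
]

X = np.array([sentence_features(s, p, n) for s, p, n, _ in labeled_sentences])
y = np.array([label for _, _, _, label in labeled_sentences])

classifier = LogisticRegression().fit(X, y)

# Predict whether a new sentence should be included in a summary
print(classifier.predict([sentence_features("Transformers dominate modern NLP.", 0, 1)]))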

In this category of extractive summarization techniques, graph-based approaches are among the most promising. One such example, the TextRank algorithm, achieves great results on text summarization tasks. For this reason, we chose TextRank as our extractive summarization algorithm and will demonstrate how it solves a typical text summarization problem.

Before that, however, let's talk a bit about TextRank and the ideas behind it.

 

How does the TextRank algorithm work?

Inspired by Google's PageRank algorithm, TextRank chooses important sentences by "ranking" all the sentences in the text. After the sentences are ranked, the top n ranked sentences are used to create a summary. One thing that is important to keep in mind here is that the rank of a sentence does not affect its position in the resulting summary. Instead, the order the selected sentences had in the original text is preserved.

The process of ranking sentences is fairly straightforward. We start by creating a graph in which every node represents a sentence from the original text we want to summarize. We then link each sentence in this graph to other similar sentences. These links are the edges of the resulting graph. In this graph, each sentence will point to other sentences that hold similar information.

[Image: illustration of a sentence graph. Source: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf (the original TextRank paper)]

The resulting edges of the graph are weighted. We then run a graph-based ranking formula over this weighted graph to determine the most important sentences in the original text and create the final summary. The math behind this formula is outside the scope of this article, but if you are curious, take a look at section 2.2 of the original TextRank paper linked above.

One thing to keep in mind is that there are various ways to define sentence similarity, and you must choose between them carefully. The chosen definition of sentence similarity is the backbone of the TextRank algorithm and therefore greatly affects the text summarization process. Whichever measure of similarity you choose in the end, it needs to be a function that reflects how much two sentences overlap in terms of content.
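
Before we move on to a ready-made implementation, here is a rough sketch of the whole idea using networkx. The similarity function is the simple word-overlap measure in the spirit of the original TextRank paper (with a small tweak to avoid zero denominators); treat it purely as an illustration, since the library we use below works with a more sophisticated setup.

import math
import networkx as nx

def overlap_similarity(s1, s2):
    # Word-overlap similarity: shared words, normalized by the log of the
    # sentence lengths (+1 avoids a zero denominator for one-word sentences)
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    denominator = math.log(len(w1) + 1) + math.log(len(w2) + 1)
    return len(w1 & w2) / denominator if denominator else 0.0

def textrank_summary(sentences, top_n=2):
    # Build a weighted graph: nodes are sentences, edge weights are similarities
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = overlap_similarity(sentences[i], sentences[j])
            if weight > 0:
                graph.add_edge(i, j, weight=weight)

    # Rank the sentences with PageRank and keep the top n
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_n]

    # Preserve the order the sentences had in the original text
    return [sentences[i] for i in sorted(top)]

Calling textrank_summary on a list of sentences returns the top-ranked sentences in the order they appeared in the original text.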

And now, let's see extractive summarization using the TextRank algorithm in action.

 

How to use the TextRank algorithm for extractive text summarization

We will use the implementation of TextRank available as a spaCy pipeline extension. spaCy is an excellent Python package for solving natural language processing tasks. We will also need pytextrank, which actually implements the TextRank algorithm as a spaCy extension.

So let's start by importing everything we need first:

import spacy
import pytextrank

Next, we'll define a typical spaCy pipeline, and afterward include the TextRank algorithm in the pipeline:

# Create spaCy pipeline and add textrank to it

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")

NOTE: If this is your first time using spaCy, you'll need to download the required spaCy model. In this case, we use the en_core_web_lg model, which you can install by running:

 

python -m spacy download en_core_web_lg

 

After we create the spaCy pipeline, let's create a sample text we want to summarize and save it in a variable:

example_text = """Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part."""

In the code cell above, you can scroll to the right to read the entire text, but to make it easier to read, here is the full text again:

 

Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part.

 

Ok, so now that everything is ready, we can process our example text data with spaCy and get the summary:

doc = nlp(example_text)

And that's pretty much it. The resulting doc object created by the spaCy pipeline exposes many attributes and methods useful for natural language processing tasks. Once we have this doc object, we can easily access the summary sentences:

for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    print(sent)

In the code above, we use the summary method to access the summary sentences extracted by the TextRank algorithm. We also set the limit_phrases and limit_sentences parameters to control the size of the summary. The resulting summary we get is the following:

 

Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue.

 

We can also take a look at the top 10 ranked phrases in the document:

# Collect (phrase, rank) pairs; doc._.phrases is already sorted by rank
phrases_and_ranks = [
    (phrase.chunks[0], phrase.rank) for phrase in doc._.phrases
]

# Show the top 10 ranked phrases
phrases_and_ranks[:10]

And these are the top 10 phrases we get:

[
    (deep neural networks, 0.10424066153650091), 
    (neural networks, 0.10090545971441124), 
    (Artificial neural networks, 0.0997136895890828), 
    (artificial neural networks, 0.0997136895890828), 
    (convolutional neural networks, 0.09753649303847897), 
    (recurrent neural networks, 0.09700653361002709), 
    (machine learning methods, 0.09661517111115507), 
    (deep structured learning, 0.09566339365301746), 
    (deep belief networks, 0.0918758372580297), 
    (human expert performance, 0.0912476607542944)
]

So it seems that we get pretty good results using the TextRank algorithm.

Extractive summarization methods, however, have a pretty significant shortcoming. Even though they produce a shortened version of the text we want to summarize, they do not really summarize the text in the classical sense. Instead of creating new text (new data) that condenses the information contained in the original, extractive summarization methods simply return a trimmed version of the original text, with the sentences deemed unimportant removed.

We can do better than this using abstractive summarization methods.

 

Abstractive summarization

We already discussed extractive summarization, which simply tries to identify the most important sentences in a given text document and uses those as the summary. We also showed some example code using TextRank, one of the more promising extractive summarization algorithms.

Next, we will talk about abstractive summarization.

Instead of reusing parts of the original text document, abstractive summarization methods mimic humans by creating completely new sentences to describe its key concepts. These new sentences often use words not present in the original text and aim to contain just the core information, with everything unimportant left out.

Abstractive summarization techniques take a more human-like approach to text summarization, so it is no surprise that they primarily rely on deep learning models. While these models initially used RNN (recurrent neural network) based architectures, the models that have more recently taken over the world of natural language processing are the so-called transformers.

 

Transformers

Transformers are a family of models built to transform an input sequence into an output sequence using a special encoder-decoder architecture. What sets transformers apart is the inclusion of a "self-attention" mechanism, along with a few other modifications such as positional encoding. We won't go into too much detail about transformers, as the topic would require a dedicated article to cover in depth. For technical details, you can check out the original "Attention Is All You Need" paper.
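
To give at least a flavor of what "self-attention" means, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. It is deliberately stripped down: real transformers use learned projections for the queries, keys and values, multiple attention heads, and many stacked layers.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each row of Q, K and V corresponds to one token in the sequence
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted mix of value vectors

# Toy example: a "sequence" of 3 tokens, each embedded in 4 dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))

# In a real transformer, Q, K and V come from learned linear projections of x
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)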

For now, it is enough for you to know that transformers can, given some input text, generate completely new text. In the case of abstractive summarization, transformers take the original text as the input and generate the summary text as the output.

Text summaries created using transformers are usually of high quality and, because they are generated by the model from scratch, consist of original sentences. The fascinating thing is that the entire procedure - reading the original text document, paying attention to certain parts of it, and finally generating new text - pretty much mimics how we humans create summaries of text documents.

Transformers have a wide array of applications and we won't look into them too much here. It is important to know, however, that not all transformers are meant to be used for text summarization.

In this article, I want to draw your attention to a relatively new model, called PEGASUS, which seems to be near the very top in terms of output quality when it comes to text summarization.

 

How does the PEGASUS model work?

PEGASUS is similar to other transformer models. The main difference comes from a special technique used during the model pre-training. During PEGASUS pre-training, the most important sentences in the training text corpora are masked (hidden from the model). The model is trained to generate these sentences as one output sequence.

It turns out that this ability to generate important sentences from a given text is very close to what is needed for successful abstractive summarization. The creators of PEGASUS trained it on a very large number of web pages and news articles. However, the model can be further fine-tuned on small datasets (our own text data) to achieve very good results on domain-specific text.
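
To give a feel for this pre-training objective (known as gap-sentence generation), here is a toy sketch of how a single training example might be constructed. The importance heuristic and the mask token below are simplified stand-ins for what the real pre-training pipeline does, which relies on a ROUGE-based sentence selection.

def make_gap_sentence_example(sentences, n_masked=1):
    # Toy importance score: how many words a sentence shares with the rest
    # of the document (the actual pre-training uses a ROUGE-based criterion)
    def importance(idx):
        others = " ".join(s for i, s in enumerate(sentences) if i != idx).lower().split()
        return len(set(sentences[idx].lower().split()) & set(others))

    ranked = sorted(range(len(sentences)), key=importance, reverse=True)
    masked_ids = set(ranked[:n_masked])

    # The masked sentences become the target; the rest, with a placeholder
    # mask token in place of the removed sentences, become the model input
    model_input = " ".join(
        "<mask_1>" if i in masked_ids else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in sorted(masked_ids))
    return model_input, target

doc_sentences = [
    "Deep learning is part of a broader family of machine learning methods.",
    "Learning can be supervised, semi-supervised or unsupervised.",
    "Deep learning methods are based on artificial neural networks.",
]
model_input, target = make_gap_sentence_example(doc_sentences)
print(model_input)
print(target)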

Aside from achieving state-of-the-art results, PEGASUS and other transformer models are very practical to use. Not only can they be fine-tuned for better results on domain-specific text if needed, but they can also achieve great results out of the box, with no fine-tuning at all.

 

How to use the PEGASUS model for abstractive text summarization

Fortunately for us, the Hugging Face transformers package makes it very easy to work with transformer models. If you already have a deep learning backend such as TensorFlow 2.0 or PyTorch installed (the examples below use PyTorch), installing the transformers package is straightforward:

 

pip install transformers

 

or, if you are using Anaconda or Miniconda:

 

conda install -c huggingface transformers

 

You can read more about the installation process in the official transformers documentation. We are going to start, as usual, by importing all the necessary packages and modules:

from transformers import PegasusForConditionalGeneration
from transformers import PegasusTokenizer
from transformers import pipeline

Next, let's define the model we plan on using and load the pretrained tokenizer. We will use the pegasus-xsum model, but you can pick any PEGASUS model from the Hugging Face model hub:

# Pick model
model_name = "google/pegasus-xsum"

# Load pretrained tokenizer
pegasus_tokenizer = PegasusTokenizer.from_pretrained(model_name)

We will again create a variable to store the text we want to summarize:

example_text = """Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part."""

In the code cell above, you can scroll to the right to read the entire text, but to make it easier to read, here is the full text again:

 

Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part.

 

Now we can create our model and generate the tokens the model will use:

# Define PEGASUS model
pegasus_model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Create tokens
tokens = pegasus_tokenizer(example_text, truncation=True, padding="longest", return_tensors="pt")

Because the model works with tokens, the summary it generates is also encoded. The full procedure is therefore to first generate the encoded summary, and then decode it back into text:

# Summarize text
encoded_summary = pegasus_model.generate(**tokens)

# Decode summarized text
decoded_summary = pegasus_tokenizer.decode(
    encoded_summary[0],
    skip_special_tokens=True
)

The final summary is available in the decoded_summary variable. If we print it out, we get the following:

 

Deep learning is a branch of computer science that deals with the study and training of machine learning.

 

This is a very short summary. If we want or need to, we can customize the length of the summary. To do so, let's start by defining a summarization pipeline:

# Define summarization pipeline 
summarizer = pipeline(
    "summarization", 
    model=model_name, 
    tokenizer=pegasus_tokenizer, 
    framework="pt"
)

In the code above, model_name and pegasus_tokenizer are the same variables we defined earlier. Next, we'll create the summary using this summarization pipeline. This time, we will specify a minimum and a maximum length for the summary:

# Create summary 
summary = summarizer(example_text, min_length=30, max_length=150)

And finally, we can check out the text summary produced:

summary[0]["summary_text"]

Using the sample text and code above, the new summary we get is:

 

Deep learning is a branch of computer science which deals with the study and training of complex systems such as speech recognition, natural language processing, machine translation and medical image analysis. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

 

This is quite an improvement over the summary produced using the TextRank algorithm.
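
Earlier we mentioned that PEGASUS can also be fine-tuned on your own data to produce better domain-specific summaries. Below is a minimal, single-step PyTorch sketch of what that looks like, reusing the pegasus_model and pegasus_tokenizer objects from above. The target summary here is a hypothetical placeholder; a real fine-tuning run would loop over a full dataset of (document, summary) pairs, typically with the Trainer API, rather than perform a single optimization step.

import torch

# Hypothetical reference summary for our example text, used only for illustration
target_summary = "Deep learning uses multi-layered neural networks and powers many modern applications."

# Tokenize the document (model input) and the reference summary (labels)
inputs = pegasus_tokenizer(example_text, truncation=True, padding="longest", return_tensors="pt")
labels = pegasus_tokenizer(target_summary, truncation=True, padding="longest", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(pegasus_model.parameters(), lr=5e-5)

pegasus_model.train()
optimizer.zero_grad()
outputs = pegasus_model(**inputs, labels=labels)  # the forward pass returns the loss
outputs.loss.backward()
optimizer.step()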

 

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.