How to Summarize Text Using Machine Learning Models

The techniques shown here have wide applications.
By Boris Delovski • Updated on May 8, 2023

Why You Should Care About Summarization

Automatic text summarization comprises a set of techniques that use algorithms to condense a large body of text, while at the same time preserving the important information included in the text. It is an area of computer automation that has seen steady development and improvement, although it does not get as much press as other machine learning achievements.

 

This is not to say that text summarization is of little importance; quite the contrary. A large amount of the information we create and exchange is in written form. Therefore, systems that can extract the core ideas from a text while preserving its overall meaning stand to revolutionize entire industries, from health care to law to finance, by allowing people to share information faster and more efficiently.

Automatic summarization as a field is not limited to text. In fact, you can "summarize" images and videos as well as text. Wikipedia defines automatic summarization as "the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content". The end goal, whether we summarize text, images or videos, is to reduce the amount of resources required to transmit and process data.

 

It's important to note that, here, by 'resources' we mean both the computer resources involved in data processing and the human cognitive resources required to parse and understand text. Humans, like machines, can process only a finite amount of data per unit of time, and while you can't make your brain work faster (yet), you can certainly condense the information, essentially achieving a higher throughput of data processed per unit of time.

 

How to Automate Text Summarization

In this article, we will focus mostly on the most common type of automatic summarization: automatic text summarization.

In recent years, this area has become a particular point of interest due to the explosion of written content available online. Everything from tweets, to news articles, to blog posts includes text. This text may contain vital information for businesses, brands, financial asset traders, etc., but the amount of text generated far outpaces our ability to process it. Unless, of course, we can summarize it intelligently. This is where automatic text summarization comes in.

The positive side of the explosion of written content available online is that we now have more training data we can use to create advanced summarization models. In fact, while early summarization algorithms were never really very good, newer models based on deep-learning techniques and trained on vast amounts of data can produce impressive results.

 

What Are the Main Types of Text Summarization?

If you exclude human-assisted solutions, you generally have two main approaches to automatic text summarization:

 

  • Extractive summarization
  • Abstractive summarization

Let's focus on each one of these approaches and see some example code you can use.


How to Use Extractive Summarization

Extractive summarization algorithms perform a seemingly very simple task: they take in the original text document and extract parts of it that they deem important. This means that they do not create new data (new sentences). Instead, these models simply select parts of the original data which are most important and combine them to form a summary.

This is in contrast to how most humans summarize text. Instead of simply copying over the most important sentences, a well-written summary authored by a human will contain new sentences that capture just the most important points of the original text. Fortunately, there are models that work in a similar fashion, and I will cover them when I talk about abstractive summarization.

Now, back to extractive summarization. The key part of extractive summarization is determining which are the most important parts of a document. There are several ways we can approach this task, but they roughly fall into one of the following two categories:

 

  • Techniques that use text topics
  • Techniques that use sentence features

 

Topic-Based Summarization

Topic-based approaches create summaries by including only the sentences that are the most important for the topics covered by a document. To determine whether a sentence is important for a topic or not, topic-based summarization algorithms consider the average importance of the words inside the sentence.

Typically, you use a method such as TF-IDF to assign a weight to each word in each sentence. These weights are closer to 0 for words unrelated to the text's topics, and closer to 1 for words related to its important topics. The importance of each sentence can then easily be calculated as the average importance of the words it contains. Once you have the average importance score of a sentence, you can compare it to a previously chosen threshold. If the score is higher than the threshold, you include the sentence in the summary; otherwise, you leave it out.
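To make this concrete, here is a minimal sketch of threshold-based sentence scoring with TF-IDF, using scikit-learn. The scoring rule and the threshold value are illustrative choices, not part of any particular summarization library:

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extractive_summary(sentences, threshold=0.1):
    # Treat each sentence as its own "document" so TF-IDF reflects
    # how characteristic each word is for that sentence
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)

    # Score each sentence as the average weight of its non-zero terms
    scores = tfidf.sum(axis=1).A1 / (tfidf.getnnz(axis=1) + 1e-9)

    # Keep sentences whose score exceeds the threshold, in original order
    return [s for s, score in zip(sentences, scores) if score > threshold]

In practice, you would tune the threshold (or simply keep the top n highest-scoring sentences) depending on how long you want the summary to be.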

Feature-Based Summarization

Approaches based on sentence features are a bit more complex. This is also where machine learning starts showing its potential. To a certain degree, you can treat text summarization as a binary classification problem, where you need to categorize sentences into those that should go into the final summary and those that shouldn't.

In this category of extractive summarization techniques, graph-based approaches are the most promising. One such example, the TextRank algorithm, achieves great results on text summarization tasks. For this reason, I will use TextRank as the extractive summarization algorithm of choice and demonstrate how it solves a typical text summarization problem.

Before that, however, let's talk a bit about Text Rank and the ideas behind it.

 

How Does the Text Rank Algorithm Work?

Inspired by Google's PageRank algorithm, TextRank chooses important sentences by "ranking" all sentences in the text. After the sentences are ranked, the top n ranked sentences are used to create a summary. One thing that is important to keep in mind here is that the rank of a sentence does not affect its position in the resulting summary. Instead, the order in which the selected sentences appear in the original text is preserved.

The process of ranking sentences is fairly straightforward. You start by creating a graph in which every node represents a sentence from the original text you want to summarize. You then link each sentence in this graph to other similar sentences. These links are the edges of the resulting graph. In this graph, each sentence will point to other sentences that hold similar information.

Image Source: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

The resulting edges of the graph are weighted. You then run a graph-based ranking formula over this weighted graph to determine the most important sentences in the original text and create the final summary. The math behind this formula is outside the scope of this article, but if you are curious, take a look at section 2.2 of this paper.
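For reference, the weighted ranking formula from that paper assigns each sentence (node) $V_i$ a score

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where $w_{ji}$ is the similarity weight of the edge between sentences $V_j$ and $V_i$, and $d$ is a damping factor usually set to around 0.85, just as in PageRank.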

One thing to keep in mind is that there are various ways to define sentence similarity, and you must consider them carefully. The chosen definition of sentence similarity is the backbone of the Text Rank algorithm and therefore greatly affects the text summarization process. Whichever measure of similarity you choose in the end, it needs to be a function that represents how much two sentences overlap in terms of content.
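To make the whole pipeline concrete, here is a small from-scratch sketch of the TextRank idea (not the pytextrank implementation used below). It uses TF-IDF cosine similarity as the sentence-similarity measure and networkx to run PageRank over the weighted graph; both choices are illustrative:

import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, n_sentences=2):
    # Represent each sentence as a TF-IDF vector
    tfidf = TfidfVectorizer().fit_transform(sentences)

    # Edge weights = cosine similarity between sentence vectors
    similarity = cosine_similarity(tfidf)
    np.fill_diagonal(similarity, 0.0)  # no self-loops

    # Build a weighted graph and rank the sentences with PageRank
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)

    # Take the top-ranked sentences, but keep their original order
    top = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]

The pytextrank extension used below implements a more refined version of this idea as part of a spaCy pipeline, so you don't have to build the graph yourself.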

And now, let's see extractive summarization using the Text Rank algorithm in action.

 

How to Use the Text Rank Algorithm for Extractive Text Summarization

You can use the implementation of TextRank available as a spaCy pipeline extension. spaCy is an excellent Python package for solving natural language processing tasks. You will also need pytextrank, which actually implements the TextRank algorithm as a spaCy extension.

So let's start by importing everything you need first:

import spacy
import pytextrank
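If you don't already have these packages installed, both are available on PyPI, so you can typically install them with pip:

pip install spacy pytextrank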

Next, you'll define a typical spaCy pipeline, and afterward include the TextRank algorithm in the pipeline:

# Create spaCy pipeline and add textrank to it

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")

NOTE: If this is your first time using spaCy, you'll need to download the required spaCy model. In this case, you use the en_core_web_lg model, which you can install by running:

python -m spacy download en_core_web_lg

After you create the spaCy pipeline, let's create a sample text you want to summarize and save it in a variable:

example_text = """Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part."""

In the code cell above, you can scroll to the right to read the entire text, but for easier reading, here is the text again:

Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part.

Ok, so now that everything is ready, you can process this example text data with spaCy and get the summary:

doc = nlp(example_text)

And that's pretty much it. The resulting doc object created by the spaCy pipeline exposes many attributes and methods useful for natural language processing tasks. Once you create this doc object, you can easily access the summary sentences:

for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
      print(sent)

In the code above, you use the summary method to access the summary sentences extracted by the TextRank algorithm. You also set the limit_phrases and limit_sentences parameters, to control the size of the summary. The resulting summary you'll get is the following:

Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue.

You can also take a look at the top 10 ranked phrases in the document:

# doc._.phrases is already sorted by rank, highest first
phrases_and_ranks = [
    (phrase.chunks[0], phrase.rank) for phrase in doc._.phrases
]
phrases_and_ranks[:10]

These are the top 10 phrases you get:

[
    (deep neural networks, 0.10424066153650091), 
    (neural networks, 0.10090545971441124), 
    (Artificial neural networks, 0.0997136895890828), 
    (artificial neural networks, 0.0997136895890828), 
    (convolutional neural networks, 0.09753649303847897), 
    (recurrent neural networks, 0.09700653361002709), 
    (machine learning methods, 0.09661517111115507), 
    (deep structured learning, 0.09566339365301746), 
    (deep belief networks, 0.0918758372580297), 
    (human expert performance, 0.0912476607542944)
]

So it seems that you can get pretty good results using the TextRank algorithm.

Extractive summarization methods, however, have a pretty significant shortcoming. Even though they produce a shortened version of the text we want to summarize, extractive summarization methods are limited by the fact that they do not really summarize the text in the classical sense. Instead of creating new text (new data) that summarizes the information contained in the original text, extractive summarization methods simply return a modified version of the original text, with the sentences deemed unimportant removed.

You can do better than this using abstractive summarization methods.

 

How to Use Abstractive Summarization

I've already discussed extractive summarization, which simply tries to identify the most important sentences in a given text document and uses those in the text summary. I also showed some example code using TextRank, one of the more promising extractive summarization algorithms.

Next, I will talk about abstractive summarization.

Instead of just rewriting parts of the original text document, abstractive summarization methods mimic humans by creating completely new sentences to describe key concepts from the original text document. These new sentences can often use new words, not present in the original text, and aim to contain just the core information, with everything unimportant removed.

Abstractive summarization techniques take a more human-like approach to text summarization, so it is no surprise that they primarily rely on deep learning models. While these models initially used RNN-based (recurrent neural network) architectures, more recently the models that have taken over the world of natural language processing are the so-called transformers.

 

What Are Transformers in Abstractive Summarization?

Transformers are a family of architectures built to transform an input sequence into an output sequence using a special encoder-decoder architecture. The special thing about transformers is the inclusion of a "self-attention" mechanism and a few other modifications such as positional encoding. We won't go into too much detail about transformers, as it is a topic that would require a dedicated article to cover in depth. For technical details, you can check out this paper.

For now, it is enough for you to know that transformers can, given some input text, generate completely new text. In the case of abstractive summarization, transformers take the original text as the input and generate the summary text as the output.

Text summaries created using transformers are usually of high quality, and because the model generates them from scratch, they consist of original sentences. The fascinating thing is that the entire procedure - reading the original text document, paying attention to certain things, and in the end generating new text - pretty much mimics how humans create summaries of text documents.

Transformers have a wide array of applications and I won't look into them too much here. It is important to know, however, that not all transformers are meant to be used for text summarization.

In this article, I want to draw your attention to a relatively new model called PEGASUS, which appears to be near the very top in terms of output quality when it comes to text summarization.

 

How Does the PEGASUS Model Work?

PEGASUS is similar to other transformer models. The main difference comes from a special technique used during the model pre-training. During PEGASUS pre-training, the most important sentences in the training text corpora are masked (hidden from the model). The model is trained to generate these sentences as one output sequence.
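To make the pre-training objective a bit more tangible, here is a toy sketch of the gap-sentence idea. It uses a crude word-overlap score instead of the ROUGE-based scoring used by the PEGASUS authors, and the mask token and selection ratio are purely illustrative:

# Toy sketch of PEGASUS-style gap-sentence selection (illustrative only)
def select_gap_sentences(sentences, mask_ratio=0.3):
    def overlap(sentence, rest_tokens):
        tokens = set(sentence.lower().split())
        return len(tokens & rest_tokens) / (len(tokens) + 1e-9)

    # Score each sentence by how much it overlaps with the rest of the document
    scores = []
    for i, sentence in enumerate(sentences):
        rest = set(
            " ".join(s for j, s in enumerate(sentences) if j != i).lower().split()
        )
        scores.append((overlap(sentence, rest), i))

    # Mask the highest-scoring ("most important") sentences
    n_masked = max(1, int(len(sentences) * mask_ratio))
    masked = {i for _, i in sorted(scores, reverse=True)[:n_masked]}

    # The model sees the masked document (source) and learns to generate the target
    source = " ".join("[MASK]" if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target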

It turns out that this ability to generate important sentences from a given text is very close to what is needed for successful abstractive summarization. The creators of the model trained it on a very large number of web pages and news articles. However, the model can be further fine-tuned on small datasets (your own text data) and achieve very good results on domain-specific text.

Aside from achieving state-of-the-art results, PEGASUS and other transformer models are very practical to use. Not only can they be fine-tuned for better results on domain-specific text, if needed, but they can also achieve great results out of the box, with no fine-tuning at all.
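As a rough illustration of what fine-tuning on your own data might look like with the transformers library, here is a minimal sketch using Seq2SeqTrainer. The tiny in-memory dataset, the training arguments, and the output directory are placeholder assumptions; a real fine-tuning run needs a proper set of (document, summary) pairs and careful hyperparameter choices:

import torch
from transformers import (
    DataCollatorForSeq2Seq,
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Placeholder data: replace with your own (document, summary) pairs
documents = ["Your long domain-specific document goes here."]
summaries = ["Your reference summary goes here."]

class SummaryDataset(torch.utils.data.Dataset):
    def __init__(self, documents, summaries):
        self.inputs = tokenizer(documents, truncation=True)
        self.labels = tokenizer(text_target=summaries, truncation=True)["input_ids"]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: values[idx] for key, values in self.inputs.items()}
        item["labels"] = self.labels[idx]
        return item

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-finetuned",  # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=SummaryDataset(documents, summaries),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()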

 

How to Use the PEGASUS Model for Abstractive Text Summarization

Fortunately, the HuggingFace community offers packages that make it very easy to work with transformers, and here you can use its transformers package. If you already have a deep learning backend such as PyTorch (which the code in this article uses) or TensorFlow 2.0 installed, installing the transformers package is straightforward:

pip install transformers

or, if you are using Anaconda or Miniconda:

conda install -c huggingface transformers

You can read more about the installation process here. You are going to start, as usual, by importing all the necessary packages and modules:

from transformers import PegasusForConditionalGeneration
from transformers import PegasusTokenizer
from transformers import pipeline

Next, let's define the model we plan on using and load the pretrained tokenizer. This article uses the pegasus-xsum model, but you can pick any PEGASUS model from the HuggingFace library:

# Pick model
model_name = "google/pegasus-xsum"

# Load pretrained tokenizer
pegasus_tokenizer = PegasusTokenizer.from_pretrained(model_name)

You can again create a variable to store the text you want to summarize. It holds the same deep learning text used in the extractive summarization example above:

example_text = """Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear perceptron cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size, which permits practical application and optimized implementation, while retaining theoretical universality under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely from biologically informed connectionist models, for the sake of efficiency, trainability and understandability, whence the structured part."""


Now you can load the model and create the tokens it will use:

# Define PEGASUS model
pegasus_model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Create tokens
tokens = pegasus_tokenizer(example_text, truncation=True, padding="longest", return_tensors="pt")

Because the model works with tokens, the summary it generates will be encoded. The full procedure is therefore to first generate the encoded summary and then decode it back into text:

# Summarize text
encoded_summary = pegasus_model.generate(**tokens)

# Decode summarized text
decoded_summary = pegasus_tokenizer.decode(
      encoded_summary[0],
      skip_special_tokens=True
)

The final summary is available in the decoded_summary variable. If you print it out, you get the following:

Deep learning is a branch of computer science that deals with the study and training of machine learning.

This is a very short summary. If you want or need to, you can customize the length of the summary. To do so, let's start by defining a summarization pipeline:

# Define summarization pipeline 
summarizer = pipeline(
    "summarization", 
    model=model_name, 
    tokenizer=pegasus_tokenizer, 
    framework="pt"
)

In the code above, model_name and pegasus_tokenizer are the same variables defined earlier. Next, you create the summary by using this summarization pipeline.

This time, you will specify a minimum and maximum length for the summary:

# Create summary 
summary = summarizer(example_text, min_length=30, max_length=150)

And finally, you can check out the text summary produced:

summary[0]["summary_text"]

Using the sample text and code above, the new summary you get is:

Deep learning is a branch of computer science which deals with the study and training of complex systems such as speech recognition, natural language processing, machine translation and medical image analysis. Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

This is quite an improvement over the summary produced using the TextRank algorithm.

 

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.