How to Process Text Data with NLTK and Python

A guide to text processing using the Natural Language Toolkit Library.
By Boris Delovski • Updated on Jul 13, 2023

Tools such as ChatGPT have made it incredibly easy for anyone to utilize powerful and complex Deep Learning models for text data analysis, without the need to understand natural language processing (NLP) as a field. However, if your goal is to dive deeper into NLP, it is best to start with the fundamentals before moving on to the deep learning models. The NLTK library, one of the most popular tools for NLP, serves as an excellent starting point and introduction to the field for beginners.

 

 

What Is NLTK?

NLTK, short for Natural Language Toolkit, is widely recognized as one of the most influential and robust Python libraries for text data processing. Created more than two decades ago, it offers users a comprehensive set of tools for seamless analysis and manipulation of human language data. Due to its extensive functionality, it remains a staple in the toolkit of most professionals working in natural language processing, and it frequently serves as a gateway to essential concepts in the field, making it the go-to library for students learning NLP as well.

 

What Are the Advantages of NLTK?

There are many advantages to NLTK, hence its popularity. The main advantages of NLTK are:

 

  • versatility
  • compatibility
  • availability

NLTK is a highly versatile library capable of handling various aspects of text analysis. This makes it a valuable tool for working with language data. It covers a wide range of NLP tasks such as:

 

  • tokenization
  • parsing
  • part-of-speech (POS) tagging
  • stemming
  • lemmatization

NLTK seamlessly integrates with other popular data processing and machine learning libraries, including NumPy, Pandas and Scikit-Learn. This compatibility allows users to leverage NLTK's capabilities within their existing workflows.
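
For instance, because NLTK tokenizers return plain Python lists, they slot directly into a Pandas workflow. Here is a minimal sketch (the DataFrame, its column name, and the example sentences are made up purely for illustration):

# A minimal sketch: tokenize a small Pandas column with NLTK
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, only needed once

# Hypothetical example data
df = pd.DataFrame({"review": ["Great service.", "The food arrived cold."]})
df["tokens"] = df["review"].apply(word_tokenize)
print(df["tokens"].tolist())
# [['Great', 'service', '.'], ['The', 'food', 'arrived', 'cold', '.']]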

 

It is worth mentioning that one of the main reasons NLTK is so popular is that it is an open-source library. Its code is freely available to anyone who wants to use it, modify it, or even distribute it. This matters because anyone can use NLTK regardless of their budget, and because its open-source nature allows a large community of developers to contribute to its development and keep the library up to date.

Finally, NLTK is accompanied by an excellent book called "Natural Language Processing with Python", also known as the "NLTK Book", which was created by some of the main contributors to the NLTK project. This free book is widely considered to be one of the best books for beginners on the topic of NLP, so I recommend you check it out.

 

What Are Some Disadvantages of NLTK?

Nonetheless, there are also some disadvantages to NLTK. The most common problems developers have with NLTK are:

 

  • its relative inefficiency compared to other libraries
  • its relative complexity, and
  • the fact that it is a bit dated

NLTK is a very popular and powerful tool, but it is not the most efficient one. While it is one of the best choices if you are just starting out in the field of NLP, it may not be the best option for optimizing a text processing pipeline. For more advanced NLP use cases, libraries like SpaCy are a better choice, as they can perform the same tasks much faster.

The versatility of NLTK can also be a drawback. While The NLTK Book is a good introduction, the sheer breadth of the library can still make it hard for beginners to navigate.

Finally, NLTK doesn't include the most advanced deep learning models, which often outperform older methods by a large margin. For example, models based on the Transformer architecture are available in other libraries, such as SpaCy, but not in NLTK. This makes NLTK somewhat dated in comparison to those other libraries.

 

 

Basics of Processing Natural Language

Processing natural language includes not only processing text data, but also processing spoken and signed languages. In this article, we will focus on explaining the basics of text data processing.

Text data is generally processed by breaking it down into its basic characteristics. There are many different types of characteristics we could analyze, but the three most important types of characteristics are:

 

  • lexical characteristics
  • syntactic characteristics
  • morphological characteristics

Lexical characteristics are extracted by analyzing the words themselves, alongside how they are used. This includes things like word meanings, similar and opposite words, and common phrases. The analysis also looks at how words are chosen and used in various situations.

Syntactic characteristics refer to the rules and patterns governing the arrangement and combination of words to form grammatically correct sentences. This includes the study of sentence structure, word order, parts of speech, and the relationships between different elements in a sentence.

Morphological characteristics refer to the internal structure and formation of words. Put simply, we break down words into their root word, and the various prefixes and suffixes that we add to that root word. Performing a morphological analysis also includes analyzing how words change to show things like tense or plural form.

Let's demonstrate how we can extract these three types of characteristics using NLTK.

 

How to Perform Lexical Analysis Using NLTK

Performing a lexical analysis is essentially segmenting a text into lexical expressions. In general, the process of separating text into elements that hold some meaning is called tokenization. Tokens are most often one of the following:

 

  • words
  • numbers
  • punctuation marks

However, we can also use multiple words to define a token. This is something we actually do very often, because different combinations of words hold different meanings, even though they consist of the same words. 

For example, even though the phrases "boat metal" and "metal boat" represent completely different things, they consist of the same words. For this reason, it is sometimes advisable to consider combinations of words as tokens, instead of treating each word as a separate token. Separating text data into tokens is a task we can achieve very easily with NLTK. The exact functions and methods we will use vary depending on what we want our tokens to be.

 

How to Separate Text Into Words

To separate text into tokens, where each one of those tokens is a single word, we can use the word_tokenize() function. We just need to give the function our text as input, and it will return a tokenized version of the text. Let's demonstrate.

First, I am going to import the function from NLTK.

# Import what we need
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # download the tokenizer models if you haven't already (only needed once)

Then, I will create some text data.

# Create example text data
text = "The dog sits on the porch."

Finally, let's use the word_tokenize() function to separate our text into words.

# Perform tokenization
tokenized_text = word_tokenize(text)

The result of separating our text using the word_tokenize() function will look like this:

['The', 'dog', 'sits', 'on', 'the', 'porch', '.']

If you take a look at the result, you will notice that we successfully split our text into a list of words. However, the word_tokenize() function also separated the punctuation mark as a separate token. In practice, we would remove punctuation marks and special symbols in the data cleaning phase, so this wouldn't be an issue. However, if we wanted to more precisely define what a token is, we can use the RegexpTokenizer.

Under the hood, NLTK relies on regular expressions to perform word tokenization, using a default pattern. If you want more control, you can use the special RegexpTokenizer class instead and provide your own regular expression to customize the way text is tokenized. Let's demonstrate how we can tokenize our sentence so that the punctuation mark is not treated as a token.

 

First, I will import the tokenizer:

# Import the tokenizer
from nltk.tokenize import RegexpTokenizer

Next, I will create the tokenizer, defining the regular expression it is going to use to recognize what counts as a token.

# Define the tokenizer parameters
tokenizer = RegexpTokenizer(r"[\w']+")

Finally, I can tokenize my text using the RegexpTokenizer that I just created.

# Tokenize text data
tokenized_text = tokenizer.tokenize(text)

After tokenization, my result is going to look like this:

['The', 'dog', 'sits', 'on', 'the', 'porch']

As you can see, we successfully separated our text into a list of words.
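
Alternatively, if you prefer to stick with the default word_tokenize() output, you can simply filter the punctuation tokens out afterwards. Here is a quick sketch of one simple approach, keeping only tokens that contain at least one letter or digit:

# Filter punctuation out of the default word_tokenize() output
tokens_with_punctuation = word_tokenize(text)
tokens_without_punctuation = [token for token in tokens_with_punctuation if any(ch.isalnum() for ch in token)]
print(tokens_without_punctuation)
# ['The', 'dog', 'sits', 'on', 'the', 'porch']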

 

How to Separate Text Into Sentences

Before analyzing each sentence separately, we often need to separate a large chunk of text into sentences. The way we do this is very similar to how we separate text into words. The key distinction is that, in this case, we will use the sent_tokenize() function to separate our text into sentences. Let's demonstrate.

First, I am going to import the function.

# Import what we need
from nltk.tokenize import sent_tokenize

Then, I am going to define some example text data. In this case, it is going to be a multiline string that consists of three sentences.

# Define example text data
sentences = """Machine Learning can usually be divided into 
                          classic Machine Learning and Deep Learning. 
                          Classic Machine Learning is relatively easy to understand. 
                          Deep Learning on the other hand is a bit more complex."""

After defining some example text data, I can use the sent_tokenize() function to separate my text into sentences.

# Tokenize text data
tokenized_sentences = sent_tokenize(sentences)

After tokenization, my result will be a list that looks like this:

['Machine Learning can usually be divided into classic Machine Learning and Deep Learning.', 'Classic Machine Learning is relatively easy to understand.', 'Deep Learning on the other hand is a bit more complex.']

As you can see, we successfully separated our original text, which consisted of three sentences, into a list of three members where each member is one of the sentences.
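
If you need both levels of granularity, the two tokenizers combine naturally: first split the text into sentences, then split each sentence into words. A short sketch using the list we just created:

# Tokenize each sentence into words, producing one list of words per sentence
words_per_sentence = [word_tokenize(sentence) for sentence in tokenized_sentences]
print(words_per_sentence[1])
# ['Classic', 'Machine', 'Learning', 'is', 'relatively', 'easy', 'to', 'understand', '.']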

 

How to Separate Text Into Phrases

Grasping grammar or context from single words alone can be hard. As I mentioned earlier, it is sometimes a much better idea to use phrases instead of single words as tokens. In the field of NLP, we use the term n-gram to describe how many words a token consists of. So if our tokens are single words, we say that we are working with unigrams. If they consist of two words, we say that we are working with bigrams, and so on.

To separate text data into n-grams, we must first tokenize it into words. To demonstrate, let's use the sentence we already separated into words earlier, the one stored inside the tokenized_text variable. To separate it into, for example, bigrams, I first need to import the ngrams function from NLTK:

# Import what we need
from nltk import ngrams

Next, I need to define what my n is going to be. Because I want to separate the data into bigrams, phrases of two words, I will define my n as 2. Finally, I can separate my text into bigrams.

# Define n
n = 2

# Separate text into bigrams
bigrams = ngrams(tokenized_text, n)

Note that, unlike the previous examples, where the functions returned a list as a result, the ngrams function returns a generator. A generator is evaluated lazily, which keeps memory usage low when processing large amounts of text. However, it also means that I will need to extract the bigrams from the generator before I can use them. This can be done using a simple for loop:

# Create an empty list where we will store the bigrams
bigram_list = []

# Store bigrams inside the list
for phrase in bigrams: 
    bigram_list.append(phrase)

Now, if we check what is currently stored inside the list, we will get the following result:

[('The', 'dog'), ('dog', 'sits'), ('sits', 'on'), ('on', 'the'), ('the', 'porch')]

As you can see, we successfully separated our sentence into a list of tuples, where each tuple represents a phrase that consists of two words from our sentence.
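
The same function handles any value of n. For example, passing 3 produces trigrams, and wrapping the generator in list() is a shorter alternative to the explicit loop above:

# Separate the text into trigrams and materialize the generator with list()
trigrams = list(ngrams(tokenized_text, 3))
print(trigrams)
# [('The', 'dog', 'sits'), ('dog', 'sits', 'on'), ('sits', 'on', 'the'), ('on', 'the', 'porch')]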

 

How to Perform Syntactical Analysis Using NLTK

Performing a syntactical analysis using NLTK can include many things, but the most common procedure we perform is part-of-speech (POS) tagging. POS tagging is the process of assigning grammatical tags to each word in a given sentence. The tags typically represent the syntactic category of a word, such as noun, verb, adjective or adverb. POS tagging plays a vital role in various text-processing tasks, because it enables our models to better understand how words inside a sentence interact with each other. 

In the background, the current default POS tagging algorithm in NLTK is the averaged perceptron tagger. This algorithm is an extension of the standard perceptron tagger, a machine learning-based tagger that utilizes a linear classifier to predict part-of-speech tags for words. The extension adds an averaging step over the weights learned during training, which helps improve overall performance and reduce overfitting. Let's demonstrate how we can use the POS tagger to tag our text data.

To use the tagger to POS tag our text data, we first need to import the pos_tag() function from NLTK.

# Import the tagger from NLTK
import nltk
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')  # download the tagger model (only needed once)

Now, let's create some example text data. The tagger from NLTK expects a list of strings as input. In practice, this means that you always need to do tokenization before POS tagging.  To demonstrate, I will create an example list of strings to use as input to the tagger.

# Create an example list of strings
# that represents a tokenized sentence
text_to_tag = ['Life', 'is', 'what', 'happens', 'when', 'you', 'are', 'busy', 'making', 'other', 'plans']

Finally, let's perform some POS tagging. To do so, we just need to enter our list of strings as input to the tagger from NLTK:

# Perform POS-tagging
tagged_text = pos_tag(text_to_tag)

Let's take a look at the result.

[('Life', 'NNP'), ('is', 'VBZ'), ('what', 'WP'), ('happens', 'VBZ'), ('when', 'WRB'), ('you', 'PRP'), ('are', 'VBP'), ('busy', 'JJ'), ('making', 'VBG'), ('other', 'JJ'), ('plans', 'NNS')]

As you can see, the POS tags are fairly complex. You can find their full explanation by looking into the documentation of NLTK. For example, the tag 'VBZ' says that the word is a verb in the present tense, third-person singular form. These tags are extremely important, especially for the task of lemmatization, which is a frequent task in morphological analysis.
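
If you don't feel like digging through the documentation, NLTK can also print the description of a tag for you. This relies on the 'tagsets' resource, which may need to be downloaded first:

# Print the description of a Penn Treebank tag
import nltk

nltk.download('tagsets')       # tag descriptions, only needed once
nltk.help.upenn_tagset('VBZ')  # prints the definition of 'VBZ' together with examples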

 

How to Perform Morphological Analysis Using NLTK

Morphological analysis primarily focuses on analyzing the internal structure of words. A morpheme is the smallest element with independent meaning inside a word. A great example of a word that is itself a morpheme is the word "run". On the other hand, the word "runner" consists of the root word "run" and the suffix "er".

If you are using a deep learning model, then the distinction between the root word and the words that we get by adding prefixes or suffixes is not that important, because the data for these models is typically encoded in the form of word embeddings. However, when working with standard algorithms and even classic machine learning models, the suffixes and prefixes are not only useless but can even decrease the performance of these algorithms. In those cases, we must do one of the two following tasks before feeding the data into the model:

 

  • stemming
  • lemmatization

Both of these tasks aim to do the same thing, with slight differences in execution. Let's explain how each of them works.

 

How to Do Stemming

Stemming is the process of reducing some word to its root - that is, to its base morpheme. For example, the word "working" would be reduced to the word "work", and the word "likely" to the word "like". As I mentioned earlier, we do stemming because, when using certain machine learning models, the ending of a word is usually not important to its overall meaning. Incidentally, it also makes sure our model will be somewhat robust to spelling errors.

When stemming our data, we need to be careful not to run into the following two pitfalls (a short example follows the list):

 

  • over-stemming - when we remove more from a word than is needed to get to its morpheme
  • under-stemming - when we don't reduce a word to its morpheme because we didn't remove enough
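
To make these pitfalls concrete, here is a small illustrative sketch using the PorterStemmer, which is introduced in detail below. The word pairs are classic textbook examples rather than words from our sentence:

# Illustrate the two stemming pitfalls with the PorterStemmer
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: two words with different meanings collapse to the same stem
print(porter.stem("university"), porter.stem("universe"))  # univers univers

# Under-stemming: two closely related words end up with different stems
print(porter.stem("datum"), porter.stem("data"))           # datum data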

In NLTK there are multiple popular algorithms we can use to perform stemming. The two most popular algorithms are:

 

  • PorterStemmer
  • SnowballStemmer

The PorterStemmer is probably the most well-known algorithm for stemming. It is also very fast and efficient. However, it is limited to the English language. Also, it can be a bit lenient in its stemming decisions, which means that it might stem words to a form that doesn't actually exist in the English language. Counterintuitively, this is usually not a big problem, because the stems are still linguistically related.

The SnowballStemmer, sometimes also called the Porter2Stemmer, is a more advanced algorithm. It is not limited to the English language, and it is also much stricter when stemming. This means that it actively tries to avoid creating stems that are not actual words. Unfortunately, it isn't always successful in doing so, and it will often produce results similar to those of the PorterStemmer. What is more, the advantages of the SnowballStemmer are somewhat offset by the fact that it is slower than the PorterStemmer algorithm, making it a less-than-ideal choice for large-scale text processing.
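
As a side note, if you want to check which languages the SnowballStemmer supports, the class lists them in its languages attribute:

# List the languages supported by the SnowballStemmer
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)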

The choice of stemmer depends on the problem you are working on. To make sure that you can tackle any problem you run into, let's demonstrate how to use both of these stemmers.

First, let's import the two stemmers:

# Import the stemmers 
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

Next, I need to define the two stemmers. When defining the PorterStemmer we can just create an instance of the PorterStemmer class, but when defining the SnowballStemmer we need to specify the language our text data will be in.

# Define the stemmers we will use
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer(language='english') 

The stemmers from NLTK take input strings that represent single words. This means that, if we want to stem a sentence, we need to first tokenize that sentence into a list of strings, then loop over that list and stem each word separately. Let's stem the words inside the text_to_tag list of words, the list we used earlier to demonstrate POS tagging.

First, I will demonstrate how the PorterStemmer works:

# Create an empty list we will populate with words 
# stemmed with the PorterStemmer algorithm
stemmed_text_porter_stemmer = []

# Stem the words using the PorterStemmer algorithm
for word in text_to_tag:
    stemmed_word = porter_stemmer.stem(word)
    stemmed_text_porter_stemmer.append(stemmed_word)

The result we get after stemming the words with the PorterStemmer algorithm looks like this:

['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']

Let's compare that to what we get when we use the SnowballStemmer algorithm:

# Create an empty list we will populate with words 
# stemmed with the SnowballStemmer algorithm
stemmed_text_snowball_stemmer = []

# Stem the words using the SnowballStemmer algorithm
for word in text_to_tag:
    stemmed_word = snowball_stemmer.stem(word)
    stemmed_text_snowball_stemmer.append(stemmed_word)

The result we get after stemming the words with the SnowballStemmer algorithm looks like this:

['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']

In this case, the results we got using the two different stemmers are identical. As you might have noticed, both of the stemmers even lowercase the words before stemming them, something that is common practice in text preprocessing. This is to avoid having algorithms treat uppercase and lowercase versions of the same word as two different words. However, both of the stemmers also stem the word "busy" to "busi" - that is, they reduce it to a word that doesn't exist in the English language. If we want to make sure that we don't run into this situation, we would need to do lemmatization instead of stemming.

 

How to Do Lemmatization

NLTK's lemmatization algorithm utilizes the WordNet lexical database, which is a comprehensive resource for the English language. WordNet organizes words into synsets, or groups of synonyms, and offers valuable information about their semantic connections.

The way lemmatization works is actually quite simple. Instead of removing parts from words, it will take a look at the word and its POS tag and, based on that information, pick a version of that word from the WordNet dictionary. These dictionary word forms are called lemmas. In practice, this means that, as long as we perform POS tagging before lemmatization, the results we get from lemmatization will be better than the results we'd get from stemming. Theoretically, we can do lemmatization without performing POS tagging, though this is not recommended. If we don't supply the lemmatizer with a POS tag for a word, its default behavior is to treat the word as a noun. This means that lemmatizing a word without entering its POS tag will often lead to a word being lemmatized incorrectly. Let's demonstrate how we lemmatize text in NLTK. 
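
To see why the POS tag matters, here is a quick self-contained sketch of that default behavior (we will create the lemmatizer properly as part of the full walkthrough below):

# A quick look at the lemmatizer's default behavior
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data, only needed once

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           # 'running' - treated as a noun by default
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' - lemmatized correctly as a verb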

We will reuse the same text we worked with before, so we can also reuse the POS tags we created before. One peculiarity of NLTK is that, even though the POS tagger creates very detailed POS tags based on the Penn Treebank tagset, the POS tags that the lemmatizer uses are much simpler. Instead of recognizing different versions of, say, nouns or verbs, the lemmatizer only requires us to separate our words into the following four groups:

 

  • nouns
  • verbs
  • adjectives
  • adverbs

This is easy to do. The first letter in the tags created by the tagger denotes one of these four main groups, and the other letters define the specifics. In other words, to convert our POS tags from their original format to the one that the WordNet lemmatizer expects, we can create a function that will read the first letter of the original tags. Then, based on that first letter, the function will replace the tag assigned by the tagger with the appropriate tag that the lemmatizer expects. Let's import WordNet and create the function that will convert our tags from one format to the other:

# Import what we need
from nltk.corpus import wordnet

# Create function that converts tags 
# from the Treebank format to the WordNet format
def convert_pos_tags(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
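
A quick sanity check shows how the function behaves for a verb tag and for a tag that doesn't map to any of the four groups:

# Quick check of the conversion function
print(convert_pos_tags('VBZ') == wordnet.VERB)  # True
print(convert_pos_tags('PRP'))                  # None, pronouns have no WordNet tag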

Next, I will create the lemmatizer.

# Import what we need
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

Now that everything we need to do lemmatization is ready, we can create an empty list and fill it by running a for loop that will do the following in each iteration:

 

  • iterate over a pair that consists of a word and its tag
  • convert the POS tag to a format recognized by WordNet using the function we defined earlier
  • if the converted POS tag is not available (None), lemmatize the word using the default POS tag (noun)
  • append the lemmatized word to the list of lemmatized words

The code for this process looks like this:

# Create an empty list where we will store lemmatized words
lemmatized_text_wordnet = []

# Convert POS tags
# and lemmatize using the WordNet algorithm
for word, tag in tagged_text:
    wordnet_tag = convert_pos_tags(tag)
    if wordnet_tag is None:
        lemmatized_word = lemmatizer.lemmatize(word) 
    else:
        lemmatized_word = lemmatizer.lemmatize(word, pos=wordnet_tag) 
    lemmatized_text_wordnet.append(lemmatized_word)

The code above will create the following list of strings:

['Life', 'be', 'what', 'happen', 'when', 'you', 'be', 'busy', 'make', 'other', 'plan']

As you can see, all of the words we end up with in our list are real words. On the other hand, the lemmatizer won't lowercase text data on its own, so if you want your data to be lowercase you will have to do it separately after performing lemmatization.
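
If you do want everything in lowercase, one extra pass over the lemmatized list is all it takes:

# Lowercase the lemmatized words in a separate step
lemmatized_lowercase = [word.lower() for word in lemmatized_text_wordnet]
print(lemmatized_lowercase)
# ['life', 'be', 'what', 'happen', 'when', 'you', 'be', 'busy', 'make', 'other', 'plan']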

To wrap up, the Natural Language Toolkit (NLTK) library is an invaluable tool in the world of text and language analysis. It simplifies tasks such as breaking down some text data into words or tokens (tokenization), identifying a word's role in a sentence (POS tagging), and finding the root form of words (stemming and lemmatization). NLTK equips us to deal with large text data more efficiently, paving the way for deeper understanding and more nuanced analysis of language. If you're delving into language or data analysis, NLTK is an essential tool to have at your disposal.

 

 

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.