The Ultimate Guide to Data Processing with Python

Most data analytics experts prefer to use Python to work with data. Python's popularity can be attributed to two libraries in particular: the NumPy library and the Pandas library.
By Boris Delovski • Updated on Jan 3, 2023

With how much data you have at your disposal today, it’s more important than ever to know how to process it as efficiently and as effectively as possible. Nowadays, most data analytics experts prefer to use Python to work with data. This sudden popularity of Python can be attributed to two libraries in particular: the NumPy library and the Pandas library.

If you want to be an expert at processing data in Python, you must be adept at using both libraries, even though you will often see people mention only Pandas online. Moreover, it is a much better idea to start your journey toward mastering data processing with NumPy.

The NumPy library is what the Pandas library is built on top of, which means that knowledge of NumPy will not only carry over when you start learning Pandas, but will also help you make the data processing pipelines you build as efficient as possible. After learning NumPy you can start learning Pandas. This process will have a much gentler learning curve than it otherwise would, because you will understand what is happening in the background when you use various methods to import, process, and export data.

 

The Ultimate Guide to NumPy

NumPy stands for Numerical Python. It is the fundamental package for scientific computing and data analysis in Python. One of the biggest mistakes almost every beginner makes when learning data processing is skipping NumPy. That can be attributed to the extreme popularity of Pandas. But in reality, learning Pandas without learning NumPy is a bad idea: Pandas is built on top of NumPy, so to understand how Pandas works, you essentially need to understand how NumPy works. Aside from Pandas, many other higher-level packages also use NumPy.

The entirety of NumPy can be explained by focusing on four major topics, which each merit an article on their own:

 

 

Introduction to NumPy

Before going deeper into what you can do with NumPy you must first learn the following:

 

 

What is NumPy

One of the first things programmers learn is that there is no need to code everything from scratch. A plethora of different packages and libraries are available and can be reused for your own purposes. NumPy is the premier package in Python for scientific computing and data analysis.

 

Why Do You Use NumPy

The next step after learning more about what NumPy is as a package is to understand why programmers prefer using NumPy over other packages. This mainly boils down to how easy NumPy is to use, how popular it is in the community and last but not least how well optimized it is. Once you understand these three main reasons behind why programmers prefer using NumPy, you will be ready to start learning about how NumPy works.

 

What Are NumPy ndarrays

After learning what NumPy is and the motivation for using it, you'll need to learn about the most important part of NumPy: ndarrays. More precisely, to use the full potential of NumPy you need to understand the following:

 

 

 

What is an Ndarray

Ndarrays are the basic building blocks of NumPy. They are optimized containers for data that NumPy uses instead of other common containers, such as lists, and they are preferred because of their flexibility, speed, and space efficiency.

 

How Do You Create Ndarrays

It is very important to be able to integrate data processing pipelines into the rest of your code. Therefore, you need to know not only how to create an ndarray from scratch, but also how to convert other data types, such as lists and tuples, into ndarrays and vice versa.
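As a minimal sketch of both directions, the example below creates arrays from scratch and converts a list and a tuple into ndarrays and back. The variable names are only illustrative.

```python
import numpy as np

# Create arrays from scratch.
zeros = np.zeros(3)            # three 0.0 values
evens = np.arange(0, 10, 2)    # 0, 2, 4, 6, 8

# Convert a list and a tuple into ndarrays.
from_list = np.array([1, 2, 3])
from_tuple = np.array((4.5, 5.5))

# Convert an ndarray back into a plain Python list.
back_to_list = from_list.tolist()
print(back_to_list)  # [1, 2, 3]
```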

 

What Are the Characteristics of Ndarrays

Once you know what an ndarray is and how you can create one, the next step is to learn what the most important characteristics of ndarrays are. Understanding concepts such as the dimensionality of an ndarray and its size and shape is of the utmost importance if you plan on using ndarrays for data processing.
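These characteristics are exposed as attributes on every ndarray. A short sketch, using a small made-up 2D array:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)   # 2 -> number of dimensions
print(matrix.shape)  # (2, 3) -> 2 rows, 3 columns
print(matrix.size)   # 6 -> total number of elements
print(matrix.dtype)  # an integer type; the exact one depends on the platform
```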

 

How To Select Data in NumPy

Working with data in NumPy is relatively straightforward once you understand what ndarrays are and what their characteristics are. At that point, processing data with NumPy can be broken down into learning:

 

 

How to Index and Slice Ndarrays

Indexing and slicing ndarrays is something you will learn very quickly, as there is a lot of carryover from indexing and slicing lists. There are some differences, but in general the same ideas of positive indexing, negative indexing, and using indices to create slices apply when you are working with ndarrays.
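The sketch below shows the list-like behavior first and then one of the differences: a multidimensional array accepts one index per dimension inside a single pair of brackets. The arrays are made up for illustration.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])

first = values[0]      # 10, positive indexing
last = values[-1]      # 50, negative indexing
middle = values[1:4]   # array([20, 30, 40]), stop index is excluded

# Unlike nested lists, a 2D array takes a row and a column index together.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
item = grid[1, 2]      # 6, second row, third column
column = grid[:, 0]    # array([1, 4]), first column of every row
```

Another difference worth keeping in mind is that a basic slice of an ndarray is a view on the original data rather than a copy, so modifying the slice also modifies the original array.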

 

How to Use Boolean Arrays, Boolean Indexing and Filtering

Filtering data stored inside ndarrays is based upon using boolean indexing and boolean arrays. Using conditional statements you can create boolean arrays, which you can later use to filter data stored in a non-boolean array.
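A minimal sketch of that idea, with made-up temperature values: the comparison produces a boolean array, and that array is then used as an index.

```python
import numpy as np

temps = np.array([12, 25, 31, 8, 19])

# The condition is evaluated element by element and yields a boolean array.
mask = temps > 18
print(mask)         # [False  True  True False  True]

# Using the boolean array as an index keeps only the True positions.
warm = temps[mask]
print(warm)         # [25 31 19]
```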

 

How to Modify Data

Learning how to modify data is pretty easy once you know how to work with boolean arrays and how to filter data, because modifying data stored in ndarrays boils down to putting a boolean array or a slice on the left-hand side of an assignment. Essentially, after filtering data you can simply assign new values to the elements that the filter selected.
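For example, assuming a small array of made-up sensor readings, assignment through a boolean array and through a slice could look like this:

```python
import numpy as np

readings = np.array([5, -1, 7, -3, 2])

# Replace every negative reading with 0 using a boolean array.
readings[readings < 0] = 0
print(readings)     # [5 0 7 0 2]

# Assign through a slice: overwrite the first two elements.
readings[:2] = 99
print(readings)     # [99 99  7  0  2]
```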

 

How to Use NumPy for Math and Statistics

Once you understand what ndarrays are, how to create them, and how to manipulate them, you can finally use them to create data processing pipelines. These pipelines can be as complex as you want, but their overall structure depends on what type of operation you want the pipeline to perform. Therefore, it is important that you learn the following:

 

 

How to Perform Simple Operations

Before moving on to more complex operations you need to first master the simpler ones. Adding elements to arrays, adding two arrays together and other similar operations form the backbone of the more complex operations you will perform with ndarrays, so they are what you should focus on first.
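A brief sketch of the two simple operations mentioned above, using illustrative values. Note that np.append() returns a new array and leaves the original untouched.

```python
import numpy as np

a = np.array([1, 2, 3])

# Add an element to an array; the result is a new array.
extended = np.append(a, 4)
print(extended)   # [1 2 3 4]
print(a)          # [1 2 3]

# Add two arrays of the same shape element by element.
total = a + np.array([10, 20, 30])
print(total)      # [11 22 33]
```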

 

How to Perform Vectorized Operations

These operations demonstrate the full power of NumPy and are a big part of why NumPy is so popular. A vectorized operation is applied to a whole array at once instead of element by element in Python code, which is far more efficient than using for loops. Learning how to use vectorized operations to optimize your code is therefore of the utmost importance.
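To make the contrast concrete, here is a small sketch that computes the same result with a Python loop and with a vectorized expression. The prices and the 1.2 multiplier are made up for illustration.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])

# Loop version: Python handles each element one at a time.
looped = [price * 1.2 for price in prices]

# Vectorized version: one expression applied to the whole array.
vectorized = prices * 1.2

# Both approaches produce the same values.
print(np.allclose(looped, vectorized))  # True
```

On arrays this small the difference is negligible; the benefit of the vectorized form shows up as the arrays grow.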

 

How to Perform Math Operations

Statistics are based on performing various mathematical operations, so it is important to first focus on learning how to perform mathematical operations with NumPy. This includes operations such as multiplying the elements of some array, adding a value to all of the elements in an array, multiplying two arrays and much more.
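The operations listed above can be sketched as follows, with two small made-up arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

plus_five = a + 5      # add a value to every element: [6, 7, 8]
elementwise = a * b    # multiply two arrays element by element: [10, 40, 90]
dot = a @ b            # dot product: 1*10 + 2*20 + 3*30 = 140
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))  # [1.0, 2.0, 3.0]
```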

 

How to Perform Statistical Operations

The goal of each data processing pipeline is to gain some knowledge about patterns that appear in the data you are processing. The best way to gain insight into the information that some data holds is to perform statistical operations, such as calculating the mean, the standard deviation, and other summary statistics.
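As a short sketch with made-up scores, these statistics are each a single function call. Keep in mind that np.std() computes the population standard deviation by default (ddof=0).

```python
import numpy as np

scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(scores)      # 5.0
std = np.std(scores)        # 2.0 (population standard deviation)
median = np.median(scores)  # 4.5
```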

 

How to Use Conditional Logic to Transform Arrays

The last thing you should learn about working with NumPy is how to use the where() function. It is very important because it allows you to use conditional logic to transform ndarrays, and because it is widely used even with Pandas-specific data types such as Pandas DataFrames.
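A minimal sketch of this conditional logic: np.where() takes a condition, a value to use where the condition is true, and a value to use where it is false. The ages and labels are made up for illustration.

```python
import numpy as np

ages = np.array([15, 22, 17, 40])

# Where the condition holds use "adult", otherwise use "minor".
labels = np.where(ages >= 18, "adult", "minor")
print(labels)   # ['minor' 'adult' 'minor' 'adult']
```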

 

The Ultimate Guide to Pandas

Once you learn the basics of NumPy you're ready to start learning about the most popular package for data analysis and processing in Python: the Pandas package. It has everything you need to perform various data processing and data analysis operations. There are other reasons why you'll prefer using Pandas, which I delve into in one of the articles of this series.

The Pandas library is best explained by focusing on eight major topics, which each merit an article on their own:

 

 

What is Pandas in Python

Pandas is a very complex library, so before explaining its most important functions in detail it is important to preface my explanation by focusing on:

 

 

Why Use Pandas

It is always a good idea to delve into the rationale behind using some library before trying to explain it in detail. This is especially true for Pandas; you need to understand why Pandas is the most popular library for data processing and analysis even though there are others designed to do the very same thing.

 

How to Install Pandas

Of course, before using a library you'll need to install it. If you use something such as Anaconda to manage your Python packages, check whether Pandas is already part of your environment, because depending on how the environment was created (a minimal one, for example) it may not be. If you install your packages with pip, it goes without saying that you have to install Pandas yourself. Either way, you must make sure to install everything that is needed to run this library.

 

The Basics of Pandas

Even though I'll cover Pandas Series objects and Pandas DataFrame objects in detail in later articles, it is still important to introduce their basic concepts first. Understanding how a Pandas Series object works is much easier when you know the basics of how Pandas DataFrames work, and vice versa.

 

What Are Pandas Series Objects

The basics of how Pandas Series objects work are something I already covered in a previous article; however, this is not enough. Because these objects are the essential building blocks of DataFrames, which serve as the main data containers in Pandas, learning more about them is very important.

I’ll cover the following topics:

 

 

What is a Pandas Series

Before exploring the intricacies of Pandas Series objects, it is a good idea to not only refresh our memory on what they are but also to expand upon our initial definition of them.

 

Creating a Pandas Series

One of the main reasons you'll like using Pandas Series objects is that they are very easy to create. Not only can you create them from scratch, but it is also straightforward to convert different types of Python objects into Series objects. A great example is converting a list into a Pandas Series object, but you can also convert other data types, which is something you'll need to spend a bit of time on.
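A minimal sketch of a few ways to create a Series; the values, labels, and the "weights" name are made up for illustration.

```python
import pandas as pd

# From a list: Pandas generates the default integer index 0, 1, 2.
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2})

# From scratch with an explicit index and a name.
named = pd.Series([1.5, 2.5], index=["x", "y"], name="weights")
```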

 

Accessing Data in a Pandas Series Using Simple Indexing

Knowing how to create a Pandas Series object is important, but by itself it is not that useful if you don't know how to access the data stored in such an object. The way you typically do this is by using indexing. While this works much like accessing data stored in Python lists using indices, some slight differences are worth mentioning.
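The sketch below illustrates two of those differences on a made-up Series: you can index by label as well as by position, and a label-based slice includes its end label while a position-based slice excludes its stop index.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]          # 20, looked up by index label
by_position = s.iloc[0]    # 10, looked up by integer position

# Label slices include the end label; position slices exclude the stop.
label_slice = s.loc["a":"b"]     # two elements: 10 and 20
position_slice = s.iloc[0:1]     # one element: 10
```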

 

What is a Pandas DataFrame and How to Create One

Pandas DataFrames are where you store your data when you want to process it and analyze it. Series objects are the building blocks of Pandas, but that is only because they are what Pandas DataFrames consist of, so you will often run into people referring to Pandas DataFrames as the elementary units of Pandas.

I’ll cover the following topics:

 

 

What is a Pandas DataFrame

Before going into what you can do with them, you first need to focus on what Pandas DataFrames are. To understand them as a topic one must not only be familiar with what they are, but first and foremost with why Pandas DataFrames are the preferred container for multidimensional data in Python.

 

Creating Pandas DataFrames

The best way to explain how certain objects in Python work is to focus on how you create them. Once you know how a Pandas DataFrame can be created from scratch, and how you can convert other data containers, such as arrays, into DataFrames, learning how to extract and manipulate the data stored in them becomes much easier.
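As a rough sketch, here is one DataFrame created from scratch out of a dictionary and one converted from a NumPy array. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From scratch: each dictionary key becomes a column.
df_from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# From a 2D ndarray: column names are supplied separately.
arr = np.array([[1, 2],
                [3, 4]])
df_from_array = pd.DataFrame(arr, columns=["x", "y"])
```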

 

Creating a Pandas DataFrame by Importing Data From Files

Creating DataFrames from scratch and converting other data types into Pandas DataFrames is something that you'll do from time to time, but it is not what is commonly done to create new DataFrames. In practice, DataFrames are most often created from various types of files, such as CSV files, Excel files, etc. Knowing how to load in such a file and create a Pandas DataFrame from it is therefore of the utmost importance, and should be one of the first things you learn how to do when learning Pandas.
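A small sketch of loading CSV data with pd.read_csv(). To keep the example self-contained it reads from an in-memory string instead of a file on disk; with a real file you would pass its path instead. The product data is made up.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file.
csv_text = "product,price\npen,1.5\nbook,12.0\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 2)
```

Excel files are handled in a similar way by pd.read_excel(), which needs an additional Excel engine package to be installed.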

 

How to Add, Rename, and Remove Columns in Pandas

Knowing how to create a Pandas DataFrame is of little use without knowing how to manipulate the data stored inside one. The most basic ways of manipulating data stored in any tabular container are adding new columns, renaming existing ones, and removing those that you don't plan on using in a particular project.

I will cover the following topics:

 

 

Adding Columns to a DataFrame

Very rarely do you start with a dataset that contains all the data that you need to solve some problem. Usually, you’ll need to add at least several new columns to your DataFrame during processing and analysis, so it is crucial to learn how to do that efficiently.
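A minimal sketch of two common ways to add a column: deriving it from existing columns and assigning one constant value to every row. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 2]})

# New column computed from existing columns.
df["total"] = df["price"] * df["quantity"]

# New column holding the same value in every row.
df["currency"] = "EUR"

print(list(df.columns))   # ['price', 'quantity', 'total', 'currency']
```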

 

Renaming DataFrame Columns 

People are not perfect, so in practice, it is a common occurrence to run into columns with bad names in your data. These do not necessarily need to be column names with spelling errors, but are often columns that just have names that make it very hard to understand what the data in that particular column represents. Since this is a problem that can be solved very easily it is worth touching upon how to rename columns of Pandas DataFrames.
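Renaming can be sketched with the rename() method, which takes a mapping from old names to new names. The names below are made up; note that rename() returns a new DataFrame by default and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"col1": [170, 182], "col2": [65, 80]})

# Map each unclear name to a descriptive one.
renamed = df.rename(columns={"col1": "height_cm", "col2": "weight_kg"})

print(list(renamed.columns))   # ['height_cm', 'weight_kg']
print(list(df.columns))        # ['col1', 'col2']
```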

 

Removing Columns from a DataFrame

As mentioned previously, you’ll seldom start with a perfect dataset. While this means that you will often need to add data into your DataFrame, it also means that you often need to remove data from your DataFrame. You must learn how to do both.
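Removing a column can be sketched with the drop() method. The DataFrame below is made up; as with rename(), drop() returns a new DataFrame by default.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "notes": ["", ""]})

# Remove a column that is not needed for the analysis.
trimmed = df.drop(columns=["notes"])

print(list(trimmed.columns))   # ['id', 'name']
```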

 

How to Analyze Pandas DataFrames

Performing a detailed analysis of a DataFrame is one of the most common things you will do. However, from time to time you might want to just take a quick look at what your data looks like, or take a peek at some basic statistics that describe it. Performing a basic analysis of data stored in a Pandas DataFrame boils down to just a few lines of code, which makes it easy to get an initial idea of what the data stored in a DataFrame looks like.

I’ll cover the following topics:

 

 

Taking a Look at Your DataFrame

When you store data in a Pandas DataFrame, you'll usually want to look at it immediately afterward. Jupyter Notebooks do allow you to do that without invoking any methods, but the display you'll get isn't useful when you are working with a lot of data, which is a very common occurrence. It's a good idea to take a look at the Pandas methods you can use to display parts of your data in such a way that you'll actually get an idea of what the data stored in your DataFrame looks like.
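As a small sketch, head() and tail() display only the first or last rows of a DataFrame. The 100-row DataFrame here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

top = df.head()       # first 5 rows by default
bottom = df.tail(3)   # last 3 rows

print(top)
print(bottom)
```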

 

Display Basic DataFrame Information

The goal of every data analysis is to reach conclusions and notice patterns. But it is important not to rush, and to take your time before starting the analysis to make sure that the type of data stored in each of your columns is homogeneous, that there aren't any rows with missing data, and so on. Since it is possible to get all of that information with a single line of code, it is a step that a data analyst should never skip.
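One way to get this overview is the info() method, which prints the column names, the number of non-null values per column, and the data types. The sketch below uses a made-up DataFrame with one missing value per column, and also counts the missing values explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", None],
    "age": [31.0, None, 45.0],
})

# Prints column names, non-null counts, and dtypes in one call.
df.info()

# Count the missing values in each column.
missing = df.isna().sum()
print(missing)
```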

 

Perform Basic Statistical Analysis

Similar to how you can analyze the basic characteristics of the data stored in your DataFrame, you can also perform a basic statistical analysis with a single line of code. Not performing this analysis makes you lose out on a lot of information that you would otherwise use to get an initial idea of the basic characteristics of your data.
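A sketch of that single line, using the describe() method on a made-up column of ages. For numeric columns it reports the count, mean, standard deviation, minimum, quartiles, and maximum.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40]})

summary = df.describe()
print(summary)

mean_age = summary.loc["mean", "age"]   # 30.0
std_age = summary.loc["std", "age"]     # 10.0 (sample standard deviation)
```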

 

How to Create Pivot Tables in Pandas

Every data analysis procedure sooner or later includes creating a pivot table to summarize and aggregate data in some way. Creating pivot tables in Pandas is relatively easy for anyone who has ever worked with data before because it mimics how pivot tables are created in other popular data analysis tools, such as Excel.

I’ll cover the following topics:

 

 

The pivot_table() method

The simplest way to create a pivot table is to use the method that shares the name with what you're trying to create. The way the method works is pretty straightforward because it mimics how pivot tables are created in one of the most popular data processing and analysis tools, Excel. The only difference is the syntax used to create said pivot table because in this instance you're using Python.
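A minimal sketch with made-up sales data: the index argument defines the rows of the pivot table, columns defines its columns, values is the column being aggregated, and aggfunc is the aggregation to apply.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["pen", "book", "pen", "book"],
    "revenue": [10, 20, 30, 40],
})

# One row per region, one column per product, summed revenue in the cells.
pivot = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)
print(pivot)
```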

 

The groupby() method

This method is an alternative to the aforementioned method, and for many common aggregations it leads to the same result. The main difference is in which tool the method emulates. While the pivot_table() method emulates how pivot tables are created in Excel, the groupby() method emulates how data is grouped and aggregated in SQL.
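The following sketch, on made-up sales data, computes total revenue per region with groupby() and then with pivot_table() to show that the aggregated values match in this simple case.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [10, 20, 30, 40],
})

# SQL-style: group the rows by region, then aggregate the revenue.
totals = sales.groupby("region")["revenue"].sum()
print(totals)

# The same aggregation expressed as a pivot table.
by_pivot = sales.pivot_table(index="region", values="revenue", aggfunc="sum")
print(by_pivot)
```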

 

How to Merge DataFrames

In a previous article in this series, I covered how to add single columns to your DataFrame and explained the intuition behind doing so. However, it is worth mentioning that you will often need to add not just a single column, but multiple columns. Because having multiple columns means that you're working with a DataFrame object and not with a Series object, it's more accurate to say that you'll often need to merge multiple DataFrames, instead of saying that you add one DataFrame to another.

There are different types of merges. Some are used more often than others, but each one is important. I'll cover the following topics:

 

 

How to Merge DataFrames

Before explaining each type of merge, it is important to touch upon how merging in Pandas works in general, and how it is similar to merging data with other tools. Armed with this knowledge, you can better understand when each of the different types of merges should be used in practice.

 

Left Merge

One of the most commonly used types of merges, it functions very similarly to a left outer join for those of you familiar with SQL. It creates a new DataFrame object by combining data from two DataFrames in such a way that all rows from the left-hand side DataFrame are included, while rows from the right-hand side DataFrame are included only if they match. The syntax for performing this merge in Pandas is very easy and doesn't change much when you want to perform some other type of merge.
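A short sketch of a left merge on two made-up DataFrames that share a customer_id column. Every customer is kept, and a customer without a matching order gets a missing value in the amount column.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": [1, 3, 4],
    "amount": [50, 75, 20],
})

# Keep all rows from customers; attach order data where the key matches.
left = customers.merge(orders, on="customer_id", how="left")
print(left)
```

Here customer 2 has no order, so its amount is missing, and the order for customer 4 is dropped because that key does not appear in the left-hand side DataFrame.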

 

Right Merge

Essentially a mirrored version of the left merge procedure. Creates a new DataFrame that includes all rows from the right-hand side DataFrame, and only those rows from the left-hand side DataFrame that match. It functions very similarly to a right outer join for those of you familiar with SQL. The syntax for doing such a merge is nearly identical to the syntax used in the previously mentioned merge procedure.

 

Inner Merge

The inner merge procedure will create a new DataFrame that includes only those rows that have key values present in both DataFrames. It functions very similarly to an inner join for those of you familiar with SQL. The syntax for doing such a merge is nearly identical to the syntax used in the previously mentioned merge procedures.

 

Outer Merge

Creates a new DataFrame that includes all rows from both DataFrames. It functions very similarly to a full outer join for those of you familiar with SQL. The syntax for doing such a merge is nearly identical to the syntax used in the previously mentioned merge procedures.
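To compare the four merge types side by side, the sketch below runs the same merge with each value of the how argument and counts the resulting rows. The two DataFrames are made up for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": [1, 3, 4],
    "amount": [50, 75, 20],
})

# Only the how argument changes between the merge types.
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = customers.merge(orders, on="customer_id", how=how)
    row_counts[how] = len(merged)

print(row_counts)   # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```

Keys 1 and 3 appear in both DataFrames, key 2 only on the left, and key 4 only on the right, which explains the row counts.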

 

How to Create Complex Derived Columns in Pandas

Simple derived columns are relatively easy to create because creating them boils down to combining data from already existing columns. However, a topic I still need to go over is how to create complex derived columns. 

I’ll cover the following topics:

 

 

Creating Derived Columns Using the where() Method

Technically speaking, the where() used here is a NumPy function, np.where(), and not part of Pandas. (Pandas objects do have their own where() method, but it behaves differently: it keeps the values where a condition holds and replaces the rest.) Because Pandas is built on top of NumPy, you can apply the NumPy function to data stored in Pandas DataFrames. You'll prefer it because it is a lot faster than some alternatives, since it leverages the power of vectorization. It is efficient enough that it belongs in your arsenal.
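A minimal sketch of a derived column built with np.where(), assuming a made-up score column and a pass mark of 60:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [45, 80, 62]})

# Vectorized condition: "pass" where score >= 60, otherwise "fail".
df["result"] = np.where(df["score"] >= 60, "pass", "fail")
print(df)
```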

 

Creating Derived Columns Using the apply() Method

While not as fast as the previously mentioned method, it is a lot more flexible. It allows you to apply any custom function to the data stored in your DataFrame, which can prove to be invaluable in certain situations.
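As a sketch of that flexibility, the example below applies a custom function to a single column and then a row-wise function that combines two columns. The grade() function, its thresholds, and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"student": ["s1", "s2", "s3"], "score": [45, 80, 62]})


def grade(score):
    """Hypothetical grading rule used only for illustration."""
    if score >= 75:
        return "A"
    if score >= 60:
        return "B"
    return "C"


# Apply the function to every value in one column.
df["grade"] = df["score"].apply(grade)

# Apply a function to every row (axis=1) to combine several columns.
df["label"] = df.apply(lambda row: f"{row['student']}:{row['grade']}", axis=1)
print(df)
```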

There are many prerequisites a data processing package must satisfy, and Pandas satisfies all of them, which is why it is the premier package for data analysis and processing in Python. After going through this series, you will not only understand the fundamentals of the basic building blocks of Pandas, the Pandas Series and Pandas DataFrame objects, but will also know how to leverage the power of Pandas to work with data in Python. If you read each article in it, you'll easily be able to start working on your first project and, with time, become an expert at using Pandas for data analysis and processing.

The only topics not covered in this series of articles are working with time series data and creating visualizations, but the reason for that is very simple: they are important enough to merit a series on their own, which is what we will focus on covering next.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.