How Is Pandas Enhancing Data Analysis in Excel

This article explores the integration of Pandas in Excel.

Microsoft Excel has traditionally been the go-to tool for data analysis due to its versatility and user-friendly interface. It enables users to conduct complex calculations, statistical analysis, and data visualization without requiring advanced programming skills. However, over time, it has slowly started losing its throne to Pandas, at least in some companies.

Intro to Pandas: What is Pandas in Python

Although Pandas requires users to invest time in learning Python programming, this investment is minor compared to the significant advantages it offers over Excel. These benefits include efficient handling of large datasets, faster processing speeds, greater flexibility, simplified automation of tasks, and smooth integration with machine learning algorithms. As a result, many professionals have transitioned to Pandas to make use of its powerful capabilities.

Intro to Programming: How to Write and Run Code

Initially, it seemed inevitable that Excel would remain behind. However, the developers of Excel introduced a game-changing idea: integrating the power and flexibility of Python and Pandas directly into Excel.

In a previous article, we covered the fundamentals of using Python within Excel. Now, we will take a closer look at whether integrating Python into Excel can fully replicate the advantages of Pandas.

Python in Excel: The Future or Merely a Gimmick?

What Are the Differences between Python in Excel and Pandas

The simplest method to evaluate if Python in Excel matches the feel and functionality of Pandas in Python is to compare specific operations in both environments. However, this comparison might be somewhat misleading since Python integration in Excel allows users to import Python libraries just as they would in a standard Python file. Notably, one of these libraries is Pandas itself. Therefore, the comparison of Python in Excel with standard Pandas in Python environments becomes more complex. Essentially, it turns into a comparison between using Pandas directly from Excel and using it traditionally, such as running code in a Jupyter Notebook.

Since the code itself remains consistent whether executed in Python cells within Excel or code cells in a Jupyter Notebook, there is little to compare in terms of the code itself. Instead, the key comparison lies in the results generated by these two different environments. Specifically, Python code in Excel runs on Microsoft Azure in the cloud. Therefore, we will evaluate the outputs from both environments. The purpose of this is to determine if using Pandas from Excel is as efficient as running Pandas code in a Jupyter Notebook or if it falls significantly short.

How to Effectively Examine and Represent Data

When analyzing and processing data with Pandas, the initial step is usually to load the data and create a DataFrame. This process is quite simple, typically achieved using a single function in Pandas. The specific function depends on the file type of data. However, no matter the type of data, all we usually need to do is write a single line of code. For instance, to read data from an Excel file we would use the read_excel function, while for a CSV file, we would use the read_csv function, and so on.

Intro to Pandas: What is a Pandas DataFrame and How to Create One

In Excel, our data is already “loaded in”. To be more precise, in Excel, we work with spreadsheets that contain columns, rows, and cells where our data is already stored. Therefore, there is no need for any type of explicit data-loading process. However, there is an additional step to it. If you want to apply Pandas DataFrame methods for data processing, you first need to create a DataFrame from the data in your worksheet.

In this article, to demonstrate the usage of Pandas in Excel, we will work with the survey_sample.csv file. You can download this file by going to this link:

https://edlitera-datasets.s3.amazonaws.com/survey_sample.csv

Once you do so, just open the file with Excel. This is how the data in the file looks like:

Data 1

As can be seen, we have the following columns:

Customer Id
Helpfulness
Courtesy
Empathy
Overall Satisfaction
Facilities
Location
Rep Id
Date

The file contains ratings of sales representatives based on their skills, as well as ratings of overall customer satisfaction with the service and the facilities. Before we begin analyzing the data, we will create a Python cell to import the Pandas library. Libraries in Python are usually imported at the top of the code. Since code gets executed from top to bottom, importing libraries that we plan on using at the top helps prevent errors later in the execution.

Intro to Pandas: How to Add, Rename, and Remove Columns in Pandas

To import Pandas, click on a cell near our data, in the top row, convert it into a Python cell, and then just type in the necessary code:

Data 2

When using libraries in Python it is common to import libraries under an alias, because we often need to type in the library name to invoke certain functions from it. Pandas is usually imported under the alias pd because that significantly shortens how much code we need to write. Yet, it remains descriptive enough for others to understand which library is being used.

Because Excel automatically displays the output of the code executed you will see None when importing a library. The operation of importing a library enables us to use that library, but in essence, there is nothing returned. So if you successfully imported Pandas, the cell should return None:

Data 3

As a warning, most data processing libraries and visualization libraries are already preloaded. In other words, there is no need at all to import Pandasat to use it. However, in case you want more niche libraries, you will need to import them as demonstrated above.

With Pandas ready, we can go ahead and create a DataFrame from our data. To achieve this, we can simply use the DataFrame constructor from Pandas, specifying the data from the worksheet as the source for creating the DataFrame:

DataFrame 1

Now that the DataFrame has been created and stored inside the data variable, we can start performing operations on it. The simplest operation we can perform is just taking a look at the DataFrame itself. To be more precise, let us use the head() method to display the first five rows of the DataFrame:

DataFrame 2

This is the time to mention that there are two modes for running this code in Excel. By default, when you run the code in Excel, it returns the object resulting from the code execution. For instance, using the head() method will return a DataFrame containing only the first five rows of the main DataFrame. However, Excel does not display this new DataFrame directly. To display the DataFrame in the Excel cell rather than just showing the object type, you need to use the small toolbar next to the formula bar in Excel. Select that you want Excel to return as output NOT the Python object, but instead the Excel value:

Python Output

When you select that, the code is going to run again, but this time you will get a display of the object (in this case DataFrame):

DataFrame 3

As can be seen, we get a good display of the first five rows of our DataFrame. Comparing the first five rows here with the first five rows in the columns we have in the worksheet, you will notice that the data is identical. In other words, we successfully created a DataFrame from the data stored in the worksheet.

Article continues below

Want to learn more? Check out some of our courses:

Understanding and Deploying Edge AI

Learn More

Apache Spark and Data Stream Processing: A Crash Course

Learn More

Classic Machine Learning with Python

Learn More

How to Compute Statistics Using Pandas in Excel

To be fair, displaying our DataFrame and selecting data within is not the primary reason for using Pandas in Excel. After all, we can already see what our data looks like because it is there in our worksheet. Furthermore, some simple operations such as sorting and filtering are also easily performed in Excel. Therefore, there is no real benefit to focusing on these aspects. Instead, let us demonstrate the true power of Pandas in Excel by exploring two important methods that enable detailed data analysis with just a single line of code:

info() method
describe() method

How to Perform a Basic Analysis of Our Data

Using the info() method we can perform a basic analysis of our data. To be more precise, we can find out the following:

the number of rows and columns in the dataset, i.e. the total number of data points
the type of data we have in each column
the amount of memory utilized by our dataset stored in a DataFrame

To use the info() method, we just need to apply it to our data stored in the DataFrame. The result we will get looks like this (keep in mind to set up the calculation so that it returns Excel values, and not a Python object, the same way we did when we used the head() method):

Analysis of Our Data

As can be seen, certain outputs that have quite a particular format in Pandas will be displayed in the Diagnostics window. This is the same window where we would usually be warned about errors in our Pandas code. From the display in the Diagnostics window we can infer the following:

there are 902 rows and 9 columns in our dataset
seven columns contain numerical data
one column contains strings i.e. text
there is one column that contains timestamps
the “Location” column is missing some data
this DataFrame takes up approximately 64 KB of memory

That is an abundance of useful information we receive from just running a single line of code. Finding out all of this information in Excel would take us far longer.

How to Calculate Descriptive Statistics

A particularly useful Pandas method is the describe() method. It helps obtain descriptive statistics about your dataset. By default, it computes statistics only for numeric variables, but it can be adjusted to include non-numeric variables as well.

For numeric data, the statistics included in the results are:

Count: The number of entries in a column
Mean: The average value across entries
Std: Standard deviation, which measures the amount of variation
Min: The smallest value
25%: The 25th percentile
50%: The median or 50th percentile
75%: The 75th percentile
Max: The largest value

For non-numeric data, the statistics displayed are:

Count: The number of entries in the column
Unique: The number of distinct entries
Top: The most frequent entry
Freq: The frequency of the most frequent entry

Let us first demonstrate the results we get when analyzing the numeric columns in our dataset. To be more precise, we will ignore the Customer Id and Rep Id columns when calculating statistics because those two represent identification numbers. Also, we will ignore the Date column because it contains timestamps. Finally, since the Location column contains text we will also ignore that column. To calculate descriptive statistics, we simply need to specify the columns of interest and after that use the describe() method on that part of our DataFrame:

Calculate Descriptive Statistics

Once again, calculating all of these values would take significantly longer if we didn’t use Pandas. This way, we can create a detailed statistical analysis of our data using a single method.

For the sake of demonstration, let us select just the Location columns as our column of interest to show how our result will look like when we analyze non-numeric data:

Calculate Descriptive Statistics 2

How to Merge Data from Different Sources

Frequently, we need to consolidate data from various sources before analysis. In Excel, merging two spreadsheets with overlapping columns typically involves using the VLOOKUP() function. While effective, this method isn't the most efficient, as it requires several steps to combine the datasets based on shared columns. Conversely, using Pandas in Excel can simplify this process significantly. We are going to merge the data we have in our spreadsheet with the data that can be found on this link:

https://edlitera-datasets.s3.amazonaws.com/survey_mappings.xlsx

More precisely, we will merge our data with the data from the Reps sheet of the Excel file you can find on the link above. First, we will copy the data from that sheet to the one we are working with, and we will create a DataFrame from it. Let us call that DataFrame reps:

Merge Data from Different Sources

Now that our second DataFrame is ready, we can merge our two DataFrames using the merge() method from Pandas. To merge them using this method we just need to define which columns they have in common, and what type of merge we want to perform. In our case, we want to end up with a DataFrame that contains our original data together with a Rep Name column, a column that contains the names of the representatives connected to unique representative Ids. Therefore, we will perform a left merge based on the Rep Id columns in our two DataFrames. Let us also immediately display the first five rows of the resultant DataFrame:

Merge Data from Different Sources 2

As can be seen, we successfully merged the two DataFrames based on the information stored in the Rep Id column.

How to Summarize and Aggregate Data

Pivoting data is a frequent task in Excel, typically involving numerous clicks and selections through various menus. However, when using the Pandas library in Python, this process can be streamlined significantly. With Pandas, the same result can be accomplished using a single line of code, albeit somewhat longer, by employing the pivot_table() method. Let us demonstrate how to calculate the mean, median, and standard deviation for each representative across two different years using this approach:

define which columns we want to pivot around i.e. which columns we want to have in our index by defining them in the index argument of the method
define the columns for which we want to calculate the mean, median, and standard deviation by defining them in the values argument of the method
define that we want to calculate the mean, median, and standard deviation by defining them in the aggfunc argument in the method

Summarize and Aggregate Data

As can be seen, creating a pivot table, even one where we pivot around multiple columns and in which we aggregate data using multiple functions is quite easy.

In this article, we demonstrated the simplicity and power of using Pandas directly within Excel. We showcased how simple it is to execute complex operations with this integration. While the efficiency gains may not be immediately apparent in our small sample of about 900 rows, the benefits of Pandas become much more significant as the dataset size increases. Pandas offers superior scalability compared to traditional Excel methods. Therefore, integrating Python and Pandas into Excel is more than just a novelty. Even in its current preview and beta-testing phase, it represents a substantial enhancement. This functionality is likely to become essential for Excel users in the future, revolutionizing how we approach data analysis within spreadsheets.

How Is Pandas Enhancing Data Analysis in Excel

What Are the Differences between Python in Excel and Pandas

How to Effectively Examine and Represent Data

Want to learn more? Check out some of our courses:

How to Compute Statistics Using Pandas in Excel

How to Perform a Basic Analysis of Our Data

How to Calculate Descriptive Statistics

How to Merge Data from Different Sources

How to Summarize and Aggregate Data

CEO | Data Scientist

Claudia Virlanuta

How to Manage a Data Science Project

How to Handle Time Series Data Sets

Python Data Processing: What is NumPy?

How Is Pandas Enhancing Data Analysis in Excel

What Are the Differences between Python in Excel and Pandas

How to Effectively Examine and Represent Data

Want to learn more? Check out some of our courses:

How to Compute Statistics Using Pandas in Excel

How to Perform a Basic Analysis of Our Data

How to Calculate Descriptive Statistics

How to Merge Data from Different Sources

How to Summarize and Aggregate Data

CEO | Data Scientist

Claudia Virlanuta

Read Next

How to Manage a Data Science Project

How to Handle Time Series Data Sets

Python Data Processing: What is NumPy?