It is undeniable that Excel is the most commonly used tool in data analysis. Its simplicity combined with how powerful it is when mastered makes it the top choice for many people. However, when compared with Pandas, it quickly becomes obvious just how limited Excel really is. It simply can't stand shoulder to shoulder with Pandas in terms of what it offers. So the real question is why do more people use Excel than Pandas?
The answer to that question is very simple. Most people are not familiar with Python and find the idea of learning any programming language scary. Also, humans are creatures of habit. Once we get into a routine, we tend to stay in it. For most data analysts, that routine is using Excel. Switching over to Pandas and learning some Python seems so problematic and time-consuming that most people would rather stay in their comfort zone than take the time to learn something completely new, and potentially hard.
But the truth is, learning Pandas is not hard, even for people that are unfamiliar with programming. A big misconception is that one needs to be an expert programmer to use a tool like Pandas. In reality, it requires just some basic Python knowledge. Of course, you could choose to keep learning after mastering the basics, but stopping when you learn just enough Python to use Pandas effectively isn't against any rules. Writing code is not exclusively reserved for programmers, and knowing a little bit of coding goes a long way.
All of this assumes that Pandas is a tool good enough to warrant spending some extra time to learn. Most people that switch to Pandas become so accustomed to its speed that they forget how much longer it took them to finish the same task using Excel. This becomes especially obvious once we consider the amount of automation that can be achieved using Python and Pandas.
Advantages of Pandas over Excel
The advantages of Pandas over Excel are just products of how Pandas works. Because it is built on NumPy (Numerical Python), Pandas boasts several advantages over Excel:
- Scalability - Pandas is only limited by hardware and can manipulate larger quantities of data.
- Speed - Pandas is much faster than Excel, which is especially noticeable when working with larger quantities of data.
- Automation - A lot of the tasks that can be achieved with Pandas are extremely easy to automate, reducing the amount of tedious and repetitive tasks that need to be performed daily.
- Interpretability - It is very easy to interpret what happens when each task is run, and it is relatively easy to find and fix errors.
- Advanced functions - Performing advanced statistical analyses and creating complex visualizations is very straightforward.
These are not the only advantages of Pandas over Excel. One of the biggest advantages comes from Python itself. Pandas is built to perfectly integrate with machine learning modules in Python. This brings about so many benefits that it would require an article of its own to properly explore. For the sake of brevity, we can just say that learning Pandas leaves data analysts only a step away from transitioning to machine learning engineers.
Pandas vs Excel comparison
Let's compare how a few common tasks are done in Excel, along with their counterparts in Pandas, to debunk the myth that Pandas is extremely hard and takes a lot of time to learn. We will use the IMDB-Movie-data dataset to demonstrate the following tasks:
- Looking at and representing data
- Selecting data
- Sorting data
- Filtering data
- Subtracting data
- Merging datasets
- Pivoting data
Looking at and representing data
In Excel, data is represented as an Excel spreadsheet. That spreadsheet contains columns, rows, and cells. There are no special commands for looking at a spreadsheet. It is automatically there when we open a file that is in a compatible format. Let's take a look at the spreadsheet that represents our dataset.
In Pandas, we read data from tables using the appropriate read method. For example, if we want to read an Excel file, we are going to use the read_excel method. Because the data we are using in this example is in the form of a CSV file, we are going to use the read_csv method. When we read data, we store our data in what we call a DataFrame to make sure we can easily access it. To look at our data, we are going to create a DataFrame and name it "movies". Once our DataFrame is created, we can look at it. To do that, we use the head() method. By default, it shows the top five rows, but for this example, let's display the top 16 rows.
movies = pd.read_csv("IMDB-Movie-data.csv") movies.head(16)
It is very easy to select data in Excel. We can use keyboard shortcuts or even just click with a mouse. Let's select the Metascore column by clicking on it.
Selecting data in Pandas is very straightforward. First, we write the name of our DataFrame, and then follow that up with the name of the column we want to select in square brackets. Let's select the same columns we selected in Excel, the "Metascore" column.
Selecting multiple columns is also easy. We just need to enter the name of the columns we want to select.
To demonstrate how sorting works, we are going to sort our data based on the "Metascore" column in ascending order. To sort data in Excel, we need to select the data that we want to sort.
Afterward, we can click the "Sort and Filter" button. When given the option to choose how we want to sort, let's select the option to sort our data by the values inside the "Metascore" column in ascending order.
The result we will get is:
We successfully sorted our data. The movie with the smallest value in the "Metascore" column is now the first row of our table.
To sort data in Pandas, we will use the sort_values() method. We just need to define the column that we want to sort by. By default, Pandas sorts in ascending order.
To demonstrate filtering data, we are going to filter our dataset to show only movies that came out in 2016. First, we need to select the first cell in the "Title" column. Then we need to go to the "Data" tab and click on "Filter".
We can see little arrows appear next to the column names.
If we click on the arrow next to "Year", we will see that currently, all years are selected.
We want to look at only the movies in the year 2016, so we need to select only that year.
The result we get is a spreadsheet that holds only those rows where the value inside the "Year" column is 2016.
Filtering data in Pandas is very similar to selecting data. Let's filter our original DataFrame. We need to select the DataFrame we want to filter and then write the condition inside the square brackets.
We need to specify that we want to use the "movies" DataFrame, and that we want to look at only the part of the DataFrame where the value of the "Year" column is 2016.
movies[movies["Year"] == 2016]
To subtract in Excel, we need to first enter a new column name. Let's call that column "Quality_difference".
That column is going to hold the values that we get when we subtract the value inside the "Rating" column from the value inside the "Metascore" column. To do that, we need to click on the cell beneath the one holding the column name and enter the formula.
Once the formula has been entered and the value for that field has been calculated, we can double click on the bottom part of that field to propagate the formula through the whole column.
To achieve the same result in Pandas, we need to write a single line of code. We first specify the name of the new column, and then how to calculate the value that will be inside it.
movies["Quality_difference"] = movies["Metascore"] movies["Rating"]
We achieve the same result as with Excel, but with a lot less work.
If we have two spreadsheets and want to merge them using some columns they have in common, we use VLOOKUP(). Let's merge our original table with one that holds the "Title" and "Reader_score" columns.
To connect the second table to the original one based on the "Title" columns (since they both have it), we will first create a column named "Reader_score" in our first spreadsheet. This is the name of the column that will be added when we merge our first table with the second one.
After we have created the column, we need to merge the tables based on the "Title" column. To do that, we will click into the first empty cell of the column we just created in our original table and start writing the VLOOKUP() code.
click ctrl+A and it will take us to a screen that looks like this:
In the "Lookup_value" field, we need to enter the first row of the "Title" column, which is B2. Since we will also propagate that code through the whole column later, we will freeze this value so the result looks like this:
Afterward, in the "Table_array" field, we need to input the whole second table we have, starting with A1 and finishing with B1001. When we do that, we will also freeze these values to make sure we can propagate the formula through the whole column. The result looks like this:
Next, we need to add a value to the "Col_index_num" field. We will enter the number two because that is the number of the column that we want to merge with our first spreadsheet:
To finish, we will set "Range_lookup" to 0. This will signal to the program that we are searching for exact matches.
The result we get from this is:
To copy the formula, we can just double-click on the bottom part of the first row.
It is very easy to merge datasets using Pandas. We can perform the whole procedure with two lines of code, using the merge function. The first line loads the second dataset into a second DataFrame, and the next line performs the merging procedure. When merging, we specify:
- the first dataset for the merge
- the second dataset for the merge
- how we want to merge our data
- the column they have in common
reader_scores = pd.read_csv("reader_scores.csv") movies = pd.merge(movies, reader_scores, how="left", on="Title")
The merging procedure in Pandas is very flexible. Aside from performing merges like this, we can also modify the merging procedure by modifying the "how" variable inside our code. We can use left merges, right merges, inner merges, and outer merges by changing only one variable.
To pivot a table in Excel, the first thing we need to do is go to the "Insert" tab and then click on "PivotTable". Let's select our current table and select that we want to create a new worksheet. Follow that up by clicking on OK:
Our screen will change to look like this:
In the right panel, we need to select that we want to look at the "Year" and "Runtime (Minutes)" columns. We also need to select "Year" under ROWS and "Runtime (Minutes)" under VALUES. The resulting table will look like this:
By default, it will sum up the values in the "Runtime(Minutes)" column. Since we want to find out the average length of a movie for each year, we need to click under VALUES on the bottom right and then select "Value Field Settings".
Once inside the panel, we can select that we want to show the average values:
Our final result is:
To pivot data, we use the Pandas pivot_table function. We need a single, although a bit longer, line of code to create pivot tables in Pandas. We need to specify:
- a name for our new table that will be created by pivoting
- the dataset that we want to pivot by defining it in the data variable
- the column that will be our index by defining it in the index variable
- the column that we want to analyze by defining it in the values variable
- that we want to use the "mean" function by defining it in the aggfunc variable
Let's separate that one line into multiple lines to match the steps above.
table = pd.pivot_table( data=movies, index="Year", values="Runtime (Minutes)", aggfunc= "mean")
To finish, let's demonstrate the true power of Pandas by showing just how much we can achieve with very little code. We will show off two different methods that give us a lot of information and require only one line of code.
A very useful Pandas method is the describe() method. We use it for descriptive statistics on all variables in our dataset. By default, it analyzes only numeric data, but we can modify it to also analyze non-numeric data.
For numeric data, the result’s index will include:
- count - describes the number of data points inside a column
- mean - mean value
- std - standard deviation
- min - the smallest value
- 25% - 25th percentile values
- 50% - 50th percentile value
- 75% - 75th percentile value
- max - the biggest value
For non-numeric data, the result’s index will include:
- count - describes the number of data points inside that column
- unique - shows the number of unique values
- top - shows the most common value
- freq - shows the most common value’s frequency
Code for numeric:
Code for non-numeric:
We use the info() method to find out some general information about our dataset. It will give us information about:
- the type of data we have in each column
- how many data points there are
- how much memory our dataset takes up
Once you learn how to use it, Pandas is, in many aspects, superior to Excel. Speed, scalability, automation, etc. are all advantages of Pandas over Excel and should motivate you to at the very least consider changing the tool they use for data analysis. However, perhaps the biggest benefit to using Pandas for this type of data analysis is that Pandas is easy.
If you don't know how to program, anything connected to programming might seem hard and time-consuming, but this couldn't be further from the truth when it comes to Pandas. If your work involves a lot of data wrangling or analysis, switching from Excel to Pandas could help you automate a great deal of the more boring parts of your work.
Gaining some basic proficiency in Pandas takes very little time, even if you have no programming experience. Just as with every other tool, Pandas is very easy to use on a basic level. Unlike other tools, you can achieve a lot using just the basics of Pandas. In fact, the basics might be all you ever need to learn.
In this article, we showed you how to perform a few of the most common tasks in both Pandas and Excel. Excel might seem simpler to use at first because a lot of tasks can be completed by opening different tabs or clicking on cells. Hopefully the examples in this article proved that you can perform different tasks much faster in Pandas than in Excel. This does not mean that you have to completely switch to Pandas from Excel. Instead, you can start by using Pandas for those tasks that you find tedious and repetitive.
Change is hard, and learning to use a new tool can seem difficult and time-consuming enough to give up before even starting. Give Pandas a try! You might find that you enjoy using it a lot, or that you enjoy all the time it can save you. In the end, the important thing is to keep an open mind and at least try out a tool that might make your life simpler.