The Ultimate Python Pandas Cheat Sheet

A list of common pandas methods and their uses.
By by Boris Delovski • Updated on May 2, 2023
blog image

Table of Contents

 

How to Import Required Packages in Python Pandas

import pandas as pd
import numpy as np

 

 

How to Create a Series From a Python Pandas List

 

With Default Index:

numbers = pd.Series([10, 20, 30, 40, 50])

 

With Custom Index Values:

numbers = pd.Series(
    [10, 20, 30, 40, 50], 
    index = [1, 2, 3, 4, 5])

 

How to Create an Empty DataFrame With Specific Columns in Python Pandas

df = pd.DataFrame(columns=["first_column", "second_column"])

 

Create a DataFrame From Series Objects in Python Pandas

odds = pd.Series([1, 3, 5, 7, 9])
evens = pd.Series([2, 4, 6, 8, 10])

df = pd.DataFrame({"Odd numbers": odds, "Even numbers": evens})

 

 

How to Create a DataFrame From a Python Pandas Dictionary

df = pd.DataFrame({
    'Letter': ['c', 'e', 'z', 'o'], 
    'Anima': ['cat', 'elephant', 'zebra', 'orangutan']})

 

How to Create a DataFrame With a Custom Index in Python Pandas

df = pd.DataFrame({
    "A" : [1, 2, 3],
    "B" : [4, 5, 6],
    "C" : [7, 8, 9]},
    index = ['u', 's', 't'])

 

 

How to Read Data Into a DataFrame in Python Pandas

 

From a CSV File:

csv_data = pd.read_csv("input.csv")

 

Specify Which Columns to Import From a CSV File:

csv_data = pd.read_csv("input.csv", usecols=['col1', 'col4', 'col5'])

 

Automatically Parse Dates From a CSV File:

csv_data = pd.read_csv("input.csv", parse_dates=['Start Date', 'End Date'])

 

Read a CSV File with No Header:

Below, you're reading a CSV file with two columns named colA and colB.

csv_data = pd.read_csv("input.csv", header=None, names=['colA', 'colB'])

 

Read From a Pipe-Delimited File:

pipe_delimited_data = pd.read_csv("input.csv", sep='|')

 

Read From an Excel File From a Specific Sheet:

excel_data =  pd.read_excel("input.xlsx", sheet_name="Sheet3")

 

Read From a Fixed-Width File:

fwf_data = pd.read_fwf("input.txt")

 

Read From a JSON File:

json_data = pd.read_json("input.json")

 

Read From a Parquet File:

parquet_data = pd.read_parquet("input.parquet.gzip", engine="pyarrow")

 

Read From a Database:

You will need the sqlalchemy and psycopg2 Python packages:

pip install sqlalchemy
pip install psycopg2

And then:

from sqlalchemy import create_engine

conn = create_engine('postgresql://user:pwd@db-name.endpoint.us-east-1.rds.amazonaws.com:5432/postgres')
sql_data = pd.read_sql_query("SELECT * FROM my_table", conn)

WARNING: do not store credentials in code!

 

How to Export Data in Python Pandas

 

Export CSV Files:

data.to_csv('output.csv')

 

Export CSV Files Without the Index:

data.to_csv('output.csv', index=False)

 

Export Pipe-Delimited Files:

data.to_csv('output.csv', sep='|')

 

Export Excel Files:

You may need openpyxl and xlsxwriter:

pip install openpyxl
pip install xlsxwriter

And then:

data.to_excel('output.xlsx', engine='xlsxwriter')

 

Export Excel Files With a Specific Sheet Name:

data.to_excel('output.xlsx', engine='xlsxwriter', sheet_name='Sheet2')

 

Export Parquet Files:

data.to_parquet("output.parquet.gzip",  compression="gzip")

 

Export Databases or Data Warehouses:

Suggested method:

 

  • Output to CSV or Parquet files in your data lake (e.g., AWS S3 buckets)
  •  Bulk import using  copy ... from ... or equivalent methods (e.g., AWS Redshift can import data directly from AWS S3)

 

How to Add or Remove Data From a DataFrame in Python Pandas

 

Add a Column:

df["new_column_name"] = "default_value"

 

Remove Columns:

df.drop(columns=["column_1", "column_2"])

 

Remove Columns (In Place):

df.drop(columns=["column_1", "column_2"], inplace=True)

 

 

Append a Row to a DataFrame

df1.append({'colA': 'some_value', 'colB': 'another_value'}, ignore_index=True)

 

Append the Rows of One DataFrame to the Rows of Another DataFrame:

df1.append(df2, ignore_index=True)

 

Append the Rows of One DataFrame to the Rows of Another DataFrame (Alternative Way):

pd.concat([df1, df2])

 

Append the Columns of One DataFrames to the Columns of Another DataFrame:

pd.concat([df1, df2], axis=1)

 

Remove Rows Using Row Index Labels (In Place):

df.drop(index=['a', 'b', 'z'], inplace=True)

 

Remove Rows That Meet a Certain Condition (In Place):

The code below removes (in place) all the rows where the Customer Id value is 0:

indices_to_drop = df[ df['Customer Id'] == 0 ].index
df.drop(indices_to_drop, inplace=True)

 

How to Merge Data in Python Pandas

 

Left Merge (Similar to a SQL Left Join): 

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="left")

 

Article continues below

Right Merge (Similar to SQL Right Join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="right")

 

Inner Merge (Similar to SQL Inner Join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="inner")

 

Outer Merge (Similar to SQL Full Join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="outer")

 

Merging On Index:

pd.merge(df1, df2,
    left_index=True, right_index=True,
    how="outer")

NOTE: when merging on index, you can do a left, right, inner or outer merge as well.

 

How to Select Data Using Integer Positions in Python Pandas

The integers below refer to the row and column position (first, second, etc.). They do not represent column names or row index labels. Also note that iloc uses square brackets [ ], not parentheses ( ).

 

Select the First Row and Return a Series:

df.iloc[0]

 

Select the Ninth Row and Return a Series:

df.iloc[8]

 

Select the Second, Third, and Fourth Rows:

df.iloc[1:4]

 

Select the Last 2 Rows:

df.iloc[-2:]

 

Select the First, Fourth and Fifth Rows:

df.iloc[[0, 3, 4]]

 

Select the First Column and Return a Series:

df.iloc[:, 0]

 

Select the First Column and Return a One-Column DataFrame:

df.iloc[:, [0]]

 

Select the Last Column and Return a Series:

df.iloc[:, -1]

 

Select All Columns Except for the First and Last:

df.iloc[:, 1:-1]

 

Select the First, Third and Fourth Columns:

df.iloc[:, [0, 2, 3]]

 

Select the Last Two Columns of the Second and Third Rows:

df.iloc[[1, 2], -2:]

 

Select the First and Last Columns of the First and Last Rows:

df.iloc[[0, -1], [0, -1]]

 

Select Every Other Column of the First and Last Rows:

df.iloc[[0, -1], ::2]

 

How to Select Data Using Labels (Column Names and Row Index Labels)

Any integers below refer to actual row index labels, in a hypothetical scenario where those row index labels are integers. They do not refer to actual row or column positions.

 

Select a Column Using the Column Name and Return a Series:

df["column_name"]

 

Select a Column Using the Column Name and Return a One-Column DataFrame:

df[["column_name"]]

 

Select Multiple Columns Using Column Names:

df[["column_1", "column_2"]]

 

Select the Row(s) With the Row Index Label "a":

This will raise an error if no rows have the specified index label.

df.loc["a"]

 

Select the Row(s) With the Row Index Label 0:

This is not the same as selecting the first row. There may be one or more rows with the row index label 0. This will raise an error if no rows have the specified label.

df.loc[0]

 

Select the Row(s) Which Have the Specified Datetime Object as the Row Index Label:

df.loc[datetime.datetime(2030, 12, 31)]

 

Select all the Rows Located Between the Row With Label 'a' and the Row With Label 'u':

This includes the rows with labels 'a' or 'u'. It may include no rows, one row, or more rows.

df.loc['a':'u']

 

Select the Rows With Labels 'a', 'u', or 'd':

This will return a DataFrame with more than 3 rows if several rows have labels 'a', 'u,' or 'd'.

df.loc[ ['a', 'u', 'd'] ]

 

Select the Rows With Labels 1, 5, 6:

This will return a DataFrame with more than 3 rows if several rows have labels 1, 5, or 6.

df.loc[ [1, 5, 6] ]

 

Select the Column Named 'Country' and Return a Series:

df.loc[:, 'Country']

 

Select the Column Named 'Country' and Return a Single Column DataFrame:

df.loc[:. ['Country']]

 

Select the Columns Named 'Country' and 'Surface':

df.loc[:, ['Country', 'Surface']]

 

Select the Rows and Columns With the Specified Labels:

df.loc[ ["row_label_1", "row_label_22"], ["column_name_1", "column_name_2 "] ]

 

How to Select Data Using Lists of Boolean Values

 

Select Specific Rows and Columns Using Lists of Boolean Values:

The first list must have as many boolean values as there are rows. The second list must have as many booleans as there are columns. Only rows and columns corresponding to True values will be returned:

df.iloc[ [True, True, False, False, False], [True, True, False, False] ]

or

df.loc[ [True, True, False, False, False], [True, True, False, False] ]

 

Select Specific Rows and Columns Using Combinations of Integer Positions and Lists of Booleans:

The code below returns the first two columns of the second, third, fourth, fifth, sixth, and seventh rows in the DataFrame.

df.iloc[1:7, [True, True, False, False]]

 

Select Specific Rows and Columns Using Combinations of Row Index Labels and Lists of Booleans:

The code below returns the first two columns of all the rows in the DataFrame located between the row with label 1 and the row with label 7, including.

df.loc[1:7, [True, True, False, False]]

 

 

How to Select Data Randomly in Python Pandas

 

Randomly Select 'n=50' Rows

df.sample(n=50)

 

Randomly Select a Fraction of Rows

 df.sample(frac=0.5)

 

How to Aggregate and Summarize Data in Python Pandas

 

How to Summarize Data Using a Pivot Table:

Below

  • index will determine the row MultiIndex
  • columns will determine the column MultiIndex
  • values are the values to aggregate
  • aggfunc will determine the aggregation functions to apply to each of the value columns
data.pivot_table(
    index=['RepName', 'Location'], 
    columns=['Year', 'Month'],
    values=['Helpfulness', 'Courtesy', 'Empathy'], 
    aggfunc=['mean', 'std']
)

 

How to Summarize Data Using a Pivot Table and Include Totals:

data.pivot_table(
    index=['RepName', 'Location'], 
    columns=['Year', 'Month'],
    values=['Helpfulness', 'Courtesy', 'Empathy'], 
    aggfunc=['mean', 'std'],
    margins=True
)

 

How to Summarize data Using 'groupby':

data[ ['Helpfulness', 
      'Courtesy', 
      'Empathy', 
      'RepName', 
      'Date'] ].groupby( ['RepName', 'Date'] ).mean()

 

How to Compute Multiple Aggregations Using 'groupby':

data[ ['Helpfulness', 
      'Courtesy', 
      'Empathy', 
      'RepName', 
      'Date'] ].groupby( ['RepName', 'Date'] ).agg( ['mean', 'std'] )

 

'groupby' Syntax Explained:

Groupby syntax

 

'groupby' Mechanics Explained:

Groupby mechanics

 

How to Apply Custom Data Transformations

 

Create a Derived Column From Existing Columns:

df['new_column'] = (df['col_A'] * 3 + df['col_B']) / df['col_C']

 

Create a Derived Column Using String Transformations of Existing Columns:

df['uppercase_names'] = df['names'].str.upper()
df['lowercase_names'] = df['names'].str.lower()
df['name_initial'] = df['names'].str[0]

 

Create a Derived Column Using a Custom Function:

def low_score(row):
    if (row['Helpfulness'] < 2 and
        row['Empathy'] < 2 and
        row['Courtesy'] < 2):
        return 1
    return 0

data['Low Score'] = data.apply(low_score, axis=1)

 

Update Columns Which Meet a Condition With a Specified Value:

The code below updates all the missing values in the column Location to be N/A:

data.loc[ data['Location'].isnull(), ['Location'] ] = 'N/A'

The code below updates all the values in the column Amount that are less than 100 to be 150:

data.loc[ data['Amount'] < 100, ['Amount'] ] = 150

 

How to Filter Data in Python Pandas

 

Select Rows Based On Simple Condition and Include All Columns:

df[ df['customer_id'] == 1121 ]

 

Select Rows Based On More Complex Condition and Include All Columns:

Use & for logical 'and,'  | for logical 'or,' and ~ for 'negations.'

df[ (df['customer_id'] > 1121) & (df['customer_id'] < 2232) ]

 

Combine Indexing and Filtering:

The code below will return the values in columns Letter and Animal of all the rows where the first character in the Animal column is greater than 'd' (i.e. 'e,' 'f,' 'g,' ...):

df.loc[ df['Animal'].str[0] > 'd', ['Letter', 'Animal'] ]

The code below will extract the data in the columns whose names have more than 6 characters:

df.loc[:, df.columns.str.len() > 6]

 

How to Sort Data in Python Pandas

 

Sort Data in Ascending Order:

df.sort_values(by=['col1', 'col2'])

 

Sort Data in Ascending Order (In Place):

df.sort_values(by=['col1', 'col2'], inplace=True)

 

Sort Data in Descending Order:

df.sort_values(by=["col_A"], ascending=False)

 

Sort the Index of a DataFrame:

df.sort_index()

 

Sort the Index of a DataFrame (In Place):

df.sort_index(inplace=True)

 

Reset the Index of a DataFrame:

df.reset_index()

 

How to Rename Column/Row Index Labels

 

Rename Columns:

df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"})

 

Rename Columns in Place:

df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"}, 
    inplace=True)

 

Rename Row Index Labels:

In the code below, rows index label 0 is replaced with 1 and row index label 1 is replaced with 2.

 df.rename(index={0: 1, 1: 2})

 

How to Describe Data in Python Pandas

 

Display the First Five Columns of a DataFrame:

df.head()

 

Display the Last Five Columns of a DataFrame:

df.tail()

 

Display the Column Index:

The code below will print an object containing the column names. It can be iterated over.

df.columns

 

Display the Row Index:

df.index

 

Display the Shape of the Data (Number of Rows/Columns):

df.shape

 

Display Basic Information About the Data:

df.info()

 

Display Basic Descriptive Statistics:

df.describe()

 

Return the Mean of Each Column as a Series:

df.mean()

 

Return the Median of Each Column as a Series:

df.median()

 

Return the Standard Deviation of Each Column as a Series:

df.std()

 

Return the Highest Value From Each Column:

df.max()

 

Return the Lowest Value From Each Column:

df.min()

 

Return the Pearson Correlation Between Columns:

df.corr(method='pearson')

 

How to Handle Missing Data in Python Pandas

 

How to Check for Missing Data:

df.isnull().any()

 

How to Return the Number of Non-Null Values in Each Column:

df.count()

 

How to Remove Rows With Missing Data:

df.dropna()

 

How to Remove Columns With Missing Data:

df.dropna(axis=1)

 

How to Replace Missing Values With a Default Value (e.g., 0) in Place:

data['Amount'].fillna(0, inplace=True)

 

How to Use Data types in Python Pandas

 

Display the Data Type of Each Column:

df.dtypes

 

 

Change the Data Type of a Column:

data['Country'] = data['Country'].astype('category')

Note: replace data_type with data types such as str, int, float, etc.