Python Pandas Cheat Sheet

Last updated on Jul 16, 2021

Import required packages

import pandas as pd
import numpy as np

 

Create a Series from a list

With default index:

numbers = pd.Series([10, 20, 30, 40, 50])

With custom index values:

numbers = pd.Series(
    [10, 20, 30, 40, 50], 
    index = [1, 2, 3, 4, 5])

 

Create an empty DataFrame with specific columns

df = pd.DataFrame(columns=["first_column", "second_column"])

 

Create a DataFrame from Series objects

odds = pd.Series([1, 3, 5, 7, 9])
evens = pd.Series([2, 4, 6, 8, 10])

df = pd.DataFrame({"Odd numbers": odds, "Even numbers": evens})

 

Create a DataFrame from a dictionary

df = pd.DataFrame({
    'Letter': ['c', 'e', 'z', 'o'], 
    'Animal': ['cat', 'elephant', 'zebra', 'orangutan']})

 

Create a DataFrame with a custom index

df = pd.DataFrame({
    "A" : [1, 2, 3],
    "B" : [4, 5, 6],
    "C" : [7, 8, 9]},
    index = ['u', 's', 't'])

 

Read data into a DataFrame

From a CSV file:

csv_data = pd.read_csv("input.csv")

From a CSV file, specifying which columns you want to import:

csv_data = pd.read_csv("input.csv", usecols=['col1', 'col4', 'col5'])

From a CSV file, automatically parsing dates:

csv_data = pd.read_csv("input.csv", parse_dates=['Start Date', 'End Date'])

From a CSV file with no header:

Below, we're reading a CSV file with two columns, which we'll name colA and colB.

csv_data = pd.read_csv("input.csv", header=None, names=['colA', 'colB'])

From a pipe-delimited file:

pipe_delimited_data = pd.read_csv("input.csv", sep='|')

From an Excel file, from a specific sheet:

excel_data = pd.read_excel("input.xlsx", sheet_name="Sheet3")

From a fixed-width file:

fwf_data = pd.read_fwf("input.txt")

From a JSON file:

json_data = pd.read_json("input.json")

From a Parquet file:

You may need the pyarrow package:

pip install pyarrow

And then:

parquet_data = pd.read_parquet("input.parquet.gzip", engine="pyarrow")

From a database:

You will need the sqlalchemy and psycopg2 Python packages:

pip install sqlalchemy
pip install psycopg2

And then:

from sqlalchemy import create_engine

conn = create_engine('postgresql://user:password@hostname:5432/postgres')
sql_data = pd.read_sql_query("SELECT * FROM my_table", conn)

WARNING: do not store credentials in code!
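
One safer alternative (a minimal sketch; DB_USER and DB_PASSWORD are hypothetical environment variable names) is to read the credentials from environment variables:

import os
from sqlalchemy import create_engine

# Read the credentials from the environment instead of hard-coding them
user = os.environ["DB_USER"]
password = os.environ["DB_PASSWORD"]

conn = create_engine(f"postgresql://{user}:{password}@hostname:5432/postgres")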

 

Export data

To CSV files:

data.to_csv('output.csv')

To CSV files, without the index:

data.to_csv('output.csv', index=False)

To pipe-delimited files:

data.to_csv('output.csv', sep='|')

To Excel files:

You may need openpyxl and xlsxwriter:

pip install openpyxl
pip install xlsxwriter

And then:

data.to_excel('output.xlsx', engine='xlsxwriter')

To Excel files, specifying the sheet name:

data.to_excel('output.xlsx', engine='xlsxwriter', sheet_name='Sheet2')

To Parquet files:

data.to_parquet("output.parquet.gzip", compression="gzip")

To databases or data warehouses:

Suggested method (see the sketch below):

  • output to CSV or Parquet files in your data lake (e.g. AWS S3 buckets)
  • bulk import using COPY ... FROM ... or equivalent methods (e.g. AWS Redshift can import data directly from AWS S3)
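
A minimal sketch of this approach, assuming a hypothetical s3://my-data-lake bucket and an AWS Redshift warehouse (writing directly to S3 requires the s3fs package):

data.to_parquet("s3://my-data-lake/exports/output.parquet")

# Then, on the warehouse side (Redshift SQL, not Python;
# the IAM role ARN is a placeholder):
# COPY my_table
# FROM 's3://my-data-lake/exports/output.parquet'
# IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
# FORMAT AS PARQUET;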

 

Add or remove data from a DataFrame

Add a column:

df["new_column_name"] = "default_value"

Remove columns:

df.drop(columns=["column_1", "column_2"])

Remove columns (in place):

df.drop(columns=["column_1", "column_2"], inplace=True)

Append a row to a DataFrame:

NOTE: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the pd.concat alternatives below work in all recent versions.

df1.append({'colA': 'some_value', 'colB': 'another_value'}, ignore_index=True)
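
In pandas 2.0 and later, the equivalent with pd.concat is:

pd.concat(
    [df1, pd.DataFrame([{'colA': 'some_value', 'colB': 'another_value'}])],
    ignore_index=True)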

Append the rows of one DataFrame to the rows of another DataFrame:

df1.append(df2, ignore_index=True)

Append the rows of one DataFrame to the rows of another DataFrame (alternative way):

pd.concat([df1, df2])

Append the columns of one DataFrame to the columns of another DataFrame:

pd.concat([df1, df2], axis=1)

Remove rows using row index labels, in place:

df.drop(index=['a', 'b', 'z'], inplace=True)

Remove rows which meet a certain condition, in place:

The code below removes (in place) all the rows where the Customer Id value is 0:

indices_to_drop = df[ df['Customer Id'] == 0 ].index
df.drop(indices_to_drop, inplace=True)

 

Merge data

Left merge (similar to a SQL left join): 

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="left")

Right merge (similar to SQL right join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="right")

Inner merge (similar to SQL inner join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="inner")

Outer merge (similar to SQL full join):

pd.merge(df1, df2,
    left_on="df1_column", right_on="df2_column",
    how="outer")

Merging on index:

pd.merge(df1, df2,
    left_index=True, right_index=True,
    how="outer")

NOTE: when merging on index, we can do a left, right, inner or outer merge as well.
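
As a shortcut for index-on-index merges, you can also use DataFrame.join (assuming the two DataFrames have no overlapping column names; otherwise pass lsuffix/rsuffix):

df1.join(df2, how="outer")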

 

Select data using integer positions

The integers below refer to the row and column position (first, second, etc.). THEY DO NOT REPRESENT COLUMN NAMES OR ROW INDEX LABELS. Also note that iloc uses square brackets [ ], not parentheses ( ).

Select the first row and return a Series:

df.iloc[0]

Select the ninth row and return a Series:

df.iloc[8]

Select the second, third and fourth rows:

df.iloc[1:4]

Select the last 2 rows:

df.iloc[-2:]

Select the first, fourth and fifth rows:

df.iloc[[0, 3, 4]]

Select the first column and return a Series:

df.iloc[:, 0]

Select the first column and return a one-column DataFrame:

df.iloc[:, [0]]

Select the last column and return a Series:

df.iloc[:, -1]

Select all columns except for the first and last:

df.iloc[:, 1:-1]

Select the first, third and fourth columns:

df.iloc[:, [0, 2, 3]]

Select the last two columns of the second and third rows:

df.iloc[[1, 2], -2:]

Select the first and last columns of the first and last rows:

df.iloc[[0, -1], [0, -1]]

Select every other column of the first and last rows:

df.iloc[[0, -1], ::2]

 

Select data using labels (column names and row index labels)

Any integers below refer to actual row index labels, in a hypothetical scenario where those row index labels are integers. THEY DO NOT REFER TO ACTUAL ROW OR COLUMN POSITIONS.
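
To illustrate the difference, here is a small hypothetical DataFrame whose integer row labels do not match the row positions:

df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

df.loc[0]   # the row labeled 0 (the second row, where x == 20)
df.iloc[0]  # the first row by position (labeled 2, where x == 10)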

Select a column using the column name and return a Series:

df["column_name"]

Select a column using the column name and return a one-column DataFrame:

df[["column_name"]]

Select multiple columns using column names:

df[["column_1", "column_2"]]

Select the row(s) with the row index label 'a':

Will raise an error if no rows have the specified index label.

df.loc["a"]

Select the row(s) with the row index label 0:

This is not the same as selecting the first row. There may be one or more rows with the row index label 0. Will raise an error if no rows have the specified label.

df.loc[0]

Select the row(s) which have the specified datetime object as the row index label:

import datetime

df.loc[datetime.datetime(2030, 12, 31)]

Select all the rows located between the row with label 'a' and the row with label 'u':

The result includes the rows labeled 'a' and 'u' themselves, and may contain zero, one, or more rows.

df.loc['a':'u']

Select the rows with labels 'a', 'u', or 'd':

Will return a DataFrame with more than 3 rows if several rows have labels 'a', 'u' or 'd'.

df.loc[ ['a', 'u', 'd'] ]

Select the rows with labels 1, 5, 6:

Will return a DataFrame with more than 3 rows if several rows have labels 1, 5 or 6.

df.loc[ [1, 5, 6] ]

Select the column named 'Country' and return a Series:

df.loc[:, 'Country']

Select the column named 'Country' and return a single column DataFrame:

df.loc[:, ['Country']]

Select the columns named 'Country' and 'Surface':

df.loc[:, ['Country', 'Surface']]

Select the rows and columns with the specified labels:

df.loc[ ["row_label_1", "row_label_22"], ["column_name_1", "column_name_2"] ]

 

Select data using lists of boolean values

Select specific rows and columns using lists of boolean values:

The first list must have as many boolean values as there are rows. The second list must have as many booleans as there are columns. Only rows and columns corresponding to True values will be returned:

df.iloc[ [True, True, False, False, False], [True, True, False, False] ]

or

df.loc[ [True, True, False, False, False], [True, True, False, False] ]

Select specific rows and columns using combinations of integer positions and lists of booleans:

The code below returns the first two columns of the second, third, fourth, fifth, sixth and seventh rows in the DataFrame.

df.iloc[1:7, [True, True, False, False]]

Select specific rows and columns using combinations of row index labels and lists of booleans:

The code below returns the first two columns of all the rows in the DataFrame located between the row with label 1 and the row with label 7, inclusive.

df.loc[1:7, [True, True, False, False]]

 

Select data randomly

Randomly select n=50 rows:

df.sample(n=50)

Randomly select a fraction (here, 50%) of the rows:

df.sample(frac=0.5)
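
For reproducible samples, pass a seed via random_state:

df.sample(n=50, random_state=42)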

 

Aggregate and summarize data

Summarize data using a pivot table:

In the code below:

  • index will determine the row MultiIndex
  • columns will determine the column MultiIndex
  • values are the values to aggregate
  • aggfunc will determine the aggregation functions to apply to each of the value columns

data.pivot_table(
    index=['RepName', 'Location'], 
    columns=['Year', 'Month'],
    values=['Helpfulness', 'Courtesy', 'Empathy'], 
    aggfunc=['mean', 'std']
)

Summarize data using a pivot table and include totals:

data.pivot_table(
    index=['RepName', 'Location'], 
    columns=['Year', 'Month'],
    values=['Helpfulness', 'Courtesy', 'Empathy'], 
    aggfunc=['mean', 'std'],
    margins=True
)

Summarize data using groupby:

data[ ['Helpfulness', 
      'Courtesy', 
      'Empathy', 
      'RepName', 
      'Date'] ].groupby( ['RepName', 'Date'] ).mean()

Compute multiple aggregations using groupby:

data[ ['Helpfulness', 
      'Courtesy', 
      'Empathy', 
      'RepName', 
      'Date'] ].groupby( ['RepName', 'Date'] ).agg( ['mean', 'std'] )
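
In pandas 0.25 and later, named aggregation lets you control the output column names (the names on the left are just examples):

data.groupby( ['RepName', 'Date'] ).agg(
    mean_helpfulness=('Helpfulness', 'mean'),
    std_helpfulness=('Helpfulness', 'std'))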

 

Apply custom data transformations

Create a derived column from existing columns:

df['new_column'] = (df['col_A'] * 3 + df['col_B']) / df['col_C']

Create a derived column using string transformations of existing columns:

df['uppercase_names'] = df['names'].str.upper()
df['lowercase_names'] = df['names'].str.lower()
df['name_initial'] = df['names'].str[0]

Create a derived column using a custom function:

def low_score(row):
    if (row['Helpfulness'] < 2 and
        row['Empathy'] < 2 and
        row['Courtesy'] < 2):
        return 1
    return 0

data['Low Score'] = data.apply(low_score, axis=1)
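
The same flag can usually be computed faster with a vectorized expression instead of apply:

data['Low Score'] = ((data['Helpfulness'] < 2) &
                     (data['Empathy'] < 2) &
                     (data['Courtesy'] < 2)).astype(int)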

Update columns which meet a condition with a specified value:

The code below updates all the missing values in the column Location to be N/A:

data.loc[ data['Location'].isnull(), ['Location'] ] = 'N/A'

The code below updates all the values in the column Amount that are less than 100 to be 150:

data.loc[ data['Amount'] < 100, ['Amount'] ] = 150

 

Filter data

Select rows based on simple condition and include all columns:

df[ df['customer_id'] == 1121 ]

Select rows based on more complex condition and include all columns:

Use & for logical and, | for logical or, and ~ for negation.

df[ (df['customer_id'] > 1121) & (df['customer_id'] < 2232) ]

Combine indexing and filtering:

The code below will return the values in columns Letter and Animal of all the rows where the first character in the Animal column is greater than 'd' (i.e. 'e', 'f', 'g', ...):

df.loc[ df['Animal'].str[0] > 'd', ['Letter', 'Animal'] ]

The code below will extract the data in the columns whose names have more than 6 characters:

df.loc[:, df.columns.str.len() > 6]

 

Sort data 

Sort data in ascending order:

df.sort_values(by=['col1', 'col2'])

Sort data in ascending order, in place:

df.sort_values(by=['col1', 'col2'], inplace=True)

Sort data in descending order:

df.sort_values(by=["col_A"], ascending=False)

Sort the index of a DataFrame:

df.sort_index()

Sort the index of a DataFrame, in place:

df.sort_index(inplace=True)

Reset the index of a DataFrame:

df.reset_index()

 

Rename column / row index labels

Rename columns:

df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"})

Rename columns in place:

df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"}, 
    inplace=True)

Rename row index labels:

In the code below, row index label 0 is replaced with 1, and row index label 1 is replaced with 2.

df.rename(index={0: 1, 1: 2})

 

Describe data

Display the first five rows of a DataFrame:

df.head()

Display the last five rows of a DataFrame:

df.tail()

Display the column index:

The code below will print an object containing the column names. It can be iterated over.

df.columns

Display the row index:

df.index

Display the shape of the data (number of rows / columns):

df.shape

Display basic information about the data:

df.info()

Display basic descriptive statistics:

df.describe()

Return the mean of each column as a Series:

df.mean()

Return the median of each column as a Series:

df.median()

Return the standard deviation of each column as a Series:

df.std()

Return the highest value from each column:

df.max()

Return the lowest value from each column:

df.min()

Return the Pearson correlation between columns:

df.corr(method='pearson')

 

Handle missing data

Check for missing data:

df.isnull().any()
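
Count the missing values in each column:

df.isnull().sum()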

Return the number of non-null values in each column:

df.count()

Remove rows with missing data:

df.dropna()

Remove columns with missing data:

df.dropna(axis=1)

Replace missing values with a default value (e.g. 0), in place:

data['Amount'].fillna(0, inplace=True)
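
In newer pandas versions, assigning the result back is preferred over using inplace on a single column:

data['Amount'] = data['Amount'].fillna(0)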

 

Data types

Display the data type of each column:

df.dtypes

Change the data type of a column:

data['Country'] = data['Country'].astype('category')

 

*Replace 'category' with other data types such as str, int, float, etc., as needed.