PySpark Cheat Sheet

A brief list of common PySpark methods and how to use them.
By Ciprian Stratulat • Sep 8, 2021

Set Up

Set Up PySpark 1.x

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext


Create a SparkContext:

sc = SparkContext()


Create a SQLContext:

sc = SparkContext()
sql_context = SQLContext(sc)


Create a HiveContext:

sc = SparkContext()
hive_context = HiveContext(sc)


Set Up PySpark 2.x

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


Set Up PySpark on AWS Glue

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())


Load Data

Create a DataFrame from RDD

Create a DataFrame using the .toDF() function:

population = [ ("Croatia", 4_058_000), ("Oregon", 4_218_000 ) ]

rdd = spark.sparkContext.parallelize(population)

cols = ["state", "population"]

df = rdd.toDF(rdd_columns)


Create a DataFrame using the createDataFrame() function:

population = [ ("Croatia", 4_058_000), ("Oregon", 4_218_000 ) ]

cols = ["state", "population"]

df = spark.createDataFrame(data=population, schema=cols)


Create a DataFrame using a combination of the createDataFrame() function and StructType schema:

population = [ ("Croatia", 4_058_000), ("Oregon", 4_218_000 ) ]

pop_schema = StructType([
    StructField("state", StringType(), True),	
    StructField("population", IntegerType(), True)])

df = spark.createDataFrame(data=population, schema=pop_schema)


Create a DataFrame from a Spark Data Source

Load a CSV file:

df ="sport.csv", sep=";", header=True, inferSchema=True)


Read a TXT file:

df ="names.txt")


Read a JSON file:

df ="fruits.json", format="json")


Read a PARQUET file:

df ="stock_prices.parquet")


df ="stock_prices.parquet")


Create a Glue DynamicFrame

dfg = glueContext.create_dynamic_frame.from_catalog(database="example_database", table_name="example_table")
spark_df = dfg.toDF()


Write Data

Write Data from a DataFrame in PySpark

df_modified.write.json("fruits_modified.jsonl", mode="overwrite")


Convert a DynamicFrame to a DataFrame and Write Data to AWS S3 Files

dfg = glueContext.create_dynamic_frame.from_catalog(database="example_database", table_name="example_table")


Repartition into one partition and write:

df = dfg.toDF().repartition(1)


Repartition by a column and write:

dfg.toDF().write.parquet("s3://glue-sample-target/outputdir/dfg", partitionBy=["example_column"])


Convert a DataFrame to a DynamicFrame and Write Data to AWS S3 Files

dfg = DynamicFrame.fromDF(df, glueContext, "dfg")

    connection_options={"path": "s3://glue-sample-target/outputdir/dfg"},


Inspect Data

Display Content

Display DataFrame content:


Display DataFrame schema:



Display DataFrame as a Pandas DataFrame:



Return DataFrame columns:



Return the first n rows of a DataFrame:



Return the first row of a DataFrame:



Display DynamicFrame schema:



Display DynamicFrame content by converting it to a DataFrame:



Analyze Content

Generate a basic statistical analysis of a DataFrame:


Count the number of rows inside a DataFrame:



Count the number of distinct rows:



Print the logical and physical plans:



Add, Remove, and Update Columns

Add Columns

Add columns with Spark native functions:

import pyspark.sql.functions as f

new_df = df.withColumn("column_3_multiplied", 3 * f.col("column_3_original"))


Add columns with user defined functions (UDFs):

import pyspark.sql.functions as f
from psyspark.sql.types import *

def example_func(filter_value):
	if values >= 5:
		return "enough free spots"
		return "not enough free spots"

my_udf = f.udf(example_func, StringType())

cinema_tickets = cinema.withColumn("free_spots", my_udf("spots") ) 


Remove Columns

Remove columns using column names:

sports = df.drop("football", "basketball")


Remove columns using chaining:

sports = sports.drop(


Modify Columns

Rename column:

df = df.withColumnRenamed("basketball", "BASKETBALL")


Remove duplicates based on data in a column:



Remove rows with missing values based on columns in the DataFrame:["basketball", "football"])


Impute missing data:


Select and Modify Data

Select Data

Select a single column:"basketball")


Select multiple columns:"basketball", "football")


Select a filtered version of a column:["goals"] >= 2)


Select a modified version of a column:["goals"] + 1)


Select Data with Conditional Arguments

Select using a "when otherwise" clause:"goals", f.when(df.goals == 0, "boring").otherwise("interesting"))


Select using "like":"sport","basketball"))


Select using "between":, 3))


Select using "startswith" or "endswith":"sports", df.players.startwith("B"))"s"))


Select a substring:, 4).alias("nickname"))


Group Data

Group data:



Group and aggregate data:

df.groupby("players").agg(spark_max("goals"), spark_min("goals"), spark_sum("goals").alias("total_goal_num")).show()


Filter Data

df.filter(df["goals"] > 3)


Sort Data

df.sort("goals", ascending=True).collect()




df.orderBy(["goals"], ascending = [0,1]).collect()


Repartition Data

Create multiple partitions:



Create a single partition:



Perform Joins

Perform an inner join:

df = df_1.join(df_2, on=["key"], how="inner")


Perform an inner join with conditions:

df = df_1.join(df_2, df_1.key < df_2.key, how="inner")


Perform an outer join:

df = df_1.join(df_2, on=["key"], how="outer")


Perform a left join:

df = df_1.join(df_2, on=["key"], how="left")


Perform a right join:

df = df_1.join(df_2, on=["key"], how="right")


Perform a left semi join:

df = df_1.join(df_2, on=["key"], how="left_semi")


Perform a left anti join:

df = df_1.join(df_2, on=["key"], how="left_anti")


Query Data

    SELECT *
    FROM people
        INNER JOIN places
            ON = LOWER(places.location)
Ciprian Stratulat

CTO | Software Engineer

Ciprian Stratulat

Ciprian is a software engineer and the CTO of Edlitera. As an instructor, Ciprian is a big believer in first building an intuition about a new topic, and then mastering it through guided deliberate practice.

Before Edlitera, Ciprian worked as a Software Engineer in finance, biotech, genomics and e-book publishing. Ciprian holds a degree in Computer Science from Harvard University.