How to Serve Custom AI Models at Scale Using Modal

Ship custom ML and generative models without wrestling with DevOps. In this hands-on workshop, you’ll use Modal — a serverless compute platform for AI and data workloads — to define infrastructure in Python, run on-demand CPU/GPU hardware, and scale from one request to hundreds of workers in seconds. You’ll learn Modal’s core concepts, compare CPU vs. GPU execution, explore parallelism patterns, and deploy a production-style Stable Diffusion inference pipeline.

FREE

Upcoming cohort

Nov 26, 2025

Meeting time

Wednesday, 12:00 PM ET

What will I learn?

  • Explain why model serving is mostly an infrastructure problem and how Modal abstracts it
  • Define infrastructure as code with modal.App and modal.Image
  • Choose and attach the right hardware and reason about costs & burst scaling
  • Persist and share artifacts using Modal Volumes, and mount external storage
  • Securely manage secrets and environment configuration
  • Implement map-based and function-level parallelism for speedups on CPU workloads
  • Run PyTorch inference on GPU and measure CPU vs. GPU performance differences
  • Build a hybrid workflow that orchestrates locally and executes heavy steps remotely
  • Deploy a Stable Diffusion text-to-image service on Modal with cached model weights for fast cold starts

Curriculum

Why Infra Is Hard for AI

Provisioning, dependency hell, and scaling challenges

Modal Fundamentals

Apps, Images, functions, local entrypoints, secrets, and tokens

Data & State

Volumes for weights/datasets; mounting external stores; caching strategies

Parallelism Patterns

Map-based vs. function-level parallelism and when to use each

Hands-On Deployments

From simple remote functions to PyTorch GPU and Stable Diffusion on Modal

Hardware & Scaling

CPU vs. GPU selection, pricing, and burst capacity

Why Edlitera?

Build the coding, data, and AI skills you need, online, on your own schedule, from learning to code as a beginner to mastering cutting-edge data science, machine learning, and AI techniques.

Learning for the real world

Our courses are made with the input and feedback of top teams at Fortune 500 companies in Silicon Valley and on Wall Street.

No-fluff learning

Each minute of each course is packed full of insight, best practices and real-world experience from our expert instructors.

Learn by doing

Start writing code on your computer from Day One. Practice on hundreds of exercises. Apply your skills in mini-projects. Get instant feedback from video solutions.

Complete learning tracks

With over 150 hours of video lectures and hundreds of practice exercises and projects, our learning tracks will help you level up your skills whether you are a novice or an advanced learner.

What people are saying

"I walked into the bootcamp with some basic Python syntax and walked out with a much stronger, contextualized grasp of Python, an understanding of common mistakes, the ability to solve basic coding problems, and confidence in my ability to learn more."

Randi S., Edlitera Student

"I wanted to learn Python and be able to process data without being tied and limited by Excel and macros. These classes gave me all the tools to do so and beyond. The materials provided, the engagement of the class by the tutors and their availability to help us were excellent."

Gaston G., Edlitera Student

Course Syllabus

1. The Problem Space: Serving Models at Scale

  • Provisioning clouds/VMs, dependency conflicts, networking & security
  • Kubernetes complexity vs. developer velocity

2. What Is Modal & How It Works

  • Serverless compute for AI/data; infra defined in Python
  • App groups functions; Image defines the container environment
  • Fast, secure sandboxes; launch times often sub-second (gVisor isolation)
  • On-demand hardware
  • How billing works
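The core ideas above fit in a few lines of Python. Here is a minimal sketch (the app name, package, and function are placeholder examples, not part of the course materials):

```python
import modal

# An App groups related functions under one deployable unit.
app = modal.App("example-app")

# An Image defines the container environment in plain Python, layer by layer.
image = modal.Image.debian_slim(python_version="3.11").pip_install("numpy")

@app.function(image=image)
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal's cloud; .local() runs it in-process.
    print(square.remote(4))
```

Running `modal run app.py` executes `main` locally while `square` runs in a freshly provisioned sandbox, which is the "infra defined in Python" idea in miniature.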

3. Getting Started

  • Environment setup, modal token new, sign in with Google/GitHub
  • Default free credits; using the CLI and project scaffolding
  • Best practice: local orchestration + remote execution

4. Core Concepts by Example

  • App & Image: build a lightweight Image (e.g., requests, beautifulsoup4)
  • Remote function: fetch a page title on Modal; compare local vs. remote execution
  • How to handle secrets
  • Volumes: cache model weights/datasets to avoid repeated downloads; mounting S3
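A sketch of how these concepts combine on one function, assuming placeholder names for the Volume ("model-cache") and Secret ("my-api-key"), which you would create in your own Modal workspace:

```python
import modal

app = modal.App("core-concepts")
image = modal.Image.debian_slim().pip_install("requests", "beautifulsoup4")

# Persistent storage for weights/datasets; created on first use.
cache = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(
    image=image,
    volumes={"/cache": cache},                       # mounted inside the container
    secrets=[modal.Secret.from_name("my-api-key")],  # exposed as env vars
)
def fetch_title(url: str) -> str:
    # Imports live inside the function so only the remote container needs them.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").title.string
```

Calling `fetch_title.remote(url)` runs the scrape on Modal; anything written under /cache survives across calls via the Volume.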

5. Choosing the Right Hardware

  • When to use CPUs vs. GPUs
  • Cost/perf tradeoffs; matching task profile to instance type
  • Measuring gains: timing utilities, warmups, and GPU synchronization gotchas
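The measurement points above can be captured in a small timing helper. This is an illustration, not a Modal API; the `sync` hook is where a GPU barrier such as `torch.cuda.synchronize` would go:

```python
import time

def benchmark(fn, *args, warmup=2, iters=10, sync=None):
    """Time fn(*args): run warmup calls first, then average over iters.

    For GPU work, pass sync=torch.cuda.synchronize. This is a common
    gotcha: CUDA kernels launch asynchronously, so without a barrier
    the timer stops before the GPU has actually finished.
    """
    for _ in range(warmup):          # warm caches, JITs, and allocators
        fn(*args)
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync:
        sync()                       # ensure all queued work is done
    return (time.perf_counter() - start) / iters
```

Comparing the same workload with and without the warmup runs makes the cold-start effect visible before you ever touch a GPU.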

6. Parallelism Patterns in Modal

  • Map-based parallelism
  • Function-level parallelism
  • Understanding the Python GIL and when parallel workers beat bigger machines
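Both patterns can be sketched against a single toy function (names are placeholders; the sleep stands in for real work):

```python
import modal

app = modal.App("parallel-demo")

@app.function()
def slow_double(x: int) -> int:
    import time
    time.sleep(1)  # stand-in for CPU-bound work
    return 2 * x

@app.local_entrypoint()
def main():
    # Map-based: fan one function out over many inputs. Each input runs
    # in its own worker, so the local Python GIL never becomes the bottleneck.
    results = list(slow_double.map(range(100)))

    # Function-level: launch independent calls now, collect results later.
    calls = [slow_double.spawn(x) for x in (1, 2, 3)]
    print(results[:3], [c.get() for c in calls])
```

Map-based parallelism suits homogeneous batches; function-level spawning suits a handful of heterogeneous tasks you want to overlap.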

7. Hands-On: CPU & GPU Demos

  • CPU scaling demo
  • PyTorch CPU vs. GPU

8. Deploying a Generative Model: Stable Diffusion on Modal

  • Create a Volume for SD weights; one-off upload to cache artifacts
  • Add a Hugging Face token as a Secret; construct a diffusers pipeline
  • Attach hardware to the function; return a PIL image
  • Latency tips: model initialization strategies, keeping workers warm, batching & concurrency
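Putting the pieces together, a condensed sketch of the deployment might look like the following. The Volume and Secret names and the model checkpoint are illustrative assumptions, not the course's exact code:

```python
import modal

app = modal.App("stable-diffusion-demo")
image = modal.Image.debian_slim().pip_install(
    "diffusers", "transformers", "accelerate", "torch"
)
weights = modal.Volume.from_name("sd-weights", create_if_missing=True)

@app.function(
    image=image,
    gpu="A10G",                                       # attach hardware here
    volumes={"/weights": weights},
    secrets=[modal.Secret.from_name("huggingface")],  # HF token as env var
)
def generate(prompt: str) -> bytes:
    import io
    import torch
    from diffusers import StableDiffusionPipeline

    # Weights land under /weights on the first run, so later cold starts
    # skip the multi-gigabyte download.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # example checkpoint
        cache_dir="/weights",
        torch_dtype=torch.float16,
    ).to("cuda")
    pil_image = pipe(prompt).images[0]      # a PIL image
    buf = io.BytesIO()
    pil_image.save(buf, format="PNG")
    return buf.getvalue()
```

Moving the pipeline construction out of the function body and into a container-lifecycle hook is one of the warm-start strategies the session covers.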

9. Operating Considerations

  • Structuring repos for Modal apps; environments and reproducibility
  • Cost controls: right-sizing CPU/GPU, pay-per-second awareness, job timeouts
  • Safety & reliability: timeouts, retries, input validation, resource scoping
  • Extending to APIs: wrapping Modal functions behind HTTP endpoints; versioning and rollbacks
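Several of these controls are plain function parameters in Modal. A minimal sketch, with illustrative values:

```python
import modal

app = modal.App("ops-demo")

@app.function(
    timeout=120,  # seconds before a runaway call is killed
    retries=modal.Retries(   # retry transient failures with backoff
        max_retries=3,
        initial_delay=1.0,
        backoff_coefficient=2.0,
    ),
)
def predict(prompt: str) -> str:
    # Validate inputs before doing any expensive work.
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()
```

Because billing is per-second, a tight `timeout` doubles as a cost control: a hung job stops accruing charges at the deadline instead of running indefinitely.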

Have a question?

Contact us any time, we'd love to hear from you!