How to Serve Custom AI Models at Scale Using Modal
Ship custom ML and generative models without wrestling with DevOps. In this hands-on workshop, you’ll use Modal — a serverless compute platform for AI and data workloads — to define infrastructure in Python, run on-demand CPU/GPU hardware, and scale from one request to hundreds of workers in seconds. You’ll learn Modal’s core concepts, compare CPU vs. GPU execution, explore parallelism patterns, and deploy a production-style Stable Diffusion inference pipeline.
FREE
Upcoming cohort: Nov 26, 2025
Meeting time: Wednesday, 12:00 PM ET
What will I learn?
- Explain why model serving is mostly an infrastructure problem and how Modal abstracts it
- Define infrastructure as code with modal.App and modal.Image
- Choose and attach the right hardware and reason about costs & burst scaling
- Persist and share artifacts using Modal Volumes, and mount external storage
- Securely manage secrets and environment configuration
- Implement map-based and function-level parallelism for speedups on CPU workloads
- Run PyTorch inference on GPU and measure CPU vs. GPU performance differences
- Build a hybrid workflow that orchestrates locally and executes heavy steps remotely
- Deploy a Stable Diffusion text-to-image service on Modal with cached model weights for fast cold starts
Curriculum
Why Infra is Hard for AI
Modal Fundamentals
Data & State
Parallelism Patterns
Hands-On Deployments
Hardware & Scaling
Why Edlitera?
Build the coding, data, and AI skills you need, online, on your own schedule, from learning to code as a beginner to mastering cutting-edge data science, machine learning, and AI techniques.

Learning for the real world
Our courses are made with the input and feedback of top teams at Fortune 500 companies in Silicon Valley and on Wall Street.
No-fluff learning
Each minute of each course is packed full of insight, best practices and real-world experience from our expert instructors.
Learn by doing
Start writing code on your computer from Day One. Practice on hundreds of exercises. Apply your skills in mini-projects. Get instant feedback from video solutions.
Complete learning tracks
With over 150 hours of video lectures and hundreds of practice exercises and projects, our learning tracks will help you level up your skills whether you are a novice or an advanced learner.
What people are saying
"I walked into the bootcamp with some basic Python syntax and walked out with a much stronger, contextualized grasp of Python, an understanding of common mistakes, the ability to solve basic coding problems, and confidence in my ability to learn more."
Randi S., Edlitera Student
"I wanted to learn Python and be able to process data without being tied and limited by Excel and macros. These classes gave me all the tools to do so and beyond. The materials provided, the engagement of the class by the tutors and their availability to help us were excellent."
Gaston G., Edlitera Student
Course Syllabus
1. The Problem Space: Serving Models at Scale
- Provisioning cloud VMs, resolving dependency conflicts, networking & security
- Kubernetes complexity vs. developer velocity
2. What Is Modal & How It Works
- Serverless compute for AI/data; infra defined in Python
- App groups functions; Image defines the container environment
- Fast, secure sandboxes; launch times often sub-second (gVisor isolation)
- On-demand hardware
- How billing works
3. Getting Started
- Environment setup, modal token new, sign in with Google/GitHub
- Default free credits; using the CLI and project scaffolding
- Best practice: local orchestration + remote execution
4. Core Concepts by Example
- App & Image: build a lightweight Image (e.g., requests, beautifulsoup4)
- Remote function: fetch a page title on Modal; compare local vs. remote execution
- How to handle secrets
- Volumes: cache model weights/datasets to avoid repeated downloads; mounting S3
5. Choosing the Right Hardware
- When to use CPUs vs. GPUs
- Cost/perf tradeoffs; matching task profile to instance type
- Measuring gains: timing utilities, warmups, and GPU synchronization gotchas
6. Parallelism Patterns in Modal
- Map-based parallelism
- Function-level parallelism
- Understanding the Python GIL and when parallel workers beat bigger machines
7. Hands-On: CPU & GPU Demos
- CPU scaling demo
- PyTorch CPU vs. GPU
8. Deploying a Generative Model: Stable Diffusion on Modal
- Create a Volume for SD weights; one-off upload to cache artifacts
- Add a Hugging Face token as a Secret; construct a diffusers pipeline
- Attach hardware to the function; return a PIL image
- Latency tips: model initialization strategies, keeping workers warm, batching & concurrency
9. Operating Considerations
- Structuring repos for Modal apps; environments and reproducibility
- Cost controls: right-sizing CPU/GPU, pay-per-second awareness, job timeouts
- Safety & reliability: timeouts, retries, input validation, resource scoping
- Extending to APIs: wrapping Modal functions behind HTTP endpoints; versioning and rollbacks
Have a question?
Contact us any time, we'd love to hear from you!