Data Engineering · Intermediate · ⏱ 8–12 hours

Apache Spark: Distributed Processing

Learn PySpark fundamentals by running distributed transformations locally with Docker, then scaling to a cloud cluster.

Spark · PySpark · Docker · Distributed Computing

What you’ll build

PySpark transformation jobs that run on a local Spark cluster in Docker and process real datasets. The project covers the core DataFrame API, partitioning, joins, and aggregations, all of which come up frequently in data engineering interviews.

Skills you’ll practice

PySpark DataFrame API: transformations, actions, schemas
Partitioning strategies and shuffle optimization
Running Spark in Docker (local cluster mode)
Understanding Spark’s execution model and DAG
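The partitioning and shuffle items above can be pictured with a small pure-Python model. This is not Spark code and does not use Spark's API: the function names are hypothetical, and `zlib.crc32` is only a deterministic stand-in for Spark's own hash function. The sketch shows the idea behind hash partitioning: a shuffle moves every row with the same key into the same partition, so each partition can then aggregate its keys independently.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a target partition from a hash of the key."""
    # Stand-in hash for illustration; Spark uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) rows so equal keys share a partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            shuffled[partition_for(key, num_partitions)].append((key, value))
    return shuffled


def sum_by_key(partitions):
    """Aggregate each partition on its own, then merge the results."""
    totals = {}
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Safe to merge without combining: after the shuffle, each key
        # lives in exactly one partition.
        totals.update(local)
    return totals


if __name__ == "__main__":
    # Key "a" starts out split across two input partitions.
    before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
    after = shuffle_by_key(before, num_partitions=3)
    print(sum_by_key(after))
```

In a real cluster the redistribution step moves data between machines over the network, which is why reducing the number and size of shuffles is a common optimization target.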