Data Engineering · Intermediate · ⏱ 8–12 hours
Apache Spark: Distributed Processing
Learn PySpark fundamentals by running distributed transformations locally with Docker, then scaling to a cloud cluster.
Spark, PySpark, Docker, Distributed Computing
What you’ll build
You will build PySpark transformation jobs that run on a local Spark cluster in Docker and process real datasets. The project covers the core DataFrame API, partitioning, joins, and aggregations, all of which come up frequently in data engineering interviews.
Skills you’ll practice
- PySpark DataFrame API: transformations, actions, schemas
- Partitioning strategies and shuffle optimization
- Running Spark in Docker (local cluster mode)
- Understanding Spark’s execution model and DAG
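To make the partitioning and shuffle item above more concrete, here is a minimal sketch in plain Python (not PySpark, and not Spark's actual implementation) of how a keyed aggregation routes rows to partitions by hashing the key, and why pre-aggregating within each input partition reduces the number of records that cross the shuffle. All function names and the sample data are hypothetical and exist only for illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition index with a stable hash.

    zlib.crc32 is used because Python's built-in hash() for strings is
    randomized per process. Spark uses its own hash function; this is
    only a stand-in for the idea.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records, num_partitions):
    """Route (key, value) records so that equal keys share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def combine_locally(partition):
    """Sum values per key within a single partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return list(totals.items())


def sum_by_key(input_partitions, num_partitions, map_side_combine):
    """Return (totals, number of records sent through the shuffle)."""
    to_shuffle = []
    for part in input_partitions:
        # With map-side combine, each input partition emits at most one
        # record per distinct key instead of one record per row.
        rows = combine_locally(part) if map_side_combine else list(part)
        to_shuffle.extend(rows)

    totals = {}
    for part in shuffle(to_shuffle, num_partitions):
        # After the shuffle, every key lives in exactly one partition,
        # so each partition can be reduced independently.
        for key, value in combine_locally(part):
            totals[key] = value
    return totals, len(to_shuffle)


# Hypothetical sample: two input partitions of (country, amount) rows.
input_partitions = [
    [("us", 1), ("us", 2), ("de", 3)],
    [("us", 4), ("de", 5), ("de", 6)],
]

naive_totals, naive_moved = sum_by_key(input_partitions, 3, map_side_combine=False)
combined_totals, combined_moved = sum_by_key(input_partitions, 3, map_side_combine=True)

print(naive_totals == combined_totals)  # True: same result either way
print(naive_moved, combined_moved)      # 6 4: fewer records shuffled
```

Both runs produce the same totals, but the pre-aggregated run sends four records through the shuffle instead of six. The gap grows with the number of rows per key, which is the intuition behind reducing shuffle volume before a wide transformation.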