Surfalytics
Data Engineering · Intermediate · ⏱ 8–12 hours

Apache Spark: Distributed Processing

Learn PySpark fundamentals by running distributed transformations locally with Docker, then scaling to a cloud cluster.

Spark · PySpark · Docker · Distributed Computing

What you’ll build

You'll build PySpark transformation jobs that run on a local Spark cluster in Docker and process real datasets. The project covers the core DataFrame API, partitioning, joins, and aggregations, all of which come up regularly in data engineering interviews.

Skills you’ll practice

  • PySpark DataFrame API: transformations, actions, schemas
  • Partitioning strategies and shuffle optimization
  • Running Spark in Docker (local cluster mode)
  • Understanding Spark’s execution model and DAG
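The partitioning and shuffle items in the list above can be illustrated without Spark at all. The sketch below is a toy model in plain Python, not Spark's implementation: it assumes a CRC32-based hash as a stand-in for Spark's partitioner, and the function names and sample records are hypothetical. It shows why combining values per key before the shuffle (as Spark does with partial aggregation) reduces the number of records that must cross the network.

```python
import zlib
from collections import Counter, defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a target partition for a key (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(records):
    """Sum values per key within one partition."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return list(partial.items())


def shuffle(records, num_partitions):
    """Route each record to the partition that owns its key."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


# Two input partitions of (country, 1) records, as in a count-by-key job.
input_partitions = [
    [("US", 1), ("US", 1), ("DE", 1)],
    [("US", 1), ("DE", 1), ("DE", 1), ("FR", 1)],
]

# Without a map-side combine, every record is shuffled.
naive_shuffled = [rec for part in input_partitions for rec in part]

# With a map-side combine, each partition sends at most one record per key.
combined_shuffled = [rec for part in input_partitions for rec in combine(part)]

# After the shuffle, all records for a key sit in one bucket, so a final
# combine per bucket yields the global totals.
buckets = shuffle(combined_shuffled, num_partitions=2)
totals = {}
for bucket in buckets.values():
    for key, value in combine(bucket):
        totals[key] = value

print(len(naive_shuffled), len(combined_shuffled), totals)
```

In this toy input, 7 records shrink to 5 before the shuffle. On real data with many repeated keys per partition the reduction is usually far larger, which is the intuition behind most shuffle optimizations.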