Surfalytics
Data Engineering · Intermediate · ⏱ 10–14 hours

Spark Transformations with a Testing Framework

Build a PySpark ETL pipeline structured like a production codebase, with pytest-based unit tests and CI integration.

Spark · PySpark · Testing · pytest · CI/CD

What you’ll build

A PySpark transformation pipeline organized as a Python package, with pytest-based unit tests that run against a local SparkSession. The structure mirrors what data engineering teams maintain in production: modular, testable, and CI-ready.

Skills you’ll practice

  • Structuring PySpark code as a testable Python package
  • Writing unit tests with pytest and a local SparkSession
  • Testing DataFrame transformations and asserting on output schemas
  • Integrating Spark tests into a CI pipeline