Data Engineering · Intermediate · ⏱ 10–14 hours
Spark Transformations with a Testing Framework
Build a PySpark ETL pipeline structured like a production codebase, with pytest unit tests and CI integration.
Spark · PySpark · Testing · pytest · CI/CD
What you’ll build
A PySpark transformation pipeline organized as a proper Python package, with pytest-based unit tests that run against a local Spark session. The structure mirrors production data engineering codebases: modular, testable, and CI-ready.
Skills you’ll practice
- Structuring PySpark code as a testable Python package
- Writing unit tests with pytest + local SparkSession
- Testing DataFrame transformations and schema assertions
- Integrating Spark tests into a CI pipeline