Data Engineering · Beginner · ⏱ 4–6 hours

PyArrow: High-Performance Data Processing

Discover PyArrow for fast columnar data processing in Python. On large datasets and Parquet files it is often significantly faster than Pandas, and this project has you measure that yourself.

PyArrow · Python · Parquet · Performance

What you’ll build

A set of data processing scripts that use PyArrow to read, transform, and write Parquet files, benchmarked against Pandas so you can see the performance difference firsthand. The techniques are relevant for pipelines that handle millions of rows.

Skills you’ll practice

  • PyArrow Tables, Schemas, and data types
  • Reading and writing Parquet files efficiently
  • Arrow-native computation vs Pandas interop
  • Understanding columnar storage and why it matters for analytics