The core offering of Surfalytics is a 12-module course that contains everything you need to know about data engineering, business intelligence, and analytics.
The course includes both theoretical and practical lessons. They are organized in order of complexity, starting from the simplest and progressing to more advanced topics. Each new piece of information will build upon existing knowledge.
The course also provides numerous links to external materials, training, and books. The primary goal is to give you foundational knowledge and empower you to succeed in your data career.
Let’s discuss the requirements and the course further:
First, here is what you need to work comfortably with data during the course:
- Internet access 😀
- Preferably a screen size of 15 inches or larger
- Ideally 16 GB of RAM (8 GB minimum); with less, your machine will lag
- Operating system: Windows or macOS; Linux will also work
- A credit card, which AWS may ask for during registration (not needed before Module 4)
- Telegram - where you can join our chat
- A GitHub account (the first homework explains how to set up and use GitHub)
- English reading proficiency
- Ability to use Google and ChatGPT 😀
- A social media presence, so you can post about the course 😀
Getting Started with Analytics (Data) Engineering: this course draws on my work as an individual contributor in data engineering and analytics and on my 14+ years of experience building analytical solutions in Canada, the USA, Russia, and Europe. If I were hiring a data engineer or BI engineer, I would want them to have the knowledge and competencies that we’ll cover in the course. The course includes basics like Business Intelligence tools, databases, ETL tools, cloud computing, and much more.
Even if you have no experience with data, that won’t be a hindrance. The first few modules focus on the basics of analytics and classic Business Intelligence tasks: reporting, visualization, data warehousing, SQL, Excel, and data integration. This will be enough for roles like BI developer or data analyst. Starting around Module 5 or 6, we will move directly into the work of a Data Engineer, building on the knowledge gained in the earlier modules.
The course consists of 12 modules:
Module 1: The Role of Analytics and Data Engineering in the Organization
Let’s get acquainted with the subject of study. We’ll find out who a Data Engineer is, what other tech roles exist, what they do, and what other titles they go by. Most importantly, we’ll understand how they help businesses be more efficient and make money. We’ll examine the typical architectures of analytical solutions.
Module 2: Databases and SQL
We will look at a solution for local analytics. We’ll familiarize ourselves with databases and understand their advantages over Excel/Google Sheets for working with data. We’ll practice SQL, set up a database, and upload data to it. Then we’ll use Excel/Google Sheets for data visualization.
Module 3: Data Visualization, Dashboards, and Reporting - Business Intelligence (BI)
We’ll get acquainted with BI tools and learn how to use Tableau, Looker, and Power BI. We’ll delve into the client and server aspects. We’ll discuss the tasks and theory of data visualization and real examples of BI solution implementations. We’ll also get to know Pirate Metrics, a methodology for defining metrics.
Module 4: Data Integration and Creating Data Pipelines
As the number of data sources grows, it becomes challenging to upload and transform data manually. ETL solutions are used for these tasks. We will also discuss the difference between ETL and ELT. We’ll then survey the solutions on the market and practice with an open-source tool, which we’ll use to load data into Redshift and automate the process.
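To make the ETL versus ELT distinction from Module 4 concrete, here is a minimal sketch. It is not course material: it assumes a tiny hypothetical set of order rows and uses Python’s built-in `sqlite3` as a stand-in for a real warehouse such as Redshift. The names `RAW_ORDERS`, `run_etl`, `run_elt`, and `totals` are invented for illustration. In ETL the rows are cleaned in application code before loading; in ELT the raw rows are loaded first and cleaned with SQL inside the database.

```python
import sqlite3

# Hypothetical raw order rows, as they might arrive from a source system.
RAW_ORDERS = [
    ("1001", " alice ", "19.99"),
    ("1002", "BOB", "5.00"),
    ("1003", "alice", "30.01"),
]


def run_etl(conn):
    """ETL: transform in application code, then load the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, customer TEXT, amount REAL)"
    )
    cleaned = [
        (int(order_id), customer.strip().lower(), float(amount))
        for order_id, customer, amount in RAW_ORDERS
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows as-is, then transform with SQL inside the database."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM orders_raw
        """
    )


def totals(conn, table):
    """Total order amount per customer, sorted by customer name."""
    query = (
        f"SELECT customer, ROUND(SUM(amount), 2) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals(conn, "orders_etl")
elt_totals = totals(conn, "orders_elt")
print(etl_totals)
print(elt_totals)
```

Both paths end with the same cleaned table. The practical difference is where the transformation runs: ELT keeps the raw data in the database, so a transformation can be changed and re-run without extracting the data again.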
Module 5: Cloud Computing
We will find out what is behind the concept of cloud computing, how it’s used in the West, and why it’s so popular. We’ll familiarize ourselves with the analytical solutions of Amazon Web Services and Microsoft Azure. We will look at real-life examples of cloud migration.
Module 6: Cloud Data Storage
In analytics, the center of the universe is usually the data warehouse or data platform. Typically, this is an analytical solution with a massively parallel processing (MPP) architecture, and cloud solutions are often used. We’ll acquaint ourselves with one of the most popular solutions, Amazon Redshift, and learn about other similar platforms. We’ll also discuss case studies of migrating traditional solutions to the cloud.
Module 7: Introduction to Apache Spark
Apache Spark is one of the most popular tools for Data Engineers. This module introduces Apache Spark and examines its functionality. We will practice creating RDDs and DataFrames, and consider key operations and use cases.
Module 8: Creating Big Data Solutions Using the Hadoop Ecosystem
Hadoop is the flagship of Big Data solutions. In this module, we will tackle a problem that cannot be handled by traditional ETL/DW tools, which will help you understand the difference between DW and Big Data and why we use Hadoop. As the processing engine, we will use Spark, which comes pre-installed on Amazon EMR (Elastic MapReduce). For the exercise, we’ll use PySpark to read unstructured logs and extract valuable information from them.
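As a rough illustration of what “reading unstructured logs and extracting valuable information” in Module 8 can look like, here is a small sketch. It deliberately uses plain Python rather than PySpark, so it runs without a cluster. It also assumes a hypothetical web server log in a Common Log Format-like layout. The sample lines and the names `LOG_PATTERN` and `parse_line` are invented for illustration and are not taken from the course exercise.

```python
import re
from collections import Counter

# Hypothetical web server log lines; one of them is deliberately malformed.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 512',
    'this line is corrupted and should be skipped',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "POST /api/orders HTTP/1.1" 201 128',
]

# Named groups turn each free-text line into a structured record.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


def parse_line(line):
    """Return a dict of fields for a well-formed line, or None otherwise."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record


# Parse every line and drop the ones that do not match the pattern.
records = [r for r in map(parse_line, LOG_LINES) if r is not None]

# Two simple aggregations over the structured records.
requests_per_ip = Counter(r["ip"] for r in records)
error_paths = [r["path"] for r in records if r["status"] >= 400]

print(requests_per_ip)
print(error_paths)
```

The same parse, filter, and aggregate shape carries over to Spark, where a per-line function like `parse_line` would typically be applied across a distributed dataset instead of a Python list.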
Module 9: Introduction to the Data Lake
There are many views on the purpose of a Data Lake and its role in the analytical ecosystem. In this module, we will familiarize ourselves with the concept of a Data Lake and look at typical architectures for building solutions using one. We will use the AWS and Azure clouds.
Module 10: Getting Started with Streaming
This module delves deep into the challenges and solutions associated with data streaming. As data becomes more real-time and voluminous, handling continuous data streams is crucial. Participants will understand the nuances of streaming data and the importance of timely data processing, and will get hands-on experience with tools and technologies that facilitate efficient data streaming.
Module 11: Machine Learning Tasks Through the Eyes of a Data Engineer
Data Engineering and Machine Learning are interlinked in many modern analytical workflows. This module helps participants view Machine Learning tasks from a data engineering perspective. It covers the challenges of preparing large datasets for training, efficient storage and retrieval systems for ML models, and ensuring that data pipelines are ML-ready.
Module 12: Best Practices of a Data Engineer
This module covers both the technical and non-technical aspects of the role, from applying DevOps, CI/CD, and Infrastructure as Code (IaC) in data operations to developing soft skills for effective communication and teamwork. It closes with a discussion of potential career trajectories and growth opportunities in data engineering.