Adventures in Data Engineering: GitHub Actions & CI/CD (part 1)

Ryan Howe
7 min read · Nov 16, 2023

Intro

I started doing data work as an analyst, and while my team had its own GitHub repo, most of the team didn’t use it. Most of the team was relatively new to data work, and there wasn’t much need for version control outside of some reports and a process used for finance/accounting. Eventually I moved on and became a data engineer, and version control and CI/CD are now much more relevant to my work.

When I worked at Facebook, a lot of this was pretty seamless and owned by other teams. Facebook used a number of different tools, like Mercurial and Phabricator for version control, along with something called Sandcastle (https://engineering.fb.com/2017/08/31/web/rapid-release-at-massive-scale/) for test automation. That tooling wasn’t as important for data engineering testing, though. Data engineering had an extension within VS Code called the Data Workbench (Airflow recently released something similar and I’m spacing on the name). Anyway, my point is that I never had to set this stuff up myself before, and I’ve been looking into doing that.

This post is about setting up a basic GitHub Actions workflow for Python to help with formatting, linting, and testing. I was able to find some similar articles on Medium pretty easily; however, I found that they were out of date and didn’t cover everything, so I’m going to walk through some of that material to show the issues and my process.

Getting Started
