Data ManagementETLCloud ServicesStrategyBusinessTechnicalEnterprise

3 Key Questions About Programmatic ETL and Apache Airflow, Answered

By August 16, 2018 No Comments

The task at hand was relatively straightforward – find an effective way to move data from “point A to point B”. However, there was a caveat. The source was not some flavor of database table or excel spreadsheet, but rather, a REST API. Immediately, I began to explore the various ETL options commonly found in enterprise deployments. However, none of the traditional ETL tools in my toolkit seemed to fit the bill, from both cost and speed-to-value standpoints. With no tool in sight, I determined that the best option may be to write some code.

Given that the majority of my development experience had been in Python, I began to scour the internet for any Python ETL recommendations. Among the large landscape of Medium articles and Stack Overflow posts, I kept seeing the same library appear over and over: Airflow. Intrigued, I began the deep dive.

For those of you unfamiliar with Airflow, let me do my best to give you the 30,000-foot overview of this programmatic ETL tool. First and foremost, Airflow is essentially a polished scheduler. Airflow allows you to construct, schedule, and monitor your workflows which are written entirely in custom Python code. Users construct their workflows as DAGS (Directed Acyclic Graphs) comprised of Python-tasks that are executed by Airflow’s built-in scheduler. In addition to a rich command line interface, Airflow also provides users with a built-in web app that assists with visualizing pipelines, monitoring tasks, and debugging processes. Because your workflows and tasks are coded entirely in Python code, your only constraints are the limits of Python itself. If it can be coded in Python, it can be incorporated into your Airflow orchestration pipeline.

Satisfied with my findings, I began developing my first Airflow pipeline comprised of three separate Python tasks:

  1. Extract the data from a REST API
  2. Save the API’s JSON output in Azure Blob Storage
  3. Migrate and transform the raw JSON from Azure Blob storage to a SQL Server Database

As I moved my fledgling Airflow pipeline into production, three questions permeated through my mind.

One: Did I feel Airflow was as capable as other ETL market alternatives?

Yes. Despite still residing in the Apache incubator, Airflow appeared very mature for its age. It had quality documentation, testing libraries, and features. Sure it had its quirks, but those are all part of the growing pains associated with a tool that remains open-source and resides in incubation. However, if you were to ask me to recommend a tool to organize and execute custom, Python ETL processes, Airflow immediately jumps to the top of my list. It continues to grow in the market with Google Cloud now offering a managed version of Airflow – Cloud Composer – removing the need to manage any infrastructure on your own. While, the next big release of Airflow will incorporate Kubernetes container management, positioning Airflow as a top choice for scalable ETL solutions.

Two: What kind of business would benefit from Airflow?

There really isn’t a single answer to this question. It’s in my opinion that there are some organizations that are better suited to manage and develop an entirely custom ETL solution. For example, an organization with a large number of programmers on staff would be able to stand up, test, and monitor an Airflow ETL setup with ease. If you have a team of Python developers on hand, why spend money to license an ETL tool that they may not have used before? For this type of organization, Airflow better leverages your developers strengths.

However, with this being said,  Airflow does not necessarily fulfill the “set and forget” types of ETL systems that companies with fewer technical employees tend to gravitate towards. For those companies, Airflow may serve as a great companion product to your legacy ETL system. For example, Airflow could be a tiny piece of your entire ETL process that only extracts and archives the data from your REST API’s. From there, you could pump your extracted data though your legacy ETL system, providing you with the best of both worlds, the flexibility of custom code with the rigidness of a more mature process.

Three: If I had to construct this process again would I do it the same?

On a prototype? Absolutely. On a revision of my project? Probably not — or at least not alone. This is by no means an assault on Airflow, but rather, nods to the notion that there isn’t necessarily a one-fit-all solution to a problem. Even upon a revision, I would still use Airflow to extract and archive my data from the REST API. From there, I would leverage an additional ETL tool during my data transformation and migration phase. This is by no-means the correct solution for everyone, but it is the solution that best leverages my strengths.

Compared to the other programmatic ETL tools on the market, Airflow continues to strengthen its market position by offering users a variety of unique features.  Due to its open source nature and growing-user adoption, I imagine that Airflow’s features will only become more robust in the near future, allowing Airflow to maintain its place as a top option for data-engineers.

Need help finding the best approach to moving your data? Contact us to learn more about the latest tools and techniques.