Airflow and Databricks: Shorten the Learning Curve with Prophecy

Maximize productivity without the complexity.

Pooja Singhal
July 25, 2024

In today's dynamic data processing landscape, orchestrating complex workflows efficiently is critical. Solutions such as Databricks Workflows are great for scheduling jobs within Databricks itself, but their features are optimized specifically for that platform. Often, users need additional capabilities, especially when they are moving away from legacy tools like Alteryx and integrating with modern cloud data solutions across the ecosystem. In these cases, Apache Airflow emerges as a solution. Recognized as a powerful open-source platform, Airflow empowers organizations of all sizes to schedule, monitor, and manage intricate data workflows, including pipelines, ETL processes, and task automation across a variety of data architectures.

The Challenge with Mastering Airflow

Mastering Airflow can be challenging due to its Python-centric architecture, which revolves around concepts like Directed Acyclic Graphs (DAGs), Operators, and Executors. The learning curve is steep, especially for data users who are expected to write Python DAGs. Airflow is also highly extensible: you can write custom operators to tailor workflows to specific use cases. But building custom operators demands advanced Python programming skills and a deep understanding of Airflow's internal workings.
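
To give a sense of what that involves, below is a minimal sketch of a custom operator in plain Airflow. The operator name and the file-check logic are purely illustrative:

```python
import os

from airflow.models.baseoperator import BaseOperator


class ValidateInputFilesOperator(BaseOperator):
    """Illustrative custom operator that fails the task if expected files are missing."""

    def __init__(self, file_paths, **kwargs):
        super().__init__(**kwargs)
        self.file_paths = file_paths

    def execute(self, context):
        # Airflow calls execute() when the task runs; raising an exception fails the task.
        missing = [path for path in self.file_paths if not os.path.exists(path)]
        if missing:
            raise FileNotFoundError(f"Missing input files: {missing}")
        self.log.info("All %d input files are present", len(self.file_paths))
```

Even a small operator like this assumes familiarity with Airflow's task lifecycle, logging, and packaging conventions, which is exactly where the learning curve bites.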

Streamlining Workflow Development

Prophecy’s Airflow visual canvas greatly simplifies the process of developing Airflow DAGs (the collections of tasks you plan to run) and enhances productivity for data engineers and analysts. The UI provides a one-stop, end-to-end solution for orchestrating Airflow Jobs through a drag-and-drop visual interface that is easy for all data users to master. Behind the scenes, Prophecy translates these visual Jobs into highly optimized Python Airflow code, stored in Git, eliminating vendor lock-in while ensuring complete accessibility and openness for all users. Moreover, users can extend functionality by adding their own custom operators and sensors through Prophecy’s Gem Builder interface.
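
For context on the kind of code this produces, a hand-written Airflow DAG with two dependent tasks looks roughly like the sketch below. The DAG id and commands are placeholders, and this is not Prophecy's exact generated output, just standard Airflow Python of the sort that ends up versioned in Git:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hand-written sketch of a two-task DAG; placeholder names and commands.
with DAG(
    dag_id="daily_sales_refresh",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # load runs only after extract succeeds
```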

What sets Prophecy apart is its visual approach, designed to minimize the learning curve and accelerate time-to-value for users. With intuitive interfaces and pre-built Gems, users can quickly configure jobs, reducing development effort and enabling rapid iteration.

Users can develop, test, schedule, and monitor their Airflow Jobs all in one interface. The generated DAG code is managed in the user’s Git, so there is no vendor lock-in, and data professionals are free to use their own CI/CD tools to manage DAGs in production.

Setting Up Airflow with Databricks in Prophecy: A Step-by-Step Guide

Let's walk through the process of setting up Airflow with Databricks in Prophecy. You can integrate Prophecy with any of your managed Airflow instances, such as Amazon Managed Workflows for Apache Airflow (MWAA), GCP’s Cloud Composer, and more. However, if you're new to Airflow and don't have an Airflow instance running in your environment, Prophecy offers Managed Airflow to expedite your trial or proof of concept (POC). This eliminates the hassle of setting up and managing your own Airflow. You can use it to connect to your Spark or SQL execution environment and experiment with scheduling your Spark Pipelines or SQL Models.

By the end of this section, you'll be able to use the visual interface to create your DAG and schedule it on Airflow quickly. Let's dive in!

Step 1: Set Up a Prophecy Fabric for Airflow

Prophecy connects to your Spark, SQL, or Airflow environments through Fabrics. Create a Fabric, which is a logical execution environment, and link it to your Airflow instance. When connecting to Prophecy Managed Airflow, you don't need to provide any authentication details; authentication and authorization are handled automatically.

Step 2: Add Databricks and Other Connections

To enable Airflow to communicate with your Spark, SQL, or other third-party tools, set up connections. In Prophecy Managed Airflow, you can create connections quickly and easily through the UI: simply add a connection to your Fabric and provide the authentication details. To learn more about Prophecy Fabrics, see the documentation.

For Databricks, you can select an existing Databricks Fabric.

Further details for various connection types can be found in our documentation.
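
For readers running their own Airflow, a Databricks connection conceptually boils down to a workspace URL and a personal access token. The sketch below uses Airflow's Connection model with placeholder values; with Prophecy Managed Airflow this is handled through the Fabric UI instead:

```python
from airflow.models import Connection

# Conceptual sketch of a Databricks connection in self-managed Airflow;
# the host and token below are placeholders.
databricks_conn = Connection(
    conn_id="databricks_default",
    conn_type="databricks",
    host="https://<your-workspace>.cloud.databricks.com",
    password="<databricks-personal-access-token>",
)

# get_uri() renders the URI form, which can also be supplied to Airflow
# via an AIRFLOW_CONN_DATABRICKS_DEFAULT environment variable.
print(databricks_conn.get_uri())
```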

Step 3: Create an Airflow Job and Add Your Gems

Now, you can create a new Job and add Gems for different tasks. Drag and drop a Databricks Pipeline Gem to execute a Databricks Pipeline.

Select your Pipeline to run. Choose the Fabric and Job size for execution. You may also override any Pipeline configurations as needed.
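
Under the hood, running a Databricks workload from Airflow typically comes down to a task built on the Databricks provider. Here is a hedged sketch of what such a task can look like when written by hand; the cluster spec and notebook path are placeholders, and this is not Prophecy's exact generated code:

```python
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

# Hand-written equivalent of a "run this Databricks Pipeline" task, for context.
# It would live inside a DAG definition; all values below are placeholders.
run_sales_pipeline = DatabricksSubmitRunOperator(
    task_id="run_sales_pipeline",
    databricks_conn_id="databricks_default",  # the connection from Step 2
    new_cluster={
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    notebook_task={"notebook_path": "/Pipelines/sales_pipeline"},
)
```

With the visual Gem, the same choices (Pipeline, Fabric, Job size, configuration overrides) are made through the UI rather than hand-edited dictionaries.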

Step 4: Run and Debug Your Job

While constructing your Job, you can easily test it by clicking on the play button. 

Once a run completes, the UI displays the outputs and logs of each task, which you can then use in downstream tasks.
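
For context, in plain Airflow a task's return value is passed downstream via XComs, as in this small TaskFlow sketch with made-up names and values:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="output_passing_example",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
)
def output_passing_example():
    @task
    def extract_row_count():
        # The return value is pushed to XCom automatically.
        return 42  # made-up output

    @task
    def report(row_count):
        print(f"Upstream task produced {row_count} rows")

    report(extract_row_count())


output_passing_example()
```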

Step 5: Schedule and Monitor

Once you've developed and tested your Job, enable the Job and commit your changes. You can commit, merge, and release directly using the user interface to schedule your Job on Airflow or use the Prophecy Build Tool to release it with your own CI/CD tools. 

Additionally, you can monitor all scheduled runs through the Observability tab, which offers a view of past and current runs of any Jobs released on a Fabric. Navigate between Attention Required, All Events, and Job Runs to access specific runs, and review the logs of any run to debug issues.

Summary 

While the integration of Apache Airflow with Databricks offers efficiency and reliability in orchestrating data pipelines, adopting Airflow often means navigating a steep learning curve, since it requires a working proficiency in Python.

Powered by a visual interface on top of Airflow, Prophecy offers a streamlined solution for orchestrating complex data workflows. Prophecy removes the challenges of setting up and managing Airflow instances, enabling efficient scheduling, monitoring, and management of data pipelines. With the ability to easily create and schedule Airflow Jobs, construct DAG-based workflows, add custom tasks, and monitor Job runs, Prophecy simplifies data pipeline management and frees data teams to focus on driving insights and innovation. Whether it's executing Databricks Pipelines or integrating with other tools, Prophecy provides a unified platform for end-to-end data pipeline orchestration, enhancing productivity and scalability for organizations of all sizes.

Ready to give Prophecy a try?

You can create a free account and get full access to all features for 21 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.


