Airflow and Databricks: Shorten the Learning Curve with Prophecy
Maximize productivity without the complexity.
In today's dynamic data processing landscape, orchestrating complex workflows efficiently is critical. Solutions such as Databricks Workflows are great for scheduling jobs within Databricks itself, but their features are optimized for that platform. Users often need additional capabilities, especially when they are moving away from legacy tools like Alteryx and integrating with other modern cloud data solutions. In these cases, Apache Airflow is a natural fit. A powerful open-source platform, Airflow lets organizations of all sizes schedule, monitor, and manage intricate data workflows, including pipelines, ETL processes, and task automation, across a variety of data architectures.
The challenge with mastering Airflow
Mastering Airflow can be challenging because of its Python-centric architecture, which revolves around concepts like Directed Acyclic Graphs (DAGs), Operators, and Executors. The learning curve is steep, especially for data users who are required to write DAGs in Python. Airflow is highly extensible: you can write custom operators to tailor workflows to specific use cases. However, writing them demands advanced Python programming skills and a deep understanding of Airflow's internal workings.
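To give a sense of what writing a DAG by hand involves, here is a minimal sketch of a two-task DAG that triggers existing Databricks jobs. It assumes Airflow 2.4 or later with the `apache-airflow-providers-databricks` package installed. The DAG name, the job IDs (101 and 102), and the helper `run_job_kwargs` are hypothetical and exist only for illustration; this is not the code Prophecy generates.

```python
from datetime import datetime


def run_job_kwargs(job_id, conn_id="databricks_default"):
    """Keyword arguments for one task that triggers an existing Databricks job."""
    return {
        "task_id": f"run_databricks_job_{job_id}",
        "databricks_conn_id": conn_id,  # Airflow connection holding host and token
        "job_id": job_id,
    }


def build_dag():
    """Build the DAG. Airflow is imported here so this file loads without it."""
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import (
        DatabricksRunNowOperator,
    )

    with DAG(
        dag_id="example_databricks_pipeline",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        # Hypothetical job IDs; replace with jobs that exist in your workspace.
        ingest = DatabricksRunNowOperator(**run_job_kwargs(101))
        report = DatabricksRunNowOperator(**run_job_kwargs(102))
        ingest >> report  # report is downstream of ingest
    return dag


# In a real DAG file, assign the result at module level so the scheduler
# can discover it:
# dag = build_dag()
```

Even this small example touches several Airflow concepts at once: the DAG context manager, operators, connections, scheduling arguments, and the `>>` dependency syntax.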
Streamlining Workflow Development
Prophecy’s Airflow visual canvas greatly simplifies the process of developing Airflow DAGs (which are collections of tasks you plan to run) and enhances productivity for data engineers and analysts. The UI provides a one-stop, end-to-end solution for orchestrating Airflow Jobs through a drag-and-drop visual interface that is easy for all data users to master. Behind the scenes, Prophecy translates these visual Jobs into highly optimized Python Airflow code stored in Git, which eliminates vendor lock-in and keeps the code open and accessible to all users. Moreover, users can extend functionality by adding their own custom operators and sensors through Prophecy’s Gem Builder interface.
What sets Prophecy apart is its visual approach, designed to minimize the learning curve and accelerate time-to-value for users. With intuitive interfaces and pre-built Gems, users can quickly configure jobs, reducing development effort and enabling rapid iteration.
Users can develop, test, schedule, and monitor their Airflow Jobs all in one interface. The generated DAG code is managed in the user’s own Git repository, so there is no vendor lock-in, and data professionals are free to use their own CI/CD tools to manage DAGs in production.
Setting Up Airflow with Databricks in Prophecy: A Step-by-Step Guide
Let's walk through the process of setting up Airflow with Databricks in Prophecy. You can integrate Prophecy with any of your managed Airflow instances, such as Amazon Managed Workflows for Apache Airflow (MWAA), GCP’s Cloud Composer, and more. However, if you're new to Airflow and don't have an Airflow instance running in your environment, Prophecy offers a Managed Airflow to expedite your trial and Proof of Concept (POC). This eliminates the hassle of setting up and managing your own Airflow. You can use it to connect to your Spark or SQL execution environment and experiment with scheduling your Spark Pipelines or SQL Models.
By the end of this section, you'll be able to utilize our visual interface to create your DAG and schedule it on Airflow swiftly. Let's delve into it!
Step 1: Set Up a Prophecy Fabric for Airflow
Prophecy connects to your Spark, SQL, or Airflow environments through Fabrics. Create a Fabric, which is a logical execution environment, and link it to your Airflow Instance. When connecting to Prophecy Managed Airflow, you don't need to provide any authentication details as authentication and authorization are automatically handled.
Step 2: Adding Databricks and Other Connections
To enable Airflow to communicate with your Spark, SQL, or any other third-party tools, set up connections. In Prophecy Managed Airflow, you can create connections directly through our UI. Simply add a connection to your Fabric and provide the authentication details. To learn more about Prophecy Fabrics, see our documentation.
For Databricks, you can select existing Databricks Fabrics.
Further details for various connection types can be found in our documentation.
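For comparison, if you run your own Airflow instead of Prophecy Managed Airflow, one common way to define a connection outside any UI is an `AIRFLOW_CONN_<CONN_ID>` environment variable holding a JSON document (supported in Airflow 2.3 and later). The sketch below only builds that name and value. The helper `databricks_conn_env`, the workspace URL, and the `DATABRICKS_TOKEN` variable are assumptions made for illustration, and storing a personal access token in the `password` field is one of the options the Databricks provider accepts.

```python
import json
import os


def databricks_conn_env(conn_id, host, token):
    """Return the (variable name, JSON value) pair for an Airflow connection."""
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    value = json.dumps(
        {
            "conn_type": "databricks",
            "host": host,
            "password": token,  # personal access token (assumed auth method)
        }
    )
    return name, value


name, value = databricks_conn_env(
    "databricks_default",
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    os.environ.get("DATABRICKS_TOKEN", "<your-token>"),  # never hard-code secrets
)
print(name)
```

Exporting that variable in the Airflow scheduler and worker environments makes the connection available under the ID `databricks_default`.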
Step 3: Create an Airflow Job and Add Your Gems
Now, you can create a new Job and add Gems for different tasks. Drag and drop a Databricks Pipeline Gem to execute a Databricks Pipeline.
Select your Pipeline to run. Choose the Fabric and Job size for execution. You may also override any Pipeline configurations as needed.
Step 4: Run and Debug Your Job
While constructing your Job, you can easily test it by clicking on the play button.
After the run, the UI displays the outputs and logs of each task, so you can use those outputs in downstream tasks.
Step 5: Schedule and Monitor
Once you've developed and tested your Job, enable the Job and commit your changes. You can commit, merge, and release directly using the user interface to schedule your Job on Airflow or use the Prophecy Build Tool to release it with your own CI/CD tools.
Additionally, you can monitor all scheduled runs through the Observability tab. This offers a view of past and current runs of any Jobs released on a Fabric. Navigate between Attention Required, All Events, and Job Runs to access specific runs. You can also review the logs of any run. The logs will help you to debug any issues.
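If you also want to check run status from a script, Airflow 2.x exposes DAG runs through its own stable REST API (`GET /api/v1/dags/{dag_id}/dagRuns`), which is separate from Prophecy's Observability tab. The sketch below filters a response payload for runs that need attention. The function name `runs_needing_attention` and the sample payload are illustrative assumptions, and the HTTP call itself is left out.

```python
def runs_needing_attention(payload, bad_states=("failed",)):
    """Return the IDs of DAG runs whose state is in bad_states."""
    return [
        run["dag_run_id"]
        for run in payload.get("dag_runs", [])
        if run.get("state") in bad_states
    ]


# Illustrative payload in the shape returned by the dagRuns endpoint.
sample = {
    "dag_runs": [
        {"dag_run_id": "scheduled__2024-01-01T00:00:00+00:00", "state": "success"},
        {"dag_run_id": "scheduled__2024-01-02T00:00:00+00:00", "state": "failed"},
        {"dag_run_id": "manual__2024-01-03T08:15:00+00:00", "state": "running"},
    ],
    "total_entries": 3,
}

print(runs_needing_attention(sample))
```

Passing `bad_states=("failed", "queued")` would also surface runs that have not started yet.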
Summary
While integrating Apache Airflow with Databricks offers efficiency and reliability in orchestrating data pipelines, adopting Airflow often means navigating a steep learning curve, because it requires proficiency in Python.
Powered by a visual interface on top of Airflow, Prophecy offers a streamlined solution for orchestrating complex data workflows. Prophecy Airflow empowers users to overcome the challenges of setting up and managing Airflow instances, enabling efficient scheduling, monitoring, and management of data pipelines. With the ability to easily create and schedule Airflow Jobs, construct DAG-based workflows, add custom tasks, and monitor job runs, Prophecy simplifies the process of data pipeline management, empowering data teams to focus on driving insights and innovation. Whether it's executing Databricks Pipelines or integrating with other tools, Prophecy provides a unified platform for end-to-end data pipeline orchestration, enhancing productivity and scalability for organizations of all sizes.
Ready to give Prophecy a try?
You can create a free account and get full access to all features for 21 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.