Data Engineering Battle: Python vs SQL++ vs Visual=Code

In a Data Engineering Battle session attended by hundreds, Prophecy presented a complete, low-code data engineering product covering low-code Spark, low-code Airflow, and column-level lineage. Here are the poll results.

Raj Bains

July 13, 2021

There is a debate underway about what the right programming model for Data Engineering is. There are three approaches - Python, SQL++, and Visual=Code. Visual=Code is a new approach that Prophecy is working on to address the challenges we are seeing in the field, but there is no consensus yet on the right approach.

In this blog, we'll lay out the essential complexity of operations we see in Data Engineering, and what each approach is best suited for. By the end, you will have a structured framework for deciding which approach is best for your team (you might already start with an implicit understanding of these concepts). The following summarizes the various groups of users and operations that we commonly see with our customers.

Common User Group and Operations in Data Engineering

Essential Complexity of Data Engineering

Data Engineering or ETL has an essential complexity that includes some SQL operations and some non-SQL operations. Here are some of the common operations that form the basics of Data Engineering:

SQL/Relational Operations

SQL operations form the backbone of Data Engineering operations whether you're writing code in SQL, writing DataFrame code in Python, or doing Visual Dataflow programming.

  • Load Operations: These are the operations to load data into a table. Examples include INSERT, UPDATE, MERGE.
  • Common Transforms: These are the most frequently used transforms and form the bulk of data processing. They include SCAN (Read), FILTER, JOIN, SORT, GROUP BY, ORDER BY (see the sketch after this list).
  • More Complex Operations: While common, these are used less often than the common transforms, appearing mostly in analytics and reporting - PIVOT, ROLLUP, CUBE, WINDOW FUNCTIONS.
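
As an illustration, the same relational building blocks can be expressed with the Spark DataFrame API. This is only a sketch; the `spark` session, table names, and column names are assumed for the example:

```scala
import org.apache.spark.sql.functions._

// The common relational transforms written against the Spark DataFrame API;
// `spark` is the active session, table and column names are illustrative.
val orders    = spark.table("orders")                        // SCAN (Read)
val customers = spark.table("customers")

val revenueByRegion = orders
  .filter(col("status") === "shipped")                       // FILTER
  .join(customers, Seq("customer_id"))                       // JOIN
  .groupBy(col("region"))                                    // GROUP BY
  .agg(sum("amount").as("total_revenue"))
  .orderBy(col("total_revenue").desc)                        // ORDER BY
```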

SQL is a good solution that everyone can use - but there are many operations that are common in data engineering and are not covered by pure SQL. Also, as complexity increases, SQL becomes harder and harder to understand and maintain.

Everyone is enabled with SQL, but only a subset of data engineering operations are covered

Complex SQL

SQL starts to get complex quite fast. There are CTAS, Table Functions, Correlated Subqueries - but let's start with an operation that is quite common - a standard SCD2 merge:

An SCD2 merge is a slowly changing dimension merge: the operational database has a field such as an address that changes infrequently, so in the analytical database you keep a history of the various addresses, with dates (from-date and to-date) capturing the period when each entry was active, along with flags to mark the first and last row in a chain. The same pattern applies to analytics on how long a home delivery order spent in the ordered or en-route state.

Data Before - with effective dates and flags

New data - no effective dates or flags

Merged data - with updates to new and existing rows

Following is the example code for it. This is clearly SQL that should never be hand-written. The example uses the DataFrame API but could equally be written as a SQL string. It shows a case where SQL is too low an abstraction.

Spark Scala code showing SCD2 Merge Operation
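
For reference, here is a minimal sketch of what such a merge can look like when written by hand against a Delta table. It is an illustration only: the `spark` session, the incoming `updates` DataFrame, the `dim_customer` table, and all column names are assumptions, and a generated version would handle more edge cases.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

// Hypothetical dimension table dim_customer with columns:
// customer_id (business key), address (tracked attribute),
// effective_date / end_date (validity window), is_current (open-row flag).
val dim = DeltaTable.forName(spark, "dim_customer")

// Incoming rows whose tracked attribute changed versus the current dimension row:
// these need both an INSERT (new version) and an UPDATE (close out the old version).
val changed = updates.as("u")
  .join(dim.toDF.filter(col("is_current")).as("d"), expr("u.customer_id = d.customer_id"))
  .where(expr("u.address <> d.address"))
  .select("u.*")

// Stage two copies of each changed row: a NULL merge key forces an INSERT of the new
// version, while the keyed copy matches and closes the old version.
val staged = updates.withColumn("merge_key", col("customer_id"))
  .unionByName(changed.withColumn("merge_key",
    lit(null).cast(updates.schema("customer_id").dataType)))

dim.as("d")
  .merge(staged.as("s"), "d.customer_id = s.merge_key AND d.is_current = true")
  .whenMatched("d.address <> s.address")
  .updateExpr(Map("is_current" -> "false", "end_date" -> "s.effective_date"))
  .whenNotMatched()
  .insertExpr(Map(
    "customer_id"    -> "s.customer_id",
    "address"        -> "s.address",
    "effective_date" -> "s.effective_date",
    "end_date"       -> "null",
    "is_current"     -> "true"))
  .execute()
```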

While the consensus view is that these operations should be generated, there are multiple ways to generate them - code generators, macros, and functions. The SQL++ approach of DBT gives some basic constructs (macros) to handle these operations (date spine, snapshots for SCD2). DBT also brings software engineering practices to SQL and is appreciated by users for this.

SQL++ with DBT expresses more than basic SQL

Programming Language Constructs

Now, there are many operations in Data Engineering for which SQL is not the right abstraction and you must use a programming language instead. There are a few use cases here; our customers need to perform operations both per row and across rows. Here are some example operations:

  • Data Quality Library - including computing statistics every day and comparing changes in patterns across days.
  • Lookup from a REST service (too expensive per row, so done per partition); also, looking up a set of values and looping through them to find the right one (see the sketch after this list).
  • Encryption and decryption of certain columns with sensitive data.
  • Writing to Elasticsearch, writing to Athena.
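
For example, the per-partition REST lookup above might be sketched in Spark Scala like this. The `ordersDF` input, the internal `region-lookup` service, and the column names are all hypothetical:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import spark.implicits._

case class Order(orderId: String, zipCode: String)
case class EnrichedOrder(orderId: String, zipCode: String, region: String)

// Enrich each order via a REST lookup, building the HTTP client once per
// partition rather than once per row to keep the lookup affordable.
val enriched = ordersDF.as[Order].mapPartitions { orders =>
  val client = HttpClient.newHttpClient()            // one client per partition
  orders.map { o =>
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"https://region-lookup.internal/region?zip=${o.zipCode}"))
      .build()
    val region = client.send(request, HttpResponse.BodyHandlers.ofString()).body()
    EnrichedOrder(o.orderId, o.zipCode, region)
  }
}
```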

SQL has always accepted that it is not the right paradigm for these operations and provides numerous mechanisms to call non-SQL code: User Defined Functions, User Defined Aggregate Functions, and Table Functions. Together these cover the full spectrum of use cases, from the most granular scope of calling outside code per row, to passing the entire table out to code and accepting a new table back in.
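
As an illustration, a per-row escape from SQL into Scala might look like the following sketch: a hypothetical `mask_pii` UDF is registered and then called from plain SQL. The hashing logic and the table and column names are assumptions; a real pipeline would use proper key-managed encryption rather than a simple hash.

```scala
import java.security.MessageDigest

// Hypothetical per-row masking of a sensitive column via SHA-256.
val maskValue = (value: String) =>
  if (value == null) null
  else MessageDigest.getInstance("SHA-256")
    .digest(value.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// Register the Scala function so that plain SQL can call the non-SQL logic.
spark.udf.register("mask_pii", maskValue)

spark.sql("SELECT customer_id, mask_pii(email) AS email_masked FROM raw_customers")
```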

Writing code in Python can capture these use cases, but only a small subset of users in an organization can produce high-quality, standardized code, and productivity is always low.

Python approach to Data Engineering (same for Scala, Java)

Common Pattern Templates

Templates can encode common series of patterns, standardizing practices across various parts of the ecosystem. We've seen standard ingestion templates for pipelines from multiple similar source systems that include best practices such as auditing that the correct number of rows was output - a requirement in financial environments.
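
As an illustration, such a template might be captured as a reusable function like the sketch below; the function name, paths, table names, and audit threshold are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical ingestion template: every similar source system gets the same
// read + audit + write sequence, so the row-count audit is encoded once and reused.
def ingestWithAudit(spark: SparkSession, sourcePath: String, targetTable: String,
                    expectedMinRows: Long): DataFrame = {
  val df = spark.read.parquet(sourcePath)
  val rowCount = df.count()
  require(rowCount >= expectedMinRows,
    s"Audit failed for $sourcePath: expected at least $expectedMinRows rows, found $rowCount")
  df.write.mode("append").saveAsTable(targetTable)
  df
}

// Usage: ingestWithAudit(spark, "s3://landing/orders/2021-07-13", "bronze.orders", 1000L)
```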

Enabling All Users with All Operations!

As you can see with the previous approaches - either many users are left out, or many use cases are, greatly limiting what can be achieved.

At Prophecy we have been thinking from scratch about what the best approach might be to handle all of the data engineering operations and enable all users at the same time. Here is our approach:

Use Gems: Visual with SQL Expressions

All users must be able to use all kinds of transforms and to build any kind of data engineering workflow, so we've created an interface where all the usage is in SQL - but your operations generate a mix of SQL and non-SQL code depending on the operation.

Build Gems: Code Templates with UI

In the team, you can have a few Gem Builders (or you can ask Prophecy to build gems for you). A Gem Builder writes the code to be generated for a particular operation as sample code, and specifies what information the user of the gem should fill out. As your users develop with gems, high-quality code is generated on Git. Here is a quick preview of Gem Builder:

Visual=Code: Putting it all together

Now, when you put these two personas together - the Gem Builders and the Gem Users - you have your entire team enabled to perform all the operations you need. All users can build these data pipelines, and everyone is developing high-quality code on Git.

Summary

There are multiple approaches to Data Engineering, and as different startups look at the problem, they're coming up with the approaches they think are best suited to solve it, toiling away to make the lives of Data Engineers better.

We have shared here the framework that we used to figure out the best approach to enable the most users with all the common elements we find in Data Engineering. We look forward to tremendous innovation over the next 3-5 years to make Data Engineering more accessible and to reduce the toil required for it.

Try it?

Prophecy is available as a SaaS product where you can add your Databricks credentials and start using it with Databricks. You can use an Enterprise Trial with Prophecy's Databricks account for a couple of weeks to kick the tires with examples. Or you can do a POC where we will install Prophecy in your network (VPC or on-prem) on Kubernetes. Sign up for your account now:

Sign up for your free Account!
