Cloud data engineering on Spark — factors for transitioning to the stack

For enterprises, navigating the cloud transition can be tricky. This post describes the two paradigms for ETL in the cloud and the factors to consider when choosing one.

Raj Bains
November 8, 2022

Data Engineering based on Spark as the execution layer provides high performance and scale on commodity compute. If you use Databricks, Delta Lake adds the transactional guarantees of data warehouses on top of the data lake, making it, in our view, the best product in the cloud by a large margin.

The data from on-premises operational systems lands in the data lake, as does the data from streaming sources and other cloud services. Prophecy with Spark runs the data engineering or ETL workflows, writing data into the data warehouse or data lake for consumption.

Cloud Data Engineering Architecture
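
To make this write path concrete, here is a minimal PySpark sketch, assuming a Delta Lake-enabled Spark environment such as Databricks: raw files land in cloud object storage, a Spark job transforms them, and the result is written as a transactional Delta table for downstream consumption. The paths, columns and table layout are illustrative assumptions, not a prescription.

```python
# Minimal sketch of the architecture above: raw files land in object storage,
# Spark transforms them, and the result becomes a transactional Delta table.
# All paths and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("landing-to-delta").getOrCreate()

# Raw data landed by an ingestion tool (CDC, streaming, or batch copy).
raw = spark.read.json("s3://corp-data-lake/landing/orders/")

# A simple transformation step: standardize types and derive a partition column.
orders = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
)

# Write as Delta so downstream consumers get ACID guarantees.
(orders.write.format("delta")
       .mode("append")
       .partitionBy("order_date")
       .save("s3://corp-data-lake/curated/orders/"))
```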

Considerations for the Move to the Cloud

If you're moving your ETL to Data Engineering, you're deciding what your architecture will be for the next decade or more. It is important to think through the following issues:

  • How will my Existing Data get there?
    Moving historical data is quite simple; however, you also need to move your on-premises transactional data continuously into the cloud data lake with an SLA (of, say, 4 hours).

    The solution here is straightforward, with multiple products in the market. Databricks has multiple partners in this space, Delta Lake is transactional, and this is a well-solved problem (a minimal incremental-load sketch appears after this list).
  • How will my Existing Workflows get there?
    A large Enterprise typically has thousands of workflows that run daily, developed over decades in proprietary formats. Net new workflow development in the cloud helps build expertise in the new environment and is a good place to start, followed by moving the existing workflows.

    The solution is to either manually rewrite these workflows or use Prophecy - we have developed Transpilers (source-to-source compilers) for AbInitio, Informatica and other proprietary formats that can rewrite the workflows into native Spark code with 90%-plus automation. Using Prophecy makes your move an order of magnitude faster and an order of magnitude cheaper.
  • How will we Productively Develop in the cloud?
    Your development process and tooling in the cloud will be different. The current ETL tools don't quite work if you want to use Spark rather than their proprietary processing engine - the development experience is bad, the code quality is bad, and if you don't like the code they generate - tough. If you need to write custom code, say goodbye to lineage - you're out of their boxed solution.

    There are two bad options - notebooks and code IDEs. The solution here is to use Prophecy. We've thought deeply about this and created a unique development environment where you can develop in visual drag-and-drop mode or in a code editor (based on Monaco - the Visual Studio Code editor), and switch between the two instantaneously.
  • How will you do Cluster Orchestration?
    When first looking at the cloud, you see that every cloud provides one or more options to spin up a managed Spark cluster using an API. What's less clear is that a full orchestration layer is required on top of that. On premises, you have a persistent data warehouse or big data cluster. In the cloud, trying to keep a large persistent cluster and working hard to manage multi-tenancy is not very smart - though your on-premises administration experts might suggest otherwise.

    So, most teams will move to ephemeral clusters that are spun up for development or for executing a workflow and then spun down, with datasets residing in cloud object stores (such as S3 or Azure Blob Storage). Now you need to manage this orchestration yourself with roll-your-own scripts, including error handling.

    The solution here is to have your development tools and your schedulers integrate well with the clusters. On Azure, for example, Data Factory has good API integration with Databricks. At Prophecy, we're building not just an API integration, but also integration of performance data, logs, metadata and, more importantly, multiple virtual environments (a sketch of submitting a workflow to an ephemeral cluster appears after this list). Let's look at these next.
  • How do you manage your Virtual Environments?
    Once you've moved to ephemeral clusters, you need multiple virtual environments - development, integration and production.

    In each of these, you will need files in the object store and databases and tables in the (Hive) Metastore that are kept separate per environment. When you're in the development environment, the clusters that are spun up need to show you only the tables in that environment - so all the ephemeral clusters need to share the Metastore of their environment, without you having to wire this up with scripts.

    Also, when you develop your workflows, you'll need a good configuration system, so that your workflows run in your different environments where datasets might be in different forms and locations, and so that certain steps, such as loading into production systems, are disabled outside production. Legacy visual ETL tools have done a fairly good job here.

    Without product support, the solution here is to roll your own using custom scripts. At Prophecy, we ran into this ourselves, so we have built the concept of Fabrics into our product, where configuration, code development and cluster orchestration all have virtual environments built in. Our Fabrics can connect to existing Metastores or provide you with one (a generic sketch of per-environment configuration appears after this list).
  • How about Metadata & Lineage?
    The legacy visual ETL tools provided good solutions here. Informatica is known to have a good metadata system as the core strength of their product.

    However, with the move to big data clusters, there has been a regression here. Business users are trying to make a decision based on a value, and they want to understand which systems modified it and how it was computed - the formula. A lot of web companies have rolled their own lineage systems - however, none of them operates at column level, and they are developed for their specific environments.

    Also, for handling production failures, when you see an incorrect value you want to see which workflow last modified that column. This is likely not the last workflow that wrote the dataset - if the dataset has 1,100 columns, the last workflow may have modified only 20 of them. Now you want to traverse back, following a single column, to quickly locate the failure - and you can't. Your L2 support is rendered useless, and developer teams end up in meetings chasing down the error by digging through their Git repositories.

    There is no good solution in the market here, and though it was not originally an area of focus for us, we ended up building a system for column-level lineage that computes this lineage by crawling your Git repositories with built-in language parsers.

    There do exist metadata solutions on the consumption side, such as Alation and Collibra - but they have mastered the analyst-to-data-warehouse interaction and have no presence on the production side, so they don't understand how a value was computed at all.
  • How about Agility - Continuous Integration & Continuous Deployment?
    One of the big promises of moving to code-based Data Engineering instead of visual ETL is agility - moving workflows to production faster and providing business value sooner, allowing your Enterprise to move fast like technology companies, so that technology companies don't eat your lunch.

    No one is able to do this. First, continuous integration requires unit tests, and if you're writing Spark code (or SQL), it's not obvious what your units to test are. In practice, we've seen a goal of 70% coverage - 10 workflows out of 5,000 reach it, and the rest are under 2% coverage.

    For continuous deployment, there are two core things that allow you to deploy fast - the first is having enough information to make a push-to-deploy decision, and second is being able to recover from failure.

    There is no good solution in the market here - no one except Google and a handful of other companies is able to do this, and you are definitely not going to be able to roll your own. At Prophecy, we've built a good solution for continuous integration: our code structure uses standard components (and you can define your own) that are the units to test (a sketch of such a component-level test appears after this list). The development environment makes developing and running tests a breeze, and it integrates well with CI tools. For continuous deployment, we're working on blue-green deployments, parallel runs and analysis of downstream impact. These require the configuration and column-level lineage we already have, so this is only approachable from within your development IDE.
  • How about Scheduling?
    Most Enterprises are coming from legacy scheduling products such as ControlM, Tivoli or Tidal. Now when you move to the cloud, you'll move to a new scheduler.

    Some companies have adopted Airflow, which is widely supported but has a steep learning curve and, just like the legacy Hadoop stack, is painful to set up and use, with a UI from the nineties to match. You can use it, as long as you use it for scheduling only and move the entire workflow out to Spark. Seeing the clear gap in the market, multiple new products have come up - but there is no clear winner right now, and none that we think is significantly superior.

    At Prophecy, we think schedulers need to be declarative, and the pattern for rollbacks needs to be a first-class citizen. For now, Airflow is the only reasonable option, and we've built it into our product so you'll be able to use it (a minimal DAG sketch appears below).
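
Below are a few hedged sketches of the patterns discussed above; every path, name and parameter in them is an illustrative assumption rather than a prescription. First, for continuously landing on-premises transactional data within an SLA, a common Delta Lake pattern is an incremental MERGE (upsert) of the latest extracted batch into the curated table:

```python
# Sketch of an incremental upsert into a Delta table (the "existing data"
# consideration). Assumes Delta Lake is available on the cluster; the paths
# and the join key ("customer_id") are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Latest batch of changes extracted from the on-premises system.
changes = spark.read.parquet("s3://corp-data-lake/landing/customer_changes/")

target = DeltaTable.forPath(spark, "s3://corp-data-lake/curated/customers/")

# MERGE keeps the curated table in sync: update matched rows, insert new ones.
(target.alias("t")
       .merge(changes.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```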
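
For cluster orchestration, one way to run a workflow on an ephemeral cluster is the Databricks Jobs runs/submit endpoint: a job cluster is created for the run and torn down when it finishes. This sketch assumes the Jobs API 2.1 request shape; the workspace URL, token, node type and script path are placeholders, and polling and error handling are reduced to the essentials:

```python
# Submit a one-time run on an ephemeral job cluster via the Databricks Jobs
# API (runs/submit), then poll until it reaches a terminal state.
# Workspace URL, token, node type and script path are placeholders.
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <databricks-access-token>"}

payload = {
    "run_name": "nightly-orders-workflow",
    "tasks": [{
        "task_key": "orders",
        "new_cluster": {                 # ephemeral: exists only for this run
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        "spark_python_task": {"python_file": "dbfs:/pipelines/orders_workflow.py"},
    }],
}

run = requests.post(f"{HOST}/api/2.1/jobs/runs/submit",
                    headers=HEADERS, json=payload).json()

# Poll until the run terminates, then fail loudly if it did not succeed.
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                         headers=HEADERS,
                         params={"run_id": run["run_id"]}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(60)

if state.get("result_state") != "SUCCESS":
    raise RuntimeError(f"Workflow run failed: {state}")
```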
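
For virtual environments, the underlying pattern - whether hand-rolled or provided by a product feature such as Prophecy's Fabrics - is that the same workflow code resolves paths, databases and feature flags from a per-environment configuration. A generic sketch of that pattern, with hypothetical names:

```python
# Generic per-environment configuration sketch: dev, integration and prod use
# separate object-store prefixes and Metastore databases, and production-only
# steps are gated by a flag. Names here are hypothetical.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class EnvConfig:
    name: str
    lake_root: str       # object-store prefix for this environment
    metastore_db: str    # (Hive) Metastore database the clusters point at
    load_to_prod: bool   # is the final load into production systems enabled?

ENVIRONMENTS = {
    "dev":  EnvConfig("dev",  "s3://corp-lake-dev",  "analytics_dev", False),
    "int":  EnvConfig("int",  "s3://corp-lake-int",  "analytics_int", False),
    "prod": EnvConfig("prod", "s3://corp-lake-prod", "analytics",     True),
}

def current_env() -> EnvConfig:
    # The environment name is injected by the orchestrator, e.g. as an env var.
    return ENVIRONMENTS[os.environ.get("PIPELINE_ENV", "dev")]

# Usage inside a workflow:
#   cfg = current_env()
#   orders = spark.read.format("delta").load(f"{cfg.lake_root}/curated/orders")
#   if cfg.load_to_prod:
#       load_into_operational_system(orders)   # hypothetical production-only step
```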
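
For continuous integration, the question of what your units to test are has a practical answer when workflows are built from small, named transformation functions: each one can be tested against a local SparkSession in CI. A sketch, with a hypothetical add_order_total component:

```python
# pytest-style unit test for one workflow component against a local Spark
# session. The component under test (add_order_total) is a hypothetical example.
import pytest
from pyspark.sql import SparkSession, functions as F

def add_order_total(df):
    """A small, reusable workflow component: derive a line total per row."""
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))

@pytest.fixture(scope="session")
def spark():
    return (SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate())

def test_add_order_total(spark):
    df = spark.createDataFrame(
        [(1, 2, 10.0), (2, 3, 5.0)],
        ["order_id", "quantity", "unit_price"],
    )
    totals = {row["order_id"]: row["total"] for row in add_order_total(df).collect()}
    assert totals == {1: 20.0, 2: 15.0}
```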
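
Finally, for scheduling, the division of labor suggested above is that Airflow only triggers the run while the workflow itself executes on an ephemeral Spark cluster. A minimal DAG sketch, assuming the apache-airflow-providers-databricks package and a configured databricks_default connection; the job details are illustrative:

```python
# Minimal Airflow DAG that only triggers the workflow; the work itself runs on
# an ephemeral Databricks job cluster. Cluster settings and the script path are
# illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="nightly_orders_workflow",
    start_date=datetime(2022, 11, 1),
    schedule_interval="0 2 * * *",   # run daily at 2 AM
    catchup=False,
) as dag:
    run_orders = DatabricksSubmitRunOperator(
        task_id="run_orders_workflow",
        databricks_conn_id="databricks_default",
        json={
            "run_name": "nightly-orders-workflow",
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
            "spark_python_task": {
                "python_file": "dbfs:/pipelines/orders_workflow.py"
            },
        },
    )
```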

Summary

As you move to the cloud, you are making a decision about which technologies to use in the cloud and how you will get there.

We covered the two ETL paradigms - Data Warehouse based (e.g. Snowflake & Matillion) or Data Engineering based (e.g. Databricks & Prophecy) and why we think Data Engineering based solutions are the better approach.

We did a deep dive into what a Data Engineering based solution looks like and discussed the various issues in the cloud transition, including how to migrate to the cloud and how to be successful once there. We covered what to consider when migrating: moving existing assets (datasets and workflows), adopting a productive development environment in the cloud, orchestrating ephemeral clusters, managing virtual environments, metadata and lineage, agility with continuous integration and continuous deployment, and scheduling.

We left out security - a critical piece that has many aspects to it and deserves its own post.

We work with large Enterprises on the cloud transition, providing Prophecy Migrate to automatically re-write workflows to Spark and Prophecy Data Engineering IDE to succeed in this new environment. We look forward to discussing these issues and any others that we might have missed in the post.
