Ab Initio to Spark: Modernize your ETL and lower costs

Prophecy also provides a highly automated replacement for legacy ETL products to accelerate the journey to open source and cloud computing.

Raj Bains
July 12, 2022

Enterprises are focused on becoming agile and cost efficient in the face of an increasingly competitive business environment. Businesses are looking to their CIOs and CTOs to lead this transformation. At the top of the agenda is a move to the cloud, and the adoption of new technologies and processes that help deliver value to the business faster. Lowering costs is equally important, as it frees up resources and reallocates them from keeping the lights on to strategic initiatives.

In this blog, we'll share how Prophecy makes it possible to rapidly modernize data engineering, and how this modernization comes with immense cost savings: businesses using AbInitio can save millions of dollars every year by moving to Prophecy and Spark.

Why Move to Spark?

Move to the cloud, Open Source

The current ETL tools have been in use for at least a couple of decades. As Enterprises move to the cloud, they're picking the stack for the next couple of decades, and legacy Enterprise products such as ETL tools and data warehouses are being left behind by this architectural change. For customers who have chosen to stay on premises for now, there is often a move to open source systems.

Apache Spark is an excellent choice for most of our customers' priorities. Here's why:

  • Spark works at any scale
  • Spark has a SQL layer for productivity
    > SQL has been used heavily for ETL in data warehouses such as Teradata. It is always needed for the final step of loading into the data warehouse, merging new data with historical data.
  • Spark SQL processing is more cost efficient than SQL processing on data warehouses, because warehouses rely on heavy indexing and caching that is wasted on ETL, where data passes through only once rather than being reused as it is in business intelligence.
  • Spark has a DataFrame API that lets you use language-integrated APIs - this becomes very important for ETL, which requires configuration and tests, both of which are hard with SQL alone (see the sketch after this list).
  • Spark has a powerful general purpose distributed processing engine
    > If you need to do transforms that don't fit into SQL, you can step down to a lower level and use the RDD API to create non-SQL functions that are preferably standardized for the team.
  • Spark provides Machine Learning - locking yourself into a technology that does not provide ML is a bad bet.
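
To make the SQL and DataFrame points concrete, here is a minimal PySpark sketch (the paths, table and column names are hypothetical) expressing the same ETL-style aggregation through the SQL layer and through the DataFrame API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Hypothetical input: daily transactions landed as Parquet files.
txns = spark.read.parquet("/landing/transactions")

# SQL layer: familiar to warehouse ETL developers.
txns.createOrReplaceTempView("transactions")
daily_sql = spark.sql("""
    SELECT customer_id, to_date(event_ts) AS event_date, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY customer_id, to_date(event_ts)
""")

# DataFrame API: the same logic as language-integrated code, which is easier
# to parameterize (configurations) and unit test than SQL strings.
daily_df = (
    txns.groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
        .agg(F.sum("amount").alias("total_amount"))
)

daily_df.write.mode("overwrite").parquet("/curated/daily_totals")
```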

On Premises

While many Enterprises are moving to the public cloud, some are choosing to stay on premises. On premises, there is a move to containerized deployments based on Kubernetes, shifting from the pet model (specific and known machines) to the cattle model (N interchangeable containers). These points apply equally to such users, and we have customers who are on premises, moving workloads to a Spark on Hadoop or Spark on Kubernetes deployment.

Cost Reduction by Millions

A top factor motivating a move from AbInitio to Spark is often that AbInitio shows up as a very large item on the current budget, with projected data and workload increases making it unsustainable for many organizations. While we are never privy to the exact numbers, it is not uncommon for us to hear of bills in the range of $5M a year for smaller Fortune 500 workloads and $10-$15M for the larger ones - say, Fortune 50. We typically find that the cost can be reduced by around 70% by moving to Spark, and the savings pay for the migration rather quickly - the move can even be self-financing. See the financial planning section below for more details.

Migration: Factors to Consider

A migration to Spark raises two primary questions:

  • How do I get from my legacy systems to Spark?
  • What do I need to succeed on Spark with my team, resources and constraints?

Defining Success on Spark

For migration, we first need a clear view of the end state we're targeting. For most companies, that end state needs to include the following:

Modernizing ETL to Open and Agile Data Engineering

The modernization piece is sometimes under-appreciated, so we'll start with it. Many modern Silicon Valley companies have processes that allow them to rapidly move changes from development to production. These agile processes, however imperfectly implemented, have been an improvement in many Enterprises as well, primarily in application development. The same practices need to become widespread in data processing: the alternative - not getting new analytics to decision makers faster - is a competitive disadvantage that makes your organization a technology laggard.

Moving ETL workflows rapidly from development to production requires standard software engineering practices:

  • Workflows should be stored as code on Git using 100% open source APIs
  • Workflows should have tests - unit tests, integration tests and data quality tests - stored on Git. This ensures that on every change you can run the tests and have high confidence that the workflows will not have issues (see the sketch after this list for what a component-level test can look like)
  • Continuous Integration should run the tests on every commit to keep quality consistently high
  • Deployment should be automated, so that no manual errors creep in once the responsible person approves a promotion.
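
To illustrate the testing practice, here is a minimal sketch of a component-level unit test with pytest (the function, columns and data are hypothetical) - the kind of test a CI job could run on every commit:

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical business-logic component, stored as plain code on Git.
def dedupe_customers(df):
    """Keep only the latest record per customer_id."""
    w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    return (df.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

@pytest.fixture(scope="module")
def spark():
    # A small local session is enough for unit tests in CI.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_dedupe_keeps_latest_record(spark):
    df = spark.createDataFrame(
        [("c1", "2023-01-01"), ("c1", "2023-02-01"), ("c2", "2023-01-15")],
        ["customer_id", "updated_at"],
    )
    result = {r["customer_id"]: r["updated_at"] for r in dedupe_customers(df).collect()}
    assert result == {"c1": "2023-02-01", "c2": "2023-01-15"}
```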

Prophecy is built to support these practices on Apache Spark.

Succeeding with existing skillsets

As soon as your team of ETL developers moves from AbInitio to Spark, they will have to become adept at the new technology. It does not help that apart from being new, Spark is only a processing engine and is missing a lot of features present in AbInitio. This can land you in a mess.

We commonly see new teams struggle to build best practices on Spark as they’re moving thousands of workflows over, examples include:

  • Many visual ETL users will not know how to write Spark code. Even those who can write code will not immediately produce good Spark code, since there is a learning curve
  • Every team and user will structure workflows differently
  • Many workflows moved over quickly will not perform well

Prophecy provides ease-of-use features that enable your ETL developers to immediately produce high quality Spark workflows. It also encodes a framework of best practices that ensures you're set up for success from the beginning.

Feature Checklist

There are many features that are present in AbInitio, but are missing in Apache Spark. You need to ensure that you have a solution for these in your target Architecture. The primary feature buckets are:

  • Visual Interactive Development - AbInitio provides a visual drag-and-drop development environment to quickly build workflows, and you can run workflows interactively as you're developing them. Either there needs to be an equivalent product on top of Spark, or training in place to ensure the team can succeed at developing quality code on Spark.
  • Development Framework - AbInitio's visual development provides standardization, with standard components such as reformat or normalize that all the workflows are built from. There are also reusable custom components built by users or the AbInitio consultants. What is the development framework on Spark? Not having standard blocks means everyone's code will look different.
  • Configuration System - AbInitio provides configuration variables at multiple levels - system, project and workflow. Different execution environments are configured differently - the same dataset might be in different locations across the development, test and production environments (a sketch of one simple approach follows this list).
  • Metadata: Search & Lineage - Governance requires quality metadata. There needs to be a mechanism to search for particular values. As an ETL developer, you need to understand the downstream impact of your changes. As an analyst, you want to know how values came about. For governance, you need to track sensitive columns and understand regulatory restrictions on datasets.
  • Testing - There needs to be a story for how the workflows will get tested, so that incremental changes can be pushed to production quickly. Tests should cover the business logic in a workflow, at the component (unit) and workflow level. Data quality tests are absolutely essential - they test the final product and give you the highest confidence.
  • Scheduling - How will the workflows get scheduled to run? How can I monitor or restart on failure? There should be a mechanism in place for this.
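
As an illustration of the configuration point above, here is a minimal sketch of environment-level configuration in plain Spark code (the environment names and paths are hypothetical; a real project might keep these in YAML files or a configuration service):

```python
import os
from pyspark.sql import SparkSession

# Hypothetical per-environment settings: the same pipeline reads and writes
# different locations in development, test and production.
CONFIG = {
    "dev":  {"input_path": "/dev/landing/orders",  "output_path": "/dev/curated/orders"},
    "test": {"input_path": "/test/landing/orders", "output_path": "/test/curated/orders"},
    "prod": {"input_path": "/prod/landing/orders", "output_path": "/prod/curated/orders"},
}

env = os.environ.get("PIPELINE_ENV", "dev")   # selected at deploy time
conf = CONFIG[env]

spark = SparkSession.builder.appName(f"orders-{env}").getOrCreate()
orders = spark.read.parquet(conf["input_path"])
orders.write.mode("overwrite").parquet(conf["output_path"])
```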

Defining Success of Migration

Once we know what a successful data engineering target on Spark looks like, we must now define how we'll measure the success of the migration to this new process.

Migration should produce quality code

Migration requires an understanding of the AbInitio format, an understanding of the Spark format, and finally an understanding of the business logic. Unless there is a target framework for how the converted code should look, the converted code will be a mess. Therefore, the task here is to define how the target code on Spark will be structured.

Migration timeline & cost should be predictable

Migration timelines are notoriously unpredictable. For a medium complexity workflow, our customers estimate that they need to put 2 engineers in a box (one each with AbInitio and Spark skillsets), and they can convert and test a workflow in about 2 weeks. You'll be spending money on migration, some of which will be recovered from the lowered cost on Spark. However, the longer the migration takes, the longer you're paying for AbInitio - and this can significantly change the cost equation.

Without some amount of automation, it is very hard to achieve predictable timelines and costs, especially as you’ll be learning how to do migration while doing it the first time - having multiple missteps along the way.

Once you've decided to move from AbInitio to Spark, and the targets are clear - you know how you're going to succeed on Spark and what your migration targets are - it's time to do the financial planning.

Financial Planning

Financial planning has to take into account these major factors:

  • How much is the existing spend - AbInitio licensing & machine costs
  • How much will be the new spend - licensing & compute
  • How much will it cost to move
  • How much existing cost will be reduced
  • When will the existing cost be reduced - this depends on the speed of migration

Let's work through an example where we're estimating the numbers. In an actual case, real numbers will be plugged in, but directionally the planned finances should be structured as below:

To make the numbers simple, let's say you're paying AbInitio $6M a year, which includes:
  • 100-200 users
  • 64-128 cores
  • 5,000-10,000 workflows
  • 1 metadata server

Moving from AbInitio to Prophecy in this example (see table below) is projected to save $4.5M every year. This does not include the cost of hardware before or after, but that cost is expected to be comparable, and insignificant when compared to the projected savings.
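
As a back-of-the-envelope check on that projection (the new-spend figure below is an assumption implied by the stated savings, not a quoted price):

```python
# Rough yearly savings in the $6M example above.
abinitio_spend = 6.0   # $M/year, existing AbInitio licensing
new_spend = 1.5        # $M/year, assumed Prophecy licensing plus Spark compute
print(f"Projected yearly savings: ${abinitio_spend - new_spend:.1f}M")  # ~$4.5M
```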

The migration cost can be paid for by the reduction in licensing cost if there is no cap in play for the AbInitio license, otherwise you’ll have to look closely at your contract. The first part of migration requires setting up the automation and moving 30% of the workload, and the cost reflects that. As we get to the second part, there are many fewer things to figure out and a lot more automation can come into play. Finally, it is common to leave a smaller AbInitio footprint around for longer, once the cost is not a concern - this might be preferable for stability if it is pulling data out of multiple operational systems.

Projections with manual conversion

When looking at the cost of automated migration, there must be a comparison with alternatives to see if there is a better approach. There are a couple of companies that claim they can migrate AbInitio to Spark, but their approach is primarily manual and slow; it is often a time sink with a high failure rate.

The only other viable alternative is to do the conversion manually. This comes with the benefit that current developers will actually look at the workflows instead of blindly migrating them. While this can be the executive direction, in practice, against deadlines, work is often done hastily, and the current team will have to be augmented with system integrators to absorb the additional load.

Let’s take an example of converting a workflow of medium complexity:

  • Code conversion - Each workflow will have to be read from the AbInitio UI (there is no export available), and for each component (in the drag-and-drop visual graph) equivalent code has to be written in Spark.
  • Data Match - To test, sample data will be brought to the test and integration environments, and the two workflows will be run in parallel to compare the results (a sketch of this comparison follows the list). Often there will be differences in the output, and some debugging is required to figure out the issues. Once this is done, the workflow can be run in parallel for some days to ensure that the results match.
  • Performance - When the datasets are large, there will often be performance issues, since the workflow was designed for AbInitio. These can usually be fixed in a few days.
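
For the data-match step, here is a minimal sketch of how the two outputs could be compared (the paths are hypothetical); rows present in one output but not the other surface as the symmetric difference:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-match").getOrCreate()

# Hypothetical outputs written by the legacy (AbInitio) run and the migrated Spark run.
legacy = spark.read.parquet("/validation/legacy_output")
migrated = spark.read.parquet("/validation/spark_output")

# Rows in one result set but not the other (duplicates are respected).
only_in_legacy = legacy.exceptAll(migrated)
only_in_migrated = migrated.exceptAll(legacy)

mismatches = only_in_legacy.count() + only_in_migrated.count()
print(f"Mismatched rows: {mismatches}")
if mismatches > 0:
    only_in_legacy.show(20, truncate=False)     # sample differences for debugging
    only_in_migrated.show(20, truncate=False)
```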

We have found that this process usually takes 2 weeks for 2 engineers, where one understands AbInitio well and the other understands Spark well. Here is a table of the estimated work:

One has to find engineers who are skilled in Spark and AbInitio. Let's assume:
  • We have a team of 30 engineers dedicated to this - 15 who know AbInitio well and 15 who know Spark well
  • Each engineer is offshore and costs $120,000 per year

Here, you'll see the hump of expenditure caused by the migration effort.

In the final analysis, over the 3-year period while the migration is ongoing, about $19M will have been spent, as opposed to $7.6M over two years for automated migration. In addition, the manually converted code might be inconsistent in quality and performance.
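
A back-of-the-envelope view of how that three-year figure could decompose, assuming the labor is the 30 dedicated engineers above and that AbInitio licensing ramps down roughly linearly from $6M per year as workflows move off (an assumption; actual contracts and ramp-downs will differ):

```python
# Rough decomposition of the ~$19M, 3-year manual-conversion estimate.
engineers = 30
cost_per_engineer = 0.12   # $M/year, offshore rate assumed above
years = 3

labor = engineers * cost_per_engineer * years   # 30 * 0.12 * 3 = $10.8M

# If AbInitio licensing declines roughly linearly from $6M/year towards zero
# over the migration, it averages about half the current bill during those years.
residual_licensing = 0.5 * 6.0 * years          # ~$9M

total = labor + residual_licensing
print(f"Labor ~${labor:.1f}M + residual licensing ~${residual_licensing:.1f}M = ~${total:.1f}M")
# In the same ballpark as the ~$19M estimate above.
```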

Aligning the Organization

The most important dimensions of any major technology transformation program are enabling the people, planning the adoption, and governing the process to ensure adherence to the plan.

Plan to enable the people

Enterprises moving to Spark from AbInitio will need to make sure that all the users (ETL developers, Analysts, Managers and Data Officers) are aligned to a common mission of modernization, without any sense of insecurity. They need to be assured that they are an integral part of the future data landscape as well. That can be achieved by:

  • Making sure current developers are productive on the new ecosystem (Spark). We already talked about making the existing skillset successful on Spark by using an intuitive IDE
  • Providing them with features on Spark that make them as productive as they were on AbInitio. Some of our customers are very passionate about this dimension and are constantly pushing Prophecy to add new features that help their Data Engineers and even Business Analysts be successful on Spark.

Plan the adoption sequence

An adoption plan is key to making sure the transformation is successful. First, there must be enthusiasm: a common mission of modernization can excite and align the teams. Creating one or more lighthouse teams is the best approach, where early adopters are challenged and incentivized to design and execute on an MVP with Spark and Prophecy.

These lighthouse teams, through the process of envisioning, designing and executing the MVPs with Prophecy and Spark, establish best practices for the organization and create a roadmap/blueprint for the rest of the organization to be successful.

MVPs typically run 3-4 months, cover 5-10% of the AbInitio workload, and have initial teams of 10-15 engineers. After the MVP, a structured rollout of 6-18 months (including migrations and Prophecy onboarding) helps other teams learn from the champions and join the bandwagon.

In summary, you want to have these things in place:

  • Lighthouse Team: The first team that is ready and eager to adopt the new technology
  • Lighthouse Workload: The workload of the first team
  • Budget: The financial plan is approved, and budget is released for the first milestone
  • Target: The target system is functional and administered

We have often seen that before the target system is chosen (you're still deciding which cloud to pick), or before a lighthouse team and workload are identified, the central team wants to start a POC to prove out the migration technology. This wastes the time and resources of the customer, vendor and system integrator, and must be avoided.

Monitor & Govern the adoption process

Governance and executive commitment are essential for this migration to be successful. Depending on the organization structure, this migration is a charter of the Chief Data Office, a separate Cloud Transformation Office, or the CIO org. VPs of individual business units might also take the lead in transforming their business units first.

AbInitio migration can be a standalone migration (a move to open source while staying on-premises) or part of a more complex cloud transformation that includes the Data Warehouse and the consumption of data and analytics.

Well-structured program governance will define the success of this migration.

  • Define Phases: What is included - team, workload, target, budget
  • Define Success of Phases: What is the success/exit criteria, and what KPIs will be measured
  • Define Monitoring Cadence: Executive-level cadence (2-4 weeks)

Executive presence in bringing the planning process to a conclusion, and in monitoring whether the project is on track, becomes a forcing function that keeps all the teams aligned.

Our Value

Prophecy Experience

At Prophecy, we have worked with prominent Fortune 500 Enterprises through their journey from AbInitio to Spark. We have built products to move from AbInitio to Spark with high automation. We also have the operational experience of working with multiple Enterprise teams and system integration partners to achieve rapid migration and modernization.

Working with Prophecy

Prophecy brings tremendous expertise and experience in migrating legacy ETL workloads to open source and cloud technologies. This significantly reduces risk, in both time and money.

We also have partnerships with multiple System Integrators and experience working with them to deliver on projects, and we may already have a partnership with your Spark provider. This can significantly reduce the time for onboarding.

Reach out to Prophecy.io for

  • A personalized overview of the migration process
  • A demo of automated conversion from AbInitio to Spark
  • A demo/trial of Prophecy Spark IDE

Request a Demo!

We look forward to assisting you in your migration journey!

Ready to give Prophecy a try?

You can create a free account and get full access to all features for 21 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.
