The Case for Data Transformation Copilots


How copilots will change data transformation, and in the process evolve beyond conversational interfaces


Raj Bains and Maciej Szpakowski
June 3, 2024


We're in the age of copilots—AI-powered companions that seamlessly integrate into our lives, leveraging generative AI to boost productivity. From ChatGPT's meteoric rise to 100 million users in just two months to copilots serving diverse sectors like finance, healthcare, and retail planning, their rapid adoption and value delivery are unprecedented. This exponential technological advancement is palpable, marking a significant milestone in innovation.

Generative AI is excellent for programming - code is a more restricted and precise form of language, and it's a domain in which large language models (LLMs) excel. GitHub Copilot claims a productivity gain of more than 50% and has been adopted by over 50,000 businesses, including a third of the Fortune 500. Application programming is already reaping the benefits of generative AI.

With data and analytics now centrally important, the key question is: how will generative AI change how we work with data? For decades, we've known that obtaining clean, high-quality, and timely data is one of the greatest challenges enterprises face, and it is especially critical as enterprises seek to capitalize on the promise of AI. Let's discuss why it has been hard, the challenges with current solutions, how the cloud data stack is evolving, and finally how generative AI can help!

Traditional approaches to data transformation have failed

There are two primary approaches to data transformation, and both have fallen short in delivering clean, trusted, and timely data for analytics and AI. Most business teams feel starved for data, while central data platform teams are overwhelmed and can only deliver a fraction of what is needed, consuming excessive resources in the process.

ETL products - built for a different era

Legacy ETL products such as Informatica and last-mile ETL tools like Alteryx were effective solutions decades ago, offering a user-friendly visual interface that boosted productivity. However, they have significant drawbacks.

  • Proprietary Formats: These tools rely on proprietary formats, locking in their customers and limiting flexibility.
  • Scalability Issues: Their data processing capabilities do not scale effectively. To address scaling challenges, they often resort to SQL pushdown into data warehouses or generate inefficient code for platforms like Apache Spark.
  • Extensibility Limitations: When their functionality limits are reached, these tools rely on Java plugins, which do not scale well and were never designed for extensibility.

In the era of scalable cloud data platforms, where code-based solutions and native performance are paramount, these legacy ETL tools fall short.

Code - power to the few

Code is powerful, delivering native performance, optimizing resource use, and reducing costs. It enables the application of best practices in software engineering and is well-suited for developing standards and frameworks across the enterprise. However, while code is the best format to store business logic, manual code development for pipelines has several drawbacks:

  • Non-Standardization: Over time, code can become non-standard as users leave and new ones join, making it difficult to understand and maintain.
  • Limited Tooling Support: Ad hoc code development lacks support for crucial tasks such as step-by-step execution, column-level lineage, impact analysis, and pinpointing failures, which reduces quality and productivity.
  • Accessibility for Non-Coders: Effective pipeline development must be accessible to a wide range of data users, including those in line-of-business roles who may not be proficient in coding.

In a world where data must be accessible to a large number of users, there are simply not enough coding experts to scale manual code development for pipeline logic.
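To make this concrete, here is a minimal sketch of hand-written pipeline logic in PySpark; the tables, columns, and business rules are hypothetical. It shows both sides of the trade-off: the code is native, testable, and reusable, but writing and maintaining it is out of reach for most line-of-business users.

# A minimal hand-written PySpark pipeline (hypothetical tables and columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

def clean_orders(orders_df):
    """Drop malformed rows and normalize the amount column."""
    return (
        orders_df
        .filter(F.col("order_id").isNotNull())
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )

def daily_revenue(orders_df):
    """Aggregate cleaned orders into revenue per day."""
    return (
        orders_df
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )

orders = spark.read.table("raw.orders")  # hypothetical source table
daily_revenue(clean_orders(orders)).write.mode("overwrite").saveAsTable("gold.daily_revenue")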

We have seen that neither of the existing approaches can be the complete solution to making data available for analytics and AI at the speed and scale we need. Let’s understand the cloud data stack, its structure today, what it is lacking and the opportunity for generative AI to provide a new solution.

Processing & Productivity - the two cloud data anchors

We assert that there are only two stable positions in the cloud data market: power platforms and productivity platforms.

Cloud data platforms - the layer for data processing

Cloud data platforms are extraordinarily powerful: they can process data at scale, build streaming and batch pipelines, and run ad hoc SQL queries. They process data from database tables, APIs, and text and PDF documents for building reports, business intelligence, precision AI, and generative AI.

Code is the lingua franca of cloud data platforms, with Python (for Spark and Airflow) and SQL dominating the market.
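As a rough illustration of that code layer, here is the same kind of daily-revenue transformation expressed as SQL and scheduled with a minimal Airflow DAG; the schedule, table names, and SQL are hypothetical, and a real task would submit the query through the platform's own connector rather than printing it.

# A minimal Airflow DAG wrapping one SQL transformation (hypothetical names).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_revenue_pipeline():

    @task
    def build_daily_revenue():
        # Hypothetical SQL; in practice this is submitted to Databricks, Snowflake, etc.
        sql = """
            CREATE OR REPLACE TABLE gold.daily_revenue AS
            SELECT CAST(order_ts AS DATE) AS order_date, SUM(amount) AS revenue
            FROM raw.orders
            WHERE order_id IS NOT NULL
            GROUP BY 1
        """
        print(sql)

    build_daily_revenue()

daily_revenue_pipeline()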

Cloud Data Platforms will continue to grow in functionality

The anchor in cloud data platforms is data transformation. Once an enterprise selects and invests in one or more solutions, those solutions become difficult to replace due to data gravity. Cloud data platforms that control data transformation can then absorb adjacent data capabilities.

Venture capitalists have funded numerous point solutions in the modern data stack, but most of these cannot sustain themselves as independent companies. They must either be acquired or die.

Cloud vendors are severely limited by the FTC and will be unable to acquire many of these startups. They may find creative ways to work around this, such as investing and setting up OEM arrangements, but these will be rare. Examples include regulators blocking Amazon from acquiring iRobot and Adobe from acquiring Figma, and Microsoft getting creative with OpenAI and Inflection AI.

Independent cloud data platforms like Databricks and Snowflake will continue to acquire startups and integrate quality talent from these founding teams, allowing them to grow and continue to provide ever more power and functionality to their customers. In the last 6 quarters, Databricks has acquired 5 companies including Arcion, MosaicML, and Okera, while Snowflake has acquired 6 including Neeva and LeapYear.

Fragmented functionality is reality

The use of cloud data platforms necessitates leveraging functionality native to each platform. Each platform will perform best with code native to it, maintaining its own metadata, governance, and observability, and acting as the system of record. While data quality features are not yet native to each platform, independent cloud platforms will likely acquire these capabilities over time.

However, the last thing anyone wants as an answer to this fragmentation is yet another product that tries to unify functionality across these platforms. We don't need another central observability, governance, or metadata solution.

The solution is to have a copilot to make you productive with all the systems you already have.

The case for Copilots - the layer for productivity

Since cloud data platforms are focused on power and functionality, copilots must balance this with a focus on the needs of data users, from data engineers to line-of-business data analysts and data scientists. Copilots should provide the right experience for each user, making it easy and productive for them to perform their jobs on their cloud data platforms without needing to understand all the underlying complexity.

Copilots don’t bring another format to get locked into, nor are they another system of record. They do not take power away from users. Instead, copilots serve as AI companions, making users' lives easier and enhancing their productivity.

Copilot Wishlist

The copilot, together with cloud data platforms, should form the complete data transformation stack. Copilots must democratize access, empowering all data users and turning everyone into wizards. To achieve this goal, several key requirements for the copilot emerge:

  1. Integrated & Comprehensive
    1. The copilot should be integrated with the cloud data platforms, providing native support for Databricks, Snowflake, and other Apache Spark and SQL based solutions.
    2. The copilot should have comprehensive support for the entire lifecycle: development, deployment, and observability.
  2. Intuitive & Intelligent
    1. The copilot should provide familiar interfaces to all data users: a visual interface makes line-of-business data analysts and data scientists productive, while a code interface avoids lock-in and allows for best software engineering practices. Integrating the visual and code interfaces makes magic happen, enabling all data users to work together.
    2. We want generative AI to help teams do half the work, maximizing automation that includes recommending transforms, completing business logic, adding tests, and generating documentation.
  3. Open & Extensible
    1. The copilot should help you develop code in standard formats such as Python and SQL, and you should be able to run your pipelines with no closed dependencies or lock-in (see the sketch after this list).
    2. Copilot Plugins enable data users to develop standards and frameworks using code, while visual and AI integrations make these extensions accessible to all data users.

These requirements are pretty straightforward, and we expect them to be non-controversial; however, executing on them is hard.

Over the past half-decade, we have been dedicated to building a copilot guided by these key requirements. While some areas, like observability, are still a work in progress, all other aspects are highly mature. Generative AI is moving fast and continues to get better. Enterprise customers such as J&J, Optum, Texas Rangers, and Aetion are already experiencing significant benefits.

Join us at the Data + AI Summit on June 10 to learn more!

Ready to give Prophecy a try?

You can create a free account and get full access to all features for 21 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.

See the Data Transformation Copilot in Action


Join the Prophecy team at the Databricks Data + AI Summit June 10-13 and book a time to see the Data Transformation Copilot in action. Not making it to the Summit? Request a demo and we’ll walk you through the Prophecy Data Transformation Copilot today.

