Getting started with low-code
Let's create a pipeline using Prophecy's visual, low-code interface for Spark.
Together we’ll build a getting-started pipeline using Prophecy's visual design tool. We’ll read, transform, and write a dataset. Our visual pipeline "ExploreTPCH" will be converted to PySpark code and committed to this repo. This is the first of many Prophecy blogs to use the TPC datasets [1]:
“This [TPC-H] benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.” - Source: TPC organization
Let’s build this pipeline together...
- Sign up (video, 0.5 min), then create a project and pipeline (video, 1 min)
- Read a TPC-H table from a Snowflake Database
- Transform the data
- Write out as a Delta Table
We’ll focus on the LINEITEM table from TPC-H. The table contains approximately 6 million records and 16 columns. Each record represents a line item of an order for a fictional company. Let’s take a look at the LINEITEM table:
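Under the hood, reading this table maps to the Spark-Snowflake connector. Here's a minimal sketch of the equivalent PySpark read, assuming the connector is installed; the connection options are placeholders, and the table shown is the copy Snowflake ships in its SNOWFLAKE_SAMPLE_DATA share:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExploreTPCH").getOrCreate()

# Placeholder connection options -- substitute your own account details.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "SNOWFLAKE_SAMPLE_DATA",
    "sfSchema": "TPCH_SF1",
    "sfWarehouse": "<warehouse>",
}

lineitem = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "LINEITEM")
    .load()
)
lineitem.printSchema()  # 16 columns; ~6 million rows at scale factor 1
```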
The LINEITEM table fits into the TPC-H schema definition as follows:
The LINEITEM table is commonly used to benchmark business queries, so let's try one! Let's do an aggregation with a group-by using the visual drag-and-drop interface:
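As a rough PySpark equivalent of that Aggregate gem (a sketch, not the literal generated code), a TPC-H Q1-style pricing summary groups line items by return flag and line status:

```python
from pyspark.sql import functions as F

# Group-by aggregation in the spirit of TPC-H Q1; column names follow
# the Snowflake sample schema (uppercase).
pricing_summary = (
    lineitem.groupBy("L_RETURNFLAG", "L_LINESTATUS")
    .agg(
        F.sum("L_QUANTITY").alias("SUM_QTY"),
        F.sum("L_EXTENDEDPRICE").alias("SUM_BASE_PRICE"),
        F.avg("L_DISCOUNT").alias("AVG_DISC"),
        F.count("*").alias("COUNT_ORDER"),
    )
)
```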
We’ll order the records and write the result as a Delta table:
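In code, that ordering-and-write step might look like the following sketch (the output path is a placeholder, and Delta Lake is assumed to be available on the cluster):

```python
# Order the summary rows, then persist them as a Delta table.
ordered = pricing_summary.orderBy("L_RETURNFLAG", "L_LINESTATUS")

(
    ordered.write.format("delta")
    .mode("overwrite")
    .save("/tmp/tpch/pricing_summary")  # placeholder output path
)
```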
In just a few minutes, we’ve created a visual pipeline that reads, transforms, and writes data. Prophecy generates Python (or Scala) code from this visual pipeline. Let’s commit the Python code to a GitHub repository (mine is here). See the aggregate function written in Python to call the Spark API:
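The repo holds the actual generated code; as a hand-written approximation of its shape, each visual gem becomes a small function over DataFrames, roughly like this (names are illustrative, not Prophecy's literal output):

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

# Illustrative shape of a generated gem: one function per visual step.
def Aggregate_1(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.groupBy("L_RETURNFLAG", "L_LINESTATUS").agg(
        F.sum("L_QUANTITY").alias("SUM_QTY"),
        F.count("*").alias("COUNT_ORDER"),
    )
```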
That was easy! We completed a standard example query for a TPC-H table. What about exploring and transforming our data? Let's do some data cleaning:
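A few typical cleaning steps on LINEITEM might look like this sketch (the specific rules are illustrative, not taken from the pipeline above):

```python
from pyspark.sql import functions as F

# Illustrative cleaning: dedupe on the natural key, drop rows with a
# null order key, and trim whitespace from a string column.
cleaned = (
    lineitem.dropDuplicates(["L_ORDERKEY", "L_LINENUMBER"])
    .filter(F.col("L_ORDERKEY").isNotNull())
    .withColumn("L_SHIPMODE", F.trim(F.col("L_SHIPMODE")))
)
cleaned.show(5)
```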
We accomplished a lot in a short blog post.
This pipeline, committed to GitHub, is ready to be packaged and deployed using SDLC best practices: peer review, unit testing, CI/CD, scheduling, and monitoring. Follow along to get started with your own PySpark pipeline. Overcome the barriers to entry and use this visual tool to build a production-quality benchmarking pipeline. Give low-code tooling a go and let us know what you think! Ping me or schedule a session with me or my team.
Ready to give Prophecy a try?
You can create a free account and get full access to all features for 14 days. No credit card needed. Want more of a guided experience? Request a demo and we’ll walk you through how Prophecy can empower your entire data team with low-code ETL today.