Introducing Cross-platform column-level lineage-diff

Discover how Paradime's revolutionary cross-platform column-level lineage diff can help you deploy changes quickly, confidently, and safely.

May 10, 2023
A reading icon
4
 min read
Introducing Cross-platform column-level lineage-diff

Introduction

Today is Day 3 of our Paradime v3.0 release, and I am beyond delighted to launch Cross Platform Column-Level Lineage Diff. We have built the only zero-warehouse-compute solution on the market today that calculates lineage-diff before the data, tables, or logs hit the warehouse. It's blazingly fast, cheap, robust, and cross-platform.

Problem - CI/CD in analytics development

The analytics development process has 4 distinct steps:

  • exploration - where one is trying to understand the data they would then eventually model into a table or metric or KPI
  • development - this is where dimensional modeling happens, business logic gets translated into SQL and ultimately dbt™* models are born
  • CI/CD - this is where each pull request (PR) goes through testing, checks and approval before merging into main branch
  • production deployment - this is where data models get deployed to run day in and day out in production

Speaking with customers like Tier Mobility, the biggest bottle-neck in the above process for growing teams was the CI/CD step. For even the smallest changes, it can take analysts between a few hours to a few days before their work is approved and merged.

Problem - CI/CD in analytics development | Paradime | Analytics Engineering using dbt™
Typical analytics development process

Now, if we look at the anatomy of a model change in analytics, it has three consequences viz.

  1. code - or code-diff that we are all familiar with and is solved
  2. downstream dependency - because data flows from left to right, upstream changes can have downstream impact all the way to dashboards, reports, and users e.g., removing a column can break the CRO report - remains a big issue today
  3. row-level data - or data-diff, which the amazing folks at Datafold are already solving, and practioners use data quality tests to get comfort

And because the impacts (2 & 3) are not readily available in a PR, it takes anyone ages to approve a PR. People end up in meetings, discussions, and back-and-forth on Slack that takes days.

Our solution

At Paradime, we wanted to build a solution so that our customers could deploy changes safely and confidently.

And so was born our cross-platform column-level lineage-diff that relies only on code and information schema. Customers in our Growing or Scaling Tier, who already use our Turbo CI Github app, will have this available automatically. Customers in our Starting tier will have to upgrade to the Growing tier.

Step - 1

Whenever someone opens a PR or pushes changes in an existing PR, we analyze the column-level diff between

  • the lineage in the repo default branch and
  • the lineage in the the current branch of the PR.

This diff gives us the impacted column level fields.

Step - 2

Based on the impacted column level fields, we determine all the impacted dbt™* nodes and the dashboards in the BI layer. Currently we only support Tableau dashboards but others like Looker are coming soon.

Problem - CI/CD in analytics development | Paradime | Analytics Engineering using dbt™
Output of lineage diff in a Pull Request

Step - 3

We present the impact summary and detailed nodes in a clean and easy-to-use report within the Github PR itself.

How do I get started?

Getting up and running with Lineage-diff is super easy and straight forward too:

  1. Signup for the Growing Tier on Paradime - existing customers don't need this, though
  2. All you need to do is install the Paradime Github app within your Github organization in a couple of clicks from our integrations page
  3. Authenticate your Github user account with Paradime

And that's it. Paradime will automatically start computing the lineage-diff on every PR and will publish a report in the PR.

How is Paradime solution different?

At Paradime, our column-level lineage diff is:

  • Blazingly Fast - because we only read and compile your models and we don't need to create views or materialize tables
  • Cheap - because we only read the information schema; it's a cheap query that does not burn through warehouse credits
  • Robust - we support Snowflake, BigQuery, Redshift, Firebolt, Databricks dialects and our average accuracy is 96% across all dialects. As a bonus, we support the dreaded "select * from fact_table". Analysts don't have to explicitly write out 600 columns and aliases from an SFDC table for column-level lineage to work.
  • Cross-platform - as we connect the dbt™* and BI layer in Paradime, we can show impacted dashboards too. This detail gives complete end-to-end visibility on the blast radius of change to anyone looking at the PR.

Compared to Paradime, pretty much every column-level lineage solution on the market, relies on parsing query logs to construct table and field level lineage. This approach makes current solutions slow, expensive and brittle:

  • Slow - because it needs all your models to run before the query logs are available for parsing. Different warehouse vendors make their query logs available at different frequencies. Imagine if you have 500+ models or TBs of data, that's a lot of wait time per push and CI run. This wait can run into 30mins to a few hours.
  • Expensive - every CI run consumes warehouse credits, and when you stack up all the PRs, commits per PR and CI runs per PR, the cost is not trivial
  • Brittle - current solutions need SQL written in a certain way for the column-level lineage to work, e.g., explicit column names, clear aliases, and the list goes on - this is impractical. We can't expect an analyst to explicitly type out 600 column aliases from an SFDC table for column-level lineage to work. Current solutions are particularly poor at expanding the famous "select * from fact_orders".

All of the above makes current solutions, great on demos, but practically useless.

FAQs

  1. Is only Github supported? - Yes, currently we only support Github but we will be adding support in Gitlab, Bitbucket, and ADO in the next quarter.
  2. Which tiers are supported? - This feature is only available in the Growing and Scaling Tier. Companies in these tiers face the most pain, with dashboards breaking from upstream changes.
  3. Will this consume Github Action minutes? - No, because the compute happens on our infrastructure and not on Github-hosted runners.
  4. What BI platforms are supported? - Currently, only Tableau but more platforms are coming soon.

Conclusion

We have spent quite a few months working on this and it's an amazing engineering achievement by the team. We had to build our own parser and algorithm in Go from scratch to make this work. But this is really early stages. Analytics teams have never had this unprecedented view. We think a lot of powerful features and functionalities can be built on top to make the analytics engineering process come even closer to software engineering.

Our lineage-diff will also become instrumental in organizations implementing data mesh on top of dbt™* to reduce the adverse impact of breaking changes on other data products.

Sign up for a FREE 14-day trial or schedule some time with our team to learn more about Paradime 💪

Interested to learn more?
Try out the free 14-days trial
Close Cookie Preference Manager
By clicking “Accept All Cookies”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage and assist in our marketing efforts. More info
Strictly Necessary (Always Active)
Cookies required to enable basic website functionality.
Oops! Something went wrong while submitting the form.