Stuff we shipped #6
The Paradime product update #6. We shipped working with prod data during dbt development and injecting runtime variables into materialized tables.
Kaustav Mitra
Jun 13, 2024
·
3
min read
Introduction
Welcome to another edition of what we shipped. This one is a bit special because this will be our last shipment for Paradime v2.0. Yep, you heard it right - Paradime v3.0 is coming very, very soon 🚀 🤩.
Use prod data and schema during development
Problem definition
Most analytics engineers work with development data that is stale and even worse, they don't know about it 😱. Let's rewind that a bit. Organizations using dbt™*, run production jobs that update their production data warehouse continuously throughout the day. During those runs, the prod warehouse has the freshest schema and freshest data.
Regarding building dbt™* models during development, analyst use the development schema and dev warehouse. The dev warehouse typically only has partial data or schema. Some orgs refresh the dev warehouse nightly with partial or full replication of prod to keep dev current. In some other orgs, analysts run a black box script to copy prod to dev before they start work every morning. Whatever the scenario, users always run into the problem that during working during the day, the dev schema falls behind prod.
When you are working in a large team with 20 or more analytics engineers, commits happen throughout the day and there are some pipelines that run every 10 minutes. So every 10 minutes, dev can fall behind prod.
This disconnect between prod and dev leads to the following consequences:
Building and testing dbt™* models with changing data and / or schema of upstream dependencies becomes tedious and time consuming. One needs to always manually refresh dev with prod schema or build all models in dev again.
The business logic you have built and tested is faulty because you have used synthetic data and not real data. e.g., you have only pivoted across allowable values in a column in dev, but prod has more possibilities.
Existing solutions
Currently, there are a few ways one can go about this with varying degrees of success:
Cloning prod to dev: If you are a dbt Core™* user with a local setup, you can spend about 10 - 20 minutes/developer/day cloning prod to dev on Snowflake - might take longer on Redshift, BigQuery or Databricks and do this multiple times during the day.
Using defer +state:modified: Keep downloading the most recent manifest.json as many times as you like and replacing your local one - pretty tedious, annoying, and can easily take 20mins/dev/day assuming you do this ~5 times per day.
using dbt-Cloud: sorry, this is not possible in dbt Cloud™* 😞.
The Paradime way 🔥
At Paradime, we want to provide our customers with the ultimate flexibility in getting their stuff done. Our customers push the boundaries of what is possible. They don't have time and they want to challenge the status quo.
So, our customers, when they run their production dbt™* jobs on Paradime, can now in a single dbt™* command, develop continuously using production data and production schema as can be seen from the graphic below:
Paradime modified --defer --schedule-name | Analytics Engineering using dbt™
Paradime customers just need to add the --defer --schedule-name=<bolt-schedule-name> modifier at the end of their dbt™* run command and we will take care of fetching and replacing local manifest.json. We will do this with zero overhead for the customer while improving the development experience by orders of magnitude.
Currently, this is a fairly advanced feature. In a next iteration, we are going to make this even more user-friendly, so that platform leaders can roll this out to their entire analytics team with zero training or learning overhead.
So if you move your dbt™* production jobs to the Paradime Bolt scheduler, this feature can be used straightaway.
Bolt runtime variables
In Paradime Bolt, we now expose a set of runtime variables that can be injected as separate columns into materialized tables through the dbt™* Jinja-SQL. The variables are:
PARADIME_WORKSPACE_NAME (the Paradime workspace from where the job was triggered)
PARADIME_SCHEDULE_NAME (the Bolt schedule name that triggered the job)
PARADIME_SCHEDULE_RUN_ID (the Bolt run-id that wrote the data in the table)
PARADIME_SCHEDULE_RUN_START_DTTM (the date and time when the run started)
PARADIME_SCHEDULE_TRIGGER (one of scheduler, manual from user@email.com, API, TURBO CI)
PARADIME_SCHEDULE_GIT_SHA (the commit SHA from git)
DBT_MODEL_VERSION (experimental placeholder for dbt™* v1.5.x)
These variables can be referenced like standard environment variables in dbt™* model SQL.
Conclusion
That's all for now as we prepare for Paradime v3.0 and we hope you will benefit from this latest product release. We will be back again with a few more updates. In the meantime, have a great rest of the week - keep building and getting stuff done.
Sign up for a FREE 14-day trial or schedule some time with our team to learn more about Paradime 💪