III. Analytics Engineering Benchmarks IRL
To understand the current landscape of analytics engineering, we analyzed real-world usage data from Paradime users, focusing on their interaction with dbt™ and other analytics tools. This analysis reveals both dominant patterns and more advanced or emerging practices in the field that can be distilled from looking at larger or more complex projects (the number of models in a project or the number of maintainers can be considered proxies for maturity and complexity).
3.1 Frequency of data pipeline runs
There is a lot of chatter around streaming data and high-frequency data ingestion. That said, 75% of projects are running at least one job at a daily frequency, and only 13% are running a job at a frequency higher than hourly. This distribution implies that daily cadences still dominate at most organizations.
Nearly 75% of projects run production jobs at a daily frequency
Despite buzz around real-time and streaming data, a significant majority of projects rely on daily jobs. Only 13% use frequencies higher than hourly.
The percentage of projects running jobs at a given frequency peaks at the daily cadence and falls off sharply as frequencies rise from several times a day to hourly and toward truly real-time data pipelines.
For context, dbt™ projects in Paradime average between 2 and 13 production jobs, depending on their size (as measured by the number of models). Projects with up to 50 models average two jobs, while those with 201 to 500 models average 13.
When looking across projects at the distribution of production job frequency in aggregate, a similar picture emerges.
- 50% of data pipeline runs are scheduled to run daily
- 22% run several times a day
- 11% run monthly
- 7% run weekly
- 6% run several times a week
- 2% run hourly
- 2% are high frequency, i.e., scheduled in real-time or near real-time (e.g., every few minutes)
In other words, nearly three-quarters of total jobs, 72%, are set to run daily or several times a day. Meanwhile, 24% are set at a frequency of several times a week or lower, i.e., several times a week, weekly, or monthly.
Less than 1 in 20 jobs, 4%, are scheduled hourly or more frequently (e.g., every few minutes).
This data implies that running pipelines daily (or less frequently) strikes a workable balance between cost and most organizations’ data needs, and that higher frequencies would not deliver enough ROI to justify the added cost and overhead at many companies.
That’s not to say that streaming and high-frequency data pipelines are not important in specific industries and use cases, as we learned from several data leaders. There’s growing adoption of higher-frequency and real-time data processing in areas like hedge funds and asset management.
3.2 Testing
Our analysis of Paradime user data reveals significant gaps in testing across analytics engineering teams. These insights shed light on current trends and point to areas where AI might be leveraged to improve data quality.
Overall test coverage is low, with an average of only 8.7% of models across projects having at least one test.
Test coverage — or the percentage of models with at least one test — is inversely related to project size:
- In smaller projects with 1 to 50 models, an average 18.5% of those models, or nearly 1 in 5, had a test.
- In medium-sized projects with 51 to 200 models, that average falls to 10%.
- In projects with 201 to 500 models, coverage continues to drop to an average of 8.8%.
- In projects with 500+ models, the average is 8.2%.
- The same relationship holds for other test metrics, including percentage of models with tests other than non-null tests and percentage of columns with tests. Coverage drops in a stair-step fashion as project size increases, with the biggest drop-off in coverage coming once projects grow larger than 50 models.
The fact that test coverage drops with project size isn’t surprising in and of itself. As projects scale, it’s difficult for engineers to keep pace and sustain the creation of tests. AI-powered test generation can help close this gap, but as we’ve seen, it is still less widely used than other AI-powered tools.
However, if we accept project size as a proxy for data project complexity and/or maturity, the drop-off in test coverage is worrisome. The implication is that the more mature and more complex the data project, the higher the proportion of models relying on untested data.
Test coverage for fewer than 10% of models across projects
Across all projects in Paradime, only 8.7% of models have at least one test.
Another insight: if we look for tests beyond fundamental non-null tests, test adoption drops further. For example:
- In small projects with 1 to 50 models, an average of 18.5% of models had at least one test, but that average dropped to 13.9% for models with at least one test that was not a “not-null” test. That’s a significant drop: only roughly 14 in 100 models, on average, had testing beyond non-null tests.
- Among projects with 201 to 500 models, an average 8.8% of models had at least one test, but an average 7.5% had at least one test that was not a “not-null” test.
Not-null tests, which check that specified columns contain no null values, are often the first type of test implemented in a data project. They serve as a starting point for more comprehensive testing. As with overall coverage, testing beyond non-nulls declines with project size, ranging from an average 14% of models in small projects to only 7% in extra-large projects.
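To make the distinction concrete, here is a minimal sketch of a dbt™ schema file that pairs a basic not_null test with richer generic tests (unique, accepted_values, relationships). The model and column names are hypothetical and purely illustrative.

```yaml
# models/schema.yml -- illustrative only; model and column names are hypothetical
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null        # the baseline check most projects start with
          - unique          # goes further: guards against duplicate keys
      - name: status
        tests:
          - accepted_values:     # validates the domain of a categorical column
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:       # referential integrity against another model
              to: ref('customers')
              field: customer_id
```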
Moreover, average column coverage also drops significantly as project size increases. In small projects, only 7% of columns have tests. In medium to extra-large projects, an average of 3.5% or fewer columns are tested.
Testing coverage declines with project size, with a steep drop-off beyond 50 models
Data underscores the need for test generation assistance, particularly as projects grow in size and complexity.
Testing coverage declines steadily with project size, with only 7% of models having tests other than non-null tests in projects with more than 500 models.
Putting all the test data together:
- On average, a low percentage of models have any testing coverage at all, and the larger the project is, the lower the average test penetration (e.g., in projects with 500+ models, an average of only 8.2% of models have tests).
- Among models with tests, a significant proportion are running only non-null tests.
- Column coverage averages between 7% for small projects (those with up to 50 models) and 2% for extra-large projects (which have 500 or more models). This means that even in models with tests, only a fraction of the data is being validated.
The implications for data quality are obvious: This low average test coverage at the model and column level, especially in larger projects, suggests many organizations are leaving broad swaths of their data untested.
While teams will doubtless focus tests on the most crucial models and columns, any gaps increase the risk of errors propagating through data pipelines and negatively impacting business decisions.
Further, the drop in test coverage when looking beyond non-null checks suggests there’s a significant reliance on the most basic tests.
AI-powered test generation, if it becomes more widely adopted, presents a clear opportunity for improving test coverage.
3.3 Materialization strategies
Our analysis of Paradime usage data reveals interesting patterns in how organizations leverage different materialization strategies. These strategies play a crucial role in optimizing data pipeline performance and resource utilization.
3.3.1 Incremental models: Balancing efficiency and complexity
Incremental models update only new data rather than performing full rebuilds, resulting in meaningfully shorter runtimes. The strategy is particularly impactful for resource-intensive transformations, since it allows the transformation to focus only on new records.
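For illustration, a minimal incremental model looks something like the sketch below. The source, columns, and unique key are hypothetical; the is_incremental() block is what restricts each scheduled run to new records.

```sql
-- models/fct_events.sql -- illustrative sketch; table and column names are hypothetical
{{
  config(
    materialized='incremental',
    unique_key='event_id'
  )
}}

select
    event_id,
    user_id,
    event_type,
    created_at
from {{ source('app', 'raw_events') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what the table already holds
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```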
Adoption of this strategy shows a nuanced pattern:
- Small projects (1-50 models): 5% of models
- Medium projects (51-200 models): 11% of models
- Large projects (201-500 models): 14% of models
- Extra-large projects (500+ models): 5% of models
Adoption of advanced materialization strategies increases with indicators of project maturity and complexity, like number of models – except in very large projects
In dbt™ projects in Paradime that have 201 to 500 models, ~14% of models are incremental
Incremental materialization is used in roughly 9% of all models across projects but in a greater proportion (14%) in large projects with between 201 and 500 models.
The pattern holds when employing a different indicator of project maturity and complexity — the number of maintainers on a project.
Adoption of advanced materialization strategies tends to increase with indicators of project maturity and complexity, like number of maintainers
In dbt™ projects in Paradime that have 11 to 15 maintainers, ~14% of models are incremental
Incremental materialization usage grows with indicators of project maturity and complexity, like the number of maintainers — up to a point, dipping lower for projects with the most maintainers.
This “inverted U” pattern reveals that while incremental models become more prevalent as projects grow, adoption falls back noticeably in the largest projects.
This could be due to:
- Complexity threshold: At extreme scales, the overhead of configuring and managing many incremental models might outweigh their benefits.
- Alternative strategies: Very large projects might employ different optimization techniques or architectural patterns to manage costs and drive efficiency.
- Diverse use cases: Extremely large projects might serve a wider variety of needs, some of which may not benefit from incremental models.
3.3.2 Snapshot models: Tracking historical changes
Snapshot models capture point-in-time data for historical analysis and are often used to track changes in data over time. While less widely used across the board, Snapshots show a clear trend of increased adoption as project size grows:
- Small projects (1-50 models): 0.35% of models
- Medium projects (51-200 models): 0.85% of models
- Large projects (201-500 models): 1.27% of models
- Extra-large projects (500+ models): 1.85% of models
This pattern suggests that as projects become larger — and likely more complex and mature — the need for historical tracking of slowly changing dimensions becomes more critical.
For example, larger projects often face stricter data governance and auditing requirements, which Snapshots can help address by providing point-in-time views and facilitating data lineage analysis. In addition, larger and more mature projects are more likely to be found at larger firms, which may have a greater propensity for the sophisticated lead tracking and historical backtesting that Snapshots support.
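For reference, a minimal Snapshot definition looks like the sketch below; the source, target schema, and key are hypothetical. Each run compares current rows against the previous snapshot and records changes with validity timestamps.

```sql
-- snapshots/customers_snapshot.sql -- illustrative sketch; names are hypothetical
{% snapshot customers_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

-- changed rows are versioned with dbt_valid_from / dbt_valid_to columns
select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```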
Snapshot usage increases with dbt™ project size
Nearly 2% of models in “extra-large” projects with 500 or more models use Snapshot
Snapshot usage is positively correlated with the number of models in a project.
3.3.3 Ephemeral models
Ephemeral models are not built in the database and do not persist; dbt™ instead interpolates their logic into the downstream models that reference them, which lets companies perform lightweight transformations “in flight” without cluttering their data platforms. Because these models can’t be queried directly, their downstream utility is limited. Not surprisingly, ephemeral models see no adoption in small projects with up to 50 models but otherwise roughly follow a version of the “inverted U” pattern noted earlier with incremental models:
- Small projects (1-50 models): 0% of models
- Medium projects (51-200 models): 2.03% of models
- Large projects (201-500 models): 6.82% of models
- Extra-large projects (500+ models): 0.97% of models
- Average across projects: 2.6% of models
Large projects of between 201 and 500 models have an average of nearly 7% ephemeral models, roughly three times more than the percentage seen in medium-sized projects. Prevalence of ephemeral models drops off to less than 1% in the largest projects, very likely for similar reasons to the tapering-off seen at that project size for incremental models.
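For reference, marking a model ephemeral is a one-line config change, as in the sketch below (model and column names are hypothetical). dbt™ inlines the logic as a CTE wherever downstream models reference it via ref().

```sql
-- models/stg_orders_cleaned.sql -- illustrative sketch; names are hypothetical
{{ config(materialized='ephemeral') }}

-- never written to the warehouse; compiled into downstream models as a CTE
select
    order_id,
    lower(status) as status,
    cast(ordered_at as date) as order_date
from {{ ref('stg_orders') }}
```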
3.4 Optimizing production jobs
Another way to look at optimization is to see whether analytics engineers are refining their jobs to minimize the unnecessary use of resources.
The select and exclude flags are used to optimize data pipeline runs by allowing users to run only specific parts of their project, rather than the entire project. This targeted approach translates to significantly reduced execution time and processing costs, especially in larger projects.
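For illustration, the sketch below shows how these flags narrow a run. The model names, tags, and paths are hypothetical, and the state:modified example assumes a saved production manifest is available for comparison.

```sh
# build only one mart and everything upstream of it
dbt run --select +fct_finance_daily

# build everything tagged "nightly" except one slow model and its descendants
dbt run --select tag:nightly --exclude stg_events+

# run only models changed since the last deployed state
# (assumes a production manifest saved at ./prod-artifacts)
dbt run --select state:modified+ --state ./prod-artifacts
```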
However, these flags aren’t always used to the extent one might expect.
“People tend to use just dbt run as a whole with no selectors and no way of refining the set of commands they want to run,” says Alexandre Carvalho at Stenn. “That’s actually pretty bad practice. The implication is that you’re rerunning the whole project, and there are significant costs and time associated with that.”
In the data, we see a marked tendency toward increased usage of selective runs as project size grows. Of course, large projects have more of a need for these flags, as there’s more to potentially select or exclude. Still, the data suggests that there are untapped opportunities:
- The exclude flag, on its own, is essentially unused across projects with up to 500 models, and appears in only 4% of commands in extra-large projects with more than 500 models.
- In large projects, 1 in 10 commands are running without either flag. In extra-large projects with more than 500 models, 23% of jobs are running with neither flag.
Optimization tactics, like use of the select and exclude flags in commands, are more widely used in larger projects, but many jobs still run with neither flag.
To put this data in context, the average number of commands per job also varies by project size, peaking at roughly 3 commands in large projects with 201 to 500 models:
- Small projects (1-50 models): 1.8 commands
- Medium projects (51-200 models): 1.2 commands
- Large projects (201-500 models): 3.3 commands
- Extra-large projects (500+ models): 2.4 commands
The data suggests that most dbt™ jobs, regardless of project size, tend to use relatively simple command structures. Even in medium projects, which can contain as many as 200 models, the average number of commands is hardly greater than one.
3.5 Terminal commands
Further underscoring the heavy reliance on the most common terminal commands, an analysis of actions on Paradime shows that more than two-thirds of command executions are for dbt run. That means dbt run is roughly seven times as used as the next most common terminal command, dbt compile, with 9.8% of executions.
Newer commands, like dbt list, see small shares of overall usage at less than one-tenth of a percent.
There is a strong “Power Law” distribution in favor of dbt run, in particular, but also dbt compile, dbt build, and dbt test. Together, these four terminal commands account for more than 90% of executed commands.
On the other side of the Power Law distribution, the rest of the 15+ commands together account for only around 8% of actions. For example, dbt run-operation, which invokes macros, accounts for just 1.6% of terminal commands. It’s hard not to conclude that analytics engineers could be exploring less common features more aggressively.
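For reference, the sketch below shows what two of those less common commands look like in practice; the macro name, arguments, and selector are hypothetical.

```sh
# invoke a project macro directly with dbt run-operation
dbt run-operation grant_select --args '{schema: analytics, role: reporting}'

# preview which models a selector would touch before running it
dbt list --select tag:nightly --resource-type model
```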
3.6 dbt™ bloopers
We refer to command line and other mistakes in dbt™ as “bloopers,” and some are head-scratchers. They do offer an interlude of amusement amid a lot of dry data and charts.
Some are comprehensible, for example, dbt runoperation (missing the dash in “run-operation”) and dbt seeds (the real command, dbt seed, is singular). Others less so: dbt json does not closely resemble a command in dbt™ (or any other technology) we can think of. What’s surprising is the frequency with which some users are typing these commands. While dbt runoperation generates an error, it has been tried more than 6,000 times. The mysterious dbt json? Nearly 2,000 times.
From head-scratchers to crossing wires with Git, we hope you enjoy the comic relief!
Unrecognized command!?
The most common dbt™ bloopers on Paradime. Users tried to run dbt runoperation over 6,000 times.