A primer on DuckDB, with code examples, extensions, use cases, and how to use it with dbt™ for data transformation, including with Paradime.
Ever wished you had a database that's as easy to use as a spreadsheet but as powerful as a full-fledged SQL engine? Meet DuckDB, the plucky underdog of the data world that's making waves faster than a duck on a water slide.
DuckDB is an embeddable SQL database that fits in your pocket (well, your computer's memory). It's like having a miniature data warehouse right on your laptop, ready to crunch numbers faster than you can say "quack."
Ready to take the plunge? Here's a quick quack-start guide:
import duckdb

# Create a connection (or use :memory: for an in-memory database)
con = duckdb.connect('quack.db')

# Run a query
con.execute("SELECT 'Hello, DuckDB!' AS greeting").fetchall()
Boom! You're now a DuckDB developer. It's that easy.
Don't let its simplicity fool you. DuckDB packs a punch with features like:
- Window functions
- Complex joins
- Vectorized query execution
- Automatic indexing
- Parallel query processing
We will dive into each of them next, with examples.
First up, window functions, here computing a running total of sales per employee:

-- Create a sample table
CREATE TABLE sales (
    employee_id INT,
    sale_date DATE,
    sale_amount DECIMAL(10, 2)
);

INSERT INTO sales VALUES
(1, '2023-01-01', 100),
(1, '2023-01-02', 150),
(2, '2023-01-01', 200),
(2, '2023-01-02', 300);

-- Use a window function to calculate a running total per employee
SELECT
    employee_id,
    sale_date,
    sale_amount,
    SUM(sale_amount) OVER (
        PARTITION BY employee_id
        ORDER BY sale_date
    ) AS running_total
FROM sales;
Next, complex joins across multiple tables:

-- Create additional tables
CREATE TABLE employees (
    employee_id INT,
    name VARCHAR(50),
    department_id INT
);

CREATE TABLE departments (
    department_id INT,
    department_name VARCHAR(50)
);

-- Perform a complex join
SELECT
    s.employee_id,
    e.name,
    d.department_name,
    s.sale_amount
FROM sales s
LEFT JOIN employees e ON s.employee_id = e.employee_id
LEFT JOIN departments d ON e.department_id = d.department_id
WHERE s.sale_date = '2023-01-01'
    AND d.department_name = 'Sales';
DuckDB's engine is vectorized by default: it processes data in batches of values rather than row by row, and you don't need to enable anything. Here's an example that benefits from vectorization:
-- Create a table of one million rows
CREATE TABLE large_data AS
SELECT
    range AS id,
    random() AS value
FROM range(1000000);

-- Run a query that benefits from vectorization
SELECT
    AVG(value) AS avg_value,
    SUM(value) AS sum_value,
    COUNT(*) AS count
FROM large_data
WHERE id % 2 = 0;
DuckDB automatically maintains min-max indexes (zonemaps) on its columns, so you rarely need to create indexes by hand. Here's a query where that automatic indexing helps prune the data scanned:
-- Create a table with ten million rows
CREATE TABLE big_table AS
SELECT
    range AS id,
    range % 1000 AS category,
    random() AS value
FROM range(10000000);

-- Run a query that benefits from automatic indexing
SELECT
    category,
    AVG(value) AS avg_value
FROM big_table
WHERE category BETWEEN 100 AND 200
GROUP BY category
ORDER BY category;
DuckDB automatically parallelizes query processing when possible. You can control the number of threads it uses with the threads setting:
-- Set the number of threads (optional; DuckDB chooses automatically by default)
SET threads TO 4;

-- Run a query that can benefit from parallel processing
SELECT
    category,
    COUNT(*) AS count,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value
FROM big_table
GROUP BY category
HAVING COUNT(*) > 1000
ORDER BY count DESC
LIMIT 10;
DuckDB's extension ecosystem is continuously growing, providing additional functionality to the core database. Here are some popular and useful extensions for DuckDB:
The httpfs extension allows DuckDB to read data directly from HTTP(S) sources, and to read and write data on S3-compatible storage. It's particularly useful for working with remote datasets without needing to download them first.
Example usage:
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv_auto('https://example.com/data.csv');
The spatial extension adds support for geospatial data types and functions, allowing DuckDB to handle geographic and geometric data.
Example usage:
INSTALL spatial;
LOAD spatial;
SELECT ST_GeomFromText('POINT(0 0)') AS geom;
The json extension enhances DuckDB's JSON handling capabilities, allowing for more complex operations on JSON data.
Example usage:
INSTALL json;
LOAD json;
SELECT json_extract('{"a": 1, "b": 2}', '$.a') AS value;
The iceberg extension allows DuckDB to read Apache Iceberg tables, which is particularly useful in data lakehouse architectures.
Example usage:
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('path/to/iceberg/table');
While DuckDB has built-in Parquet support, the Parquet extension adds additional functionality for working with Parquet files, including improved writing capabilities.
Example usage:
INSTALL parquet;
LOAD parquet;
COPY (SELECT * FROM my_table) TO 'output.parquet' (FORMAT PARQUET);
These extensions significantly expand DuckDB's capabilities, allowing it to handle a wider range of data sources and types. The extension landscape keeps evolving as new extensions are developed and existing ones mature, so for the most current information, check out the DuckDB Extensions docs.
Analysts can supercharge their data workflows by combining dbt's transformation capabilities with DuckDB's lightning-fast analytics engine. This powerful duo enables efficient local development and rapid data modeling. Here we illustrate two popular use cases of using dbt™ and DuckDB together.
Analysts can use dbt with DuckDB to develop and test data models locally before deploying to production. This setup allows for quick iterations and validation of complex transformations without the need for a full data warehouse. By using DuckDB's in-memory capabilities, analysts can work with substantial datasets right on their laptops.
For smaller to medium-sized datasets, analysts can build entire data pipelines using dbt™ and DuckDB. This combination is particularly useful for ETL processes, where data can be extracted from various sources, transformed using dbt™ models, and loaded into DuckDB for analysis or further processing.
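Getting started is light on configuration. Here's a minimal sketch of a profiles.yml using the dbt-duckdb adapter; the project name, file path, and thread count are placeholders you'd adapt to your own setup:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: local_warehouse.duckdb
      threads: 4

From there, dbt run builds your models directly inside the local DuckDB file, with no warehouse credentials required.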
Popular DuckDB extensions enhance these workflows:
- httpfs for extracting source data straight from HTTP(S) or S3-compatible storage
- parquet for fast columnar reads and writes between pipeline stages
- json for handling semi-structured sources
By leveraging these tools together, analysts can create robust, efficient, and easily testable data transformation workflows that save money, too.
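To make that concrete, here's a minimal sketch of the extract step such a pipeline might start with; the URL and table name are placeholders:

import duckdb

con = duckdb.connect('local_warehouse.duckdb')

# Enable reading over HTTP(S) (the URL below is a placeholder)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Land a remote CSV as a local raw table for dbt models to transform
con.execute("""
    CREATE OR REPLACE TABLE raw_sales AS
    SELECT * FROM read_csv_auto('https://example.com/sales.csv')
""")

# Downstream results can be exported to Parquet for sharing
con.execute("COPY raw_sales TO 'raw_sales.parquet' (FORMAT PARQUET)")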
Here are the top 5 use cases for DuckDB:
DuckDB excels at rapid data analysis, making it ideal for data scientists and analysts who need to quickly explore and query large datasets. Its ability to handle data larger than RAM and its compatibility with tools like Python and R make it perfect for interactive data exploration.
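For instance, here's a quick sketch in Python: DuckDB can query a Parquet file in place and hand the result straight to pandas (events.parquet is a hypothetical file):

import duckdb

# Query the file directly; no loading or schema setup required
# ('events.parquet' is a hypothetical file)
df = duckdb.sql("""
    SELECT user_id, COUNT(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").df()
print(df)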
DuckDB's speed and ability to handle complex SQL operations make it excellent for ETL processes. It can efficiently read from various data sources (like CSV, Parquet, or JSON), perform transformations, and write the results to different formats or databases.
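A minimal sketch of such a pipeline, assuming a hypothetical source.csv with id, name, and amount columns:

import duckdb

con = duckdb.connect()  # in-memory is fine for a one-shot job

# Extract, transform, and load in a single statement
# ('source.csv' and its columns are hypothetical)
con.execute("""
    COPY (
        SELECT
            id,
            TRIM(name) AS name,
            amount * 1.1 AS amount_with_tax
        FROM read_csv_auto('source.csv')
        WHERE amount IS NOT NULL
    ) TO 'clean.parquet' (FORMAT PARQUET)
""")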
For smaller to medium-sized datasets, DuckDB can serve as a lightweight, local data warehouse. It offers many features of larger data warehouse systems but without the need for complex setup or infrastructure, making it ideal for individual users or small teams.
Machine learning workflows often require extensive data preprocessing. DuckDB's speed and SQL capabilities make it an excellent tool for cleaning, transforming, and preparing data for machine learning models, especially when working with structured data.
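A sketch of that handoff, assuming a hypothetical raw_features table; the query result lands as a pandas DataFrame ready for scikit-learn or similar:

import duckdb

con = duckdb.connect('quack.db')

# Clean and engineer features in SQL, then hand off to your ML library
# (raw_features and its columns are hypothetical)
features = con.execute("""
    SELECT
        COALESCE(age, 0) AS age,
        LN(1 + income) AS log_income,
        CASE WHEN signup_date >= DATE '2023-01-01' THEN 1 ELSE 0 END AS is_recent
    FROM raw_features
""").df()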
Due to its embeddable nature and small footprint, DuckDB is well-suited for adding analytical capabilities directly into applications. Developers can integrate powerful data processing features into their software without the need for a separate database server.
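As a sketch of embedding, DuckDB's Python API can query an in-scope pandas DataFrame by its variable name, so an application can answer analytical questions with no database server at all (the orders DataFrame and its columns are hypothetical):

import duckdb
import pandas as pd

def top_products(orders: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    # DuckDB resolves 'orders' to the in-scope DataFrame (a replacement scan)
    return duckdb.sql(f"""
        SELECT product, SUM(amount) AS revenue
        FROM orders
        GROUP BY product
        ORDER BY revenue DESC
        LIMIT {n}
    """).df()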
These use cases highlight DuckDB's strengths in handling analytical workloads efficiently, especially in scenarios where simplicity, portability, and performance are key factors.
While not entirely open source itself, MotherDuck is building a cloud service around DuckDB. They've contributed significantly to the DuckDB ecosystem and have open-sourced several tools and extensions for DuckDB, including connectors and utilities.
Apache Superset is a popular open-source data exploration and visualization platform. It has integrated DuckDB as one of its supported database engines, allowing users to leverage DuckDB's performance for data analysis and visualization directly within the Superset interface.
Evidence is an open-source business intelligence tool that uses DuckDB as its primary database engine. It allows users to write SQL queries and create data visualizations, leveraging DuckDB's performance for fast data processing.
Ibis is an open-source Python project that provides a unified way to work with various database backends. It has added support for DuckDB, allowing users to leverage DuckDB's capabilities through the Ibis API.
The open-source ecosystem around DuckDB is continually evolving, with new projects and integrations emerging regularly. Check the DuckDB GitHub repository or the official DuckDB website for the latest developments.
In a world of complex and expensive data warehouse solutions, DuckDB is freaking awesome. It's fast, it's simple, and it just works. Whether you're crunching numbers for a small business or analyzing terabytes of data from S3 buckets, DuckDB has got your back.
So next time you're drowning in data, remember: When it comes to analytics, it's duck season all year round. Give DuckDB a try and watch your productivity soar!
And lastly, if you want a cloud environment to work on DuckDB and don't want to manage and maintain infrastructure on your own, Paradime's got your back too. The Paradime Code IDE comes with native support for DuckDB and MotherDuck so you can do data transformation work without spending money on cloud compute.