DuckDB - a primer

A primer on DuckDB, with code examples, extensions, use cases, and how to use it with dbt™ for data transformation, including with Paradime.

August 1, 2024
7 min read

Ever wished you had a database that's as easy to use as a spreadsheet but as powerful as a full-fledged SQL engine? Meet DuckDB, the plucky underdog of the data world that's making waves faster than a duck on a water slide.

What the Duck Is It?

DuckDB is an embeddable SQL database that fits in your pocket (well, your computer's memory). It's like having a miniature data warehouse right on your laptop, ready to crunch numbers faster than you can say "quack."

Why Should You Give a Duck?

  1. Speed: DuckDB flies through your data like a peregrine falcon. It's designed for analytical queries, so it'll slice and dice your numbers before you can blink.
  2. Simplicity: No complicated setup, no server configuration, no fuss. It's as easy as importing a library in Python or R.
  3. Portability: DuckDB travels light. Your entire database is just a single file, making it perfect for data scientists on the go.
  4. Compatibility: Plays nice with others. DuckDB can read and write Parquet files, CSVs, and even connect directly to Pandas DataFrames.
  5. Free as a Bird: Open-source and free to use. No need to break the piggy bank.

How Does It Stack Up?

  • SQLite: DuckDB's cousin in the embedded database world. But while SQLite is great for transactions, DuckDB soars when it comes to analytics.
  • Pandas: Love Pandas but hate memory issues? DuckDB can handle data larger than RAM without breaking a sweat.

Getting Started

Ready to take the plunge? Here's a quick quack-start guide:

import duckdb

# Create a connection (or use :memory: for an in-memory database)
con = duckdb.connect('quack.db')

# Run a query
con.execute("SELECT 'Hello, DuckDB!' AS greeting").fetchall()

Boom! You're now a DuckDB developer. It's that easy.

Advanced Duckery

Don't let its simplicity fool you. DuckDB packs a punch with features like:

  • Window functions
  • Complex joins
  • Vectorized query execution
  • Automatic min-max indexing (zonemaps)
  • Parallel query processing

Next, we'll dive into each of these with examples.

1. Window Functions

-- Create a sample table
CREATE TABLE sales (
    employee_id INT,
    sale_date DATE,
    sale_amount DECIMAL(10, 2)
);

INSERT INTO sales VALUES
(1, '2023-01-01', 100),
(1, '2023-01-02', 150),
(2, '2023-01-01', 200),
(2, '2023-01-02', 300);

-- Use a window function to calculate a running total per employee
SELECT 
    employee_id,
    sale_date,
    sale_amount,
    SUM(sale_amount) OVER (
        PARTITION BY employee_id 
        ORDER BY sale_date
    ) AS running_total
FROM sales;
2. Complex Joins

-- Create additional tables
CREATE TABLE employees (
    employee_id INT,
    name VARCHAR(50),
    department_id INT
);

CREATE TABLE departments (
    department_id INT,
    department_name VARCHAR(50)
);

-- Join sales to employees and departments
SELECT 
    s.employee_id,
    e.name,
    d.department_name,
    s.sale_amount
FROM sales s
LEFT JOIN employees e ON s.employee_id = e.employee_id
LEFT JOIN departments d ON e.department_id = d.department_id
WHERE s.sale_date = '2023-01-01'
    AND d.department_name = 'Sales';
3. Vectorized Query Execution

DuckDB automatically uses vectorized query execution. You don't need to explicitly enable it. Here's an example that would benefit from vectorization:

-- Create a large table
CREATE TABLE large_data AS 
SELECT 
    range AS id,
    random() AS value
FROM range(1000000);

-- Run a query that benefits from vectorization
SELECT 
    AVG(value) AS avg_value,
    SUM(value) AS sum_value,
    COUNT(*) AS count
FROM large_data
WHERE id % 2 = 0;
4. Automatic Indexing

DuckDB automatically maintains lightweight min-max indexes (zonemaps) on every column, so you don't need to create indexes manually; selective filters can skip whole chunks of data. Here's an example where that kicks in:

-- Create a table with a lot of rows
CREATE TABLE big_table AS 
SELECT 
    range AS id,
    range % 1000 AS category,
    random() AS value
FROM range(10000000);

-- Run a query that can benefit from automatic min-max indexing
SELECT 
    category,
    AVG(value) AS avg_value
FROM big_table
WHERE category BETWEEN 100 AND 200
GROUP BY category
ORDER BY category;
5. Parallel Query Processing

DuckDB automatically uses parallel query processing when possible. You can control the number of threads used with the threads setting:

-- Set the number of threads (optional, DuckDB chooses automatically by default)
SET threads TO 4;

-- Run a query that can benefit from parallel processing
SELECT 
    category,
    COUNT(*) AS count,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value
FROM big_table
GROUP BY category
HAVING COUNT(*) > 1000
ORDER BY count DESC
LIMIT 10;

DuckDB Extensions

DuckDB's extension ecosystem is continuously growing, providing additional functionality to the core database. Here are some popular and useful extensions for DuckDB:

1. httpfs:

This extension allows DuckDB to read and write data directly from HTTP(S) sources. It's particularly useful for working with remote datasets without needing to download them first.

Example usage:

INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_csv_auto('https://example.com/data.csv');
2. Spatial:

The spatial extension adds support for geospatial data types and functions, allowing DuckDB to handle geographic and geometric data.

Example usage:

INSTALL spatial;
LOAD spatial;
SELECT ST_GeomFromText('POINT(0 0)') AS geom;
3. JSON:

This extension enhances DuckDB's JSON handling capabilities, allowing for more complex operations on JSON data.

Example usage:

INSTALL json;
LOAD json;
SELECT json_extract('{"a": 1, "b": 2}', '$.a') AS value;
4. Iceberg:

The Iceberg extension allows DuckDB to read Apache Iceberg tables, which is particularly useful for data lakehouse architectures.

Example usage:

INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('path/to/iceberg/table');
5. Parquet:

DuckDB's Parquet support actually lives in the parquet extension, which is bundled with DuckDB and loaded by default in most distributions. Installing it explicitly, as shown below, is usually a no-op but makes the dependency explicit.

Example usage:

INSTALL parquet;
LOAD parquet;
COPY (SELECT * FROM my_table) TO 'output.parquet' (FORMAT PARQUET);

These extensions significantly expand DuckDB's capabilities, allowing it to handle a wider range of data sources and types. The extension ecosystem evolves quickly as new extensions are developed and existing ones mature, so for the most current information, check out the DuckDB Extensions docs.

DuckDB and dbt™

Analysts can supercharge their data workflows by combining dbt's transformation capabilities with DuckDB's lightning-fast analytics engine. This powerful duo enables efficient local development and rapid data modeling. Below, we illustrate two popular use cases for using dbt™ and DuckDB together.

Use Case 1: Local Development and Testing

Analysts can use dbt with DuckDB to develop and test data models locally before deploying to production. This setup allows for quick iterations and validation of complex transformations without the need for a full data warehouse. By using DuckDB's in-memory capabilities, analysts can work with substantial datasets right on their laptops.
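Wiring this up is mostly a matter of pointing dbt at a DuckDB file via the dbt-duckdb adapter. A minimal profiles.yml sketch (the project name `my_project` and the file path are assumptions for the example):

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb   # local database file; use :memory: for throwaway runs
      threads: 4
```

With this in place, `dbt run` builds your models into the local dev.duckdb file, with no warehouse credentials involved.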

Use Case 2: Data Pipeline Orchestration

For smaller to medium-sized datasets, analysts can build entire data pipelines using dbt™ and DuckDB. This combination is particularly useful for ETL processes, where data can be extracted from various sources, transformed using dbt™ models, and loaded into DuckDB for analysis or further processing.
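A dbt model in such a pipeline can lean on DuckDB's file readers directly. A hypothetical staging model (the file path and column names are made up for illustration):

```sql
-- models/staging/stg_orders.sql (hypothetical model)
SELECT
    order_id,
    customer_id,
    CAST(order_date AS DATE) AS order_date,
    amount
FROM read_csv_auto('data/raw_orders.csv')
```

The extract step collapses into the model itself: DuckDB reads the raw file at query time, and dbt materializes the cleaned result as a table or view.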

Popular DuckDB extensions enhance these workflows:

  1. The 'httpfs' extension allows direct reading from remote sources, enabling analysts to pull data from APIs or cloud storage directly into their dbt models.
  2. The 'parquet' extension optimizes reading and writing Parquet files, which is crucial for efficient data storage and retrieval in dbt pipelines.

By leveraging these tools together, analysts can create robust, efficient, and easily testable data transformation workflows that save money, too.

Top Uses for DuckDB

Here are the top 5 use cases for DuckDB:

1. Data Analysis and Exploration:

DuckDB excels at rapid data analysis, making it ideal for data scientists and analysts who need to quickly explore and query large datasets. Its ability to handle data larger than RAM and its compatibility with tools like Python and R make it perfect for interactive data exploration.

2. ETL (Extract, Transform, Load) Operations:

DuckDB's speed and ability to handle complex SQL operations make it excellent for ETL processes. It can efficiently read from various data sources (like CSV, Parquet, or JSON), perform transformations, and write the results to different formats or databases.

3. Local Data Warehousing:

For smaller to medium-sized datasets, DuckDB can serve as a lightweight, local data warehouse. It offers many features of larger data warehouse systems but without the need for complex setup or infrastructure, making it ideal for individual users or small teams.

4. Data Preprocessing for Machine Learning:

Machine learning workflows often require extensive data preprocessing. DuckDB's speed and SQL capabilities make it an excellent tool for cleaning, transforming, and preparing data for machine learning models, especially when working with structured data.

5. Embedded Analytics in Applications:

Due to its embeddable nature and small footprint, DuckDB is well-suited for adding analytical capabilities directly into applications. Developers can integrate powerful data processing features into their software without the need for a separate database server.

These use cases highlight DuckDB's strengths in handling analytical workloads efficiently, especially in scenarios where simplicity, portability, and performance are key factors.

Who's using DuckDB?

1. Motherduck:

While not entirely open-source, Motherduck is building a cloud service around DuckDB. They've contributed significantly to the DuckDB ecosystem and have open-sourced several tools and extensions for DuckDB, including connectors and utilities.

2. Apache Superset:

Apache Superset is a popular open-source data exploration and visualization platform. It has integrated DuckDB as one of its supported database engines, allowing users to leverage DuckDB's performance for data analysis and visualization directly within the Superset interface.

3. Evidence:

Evidence is an open-source business intelligence tool that uses DuckDB as its primary database engine. It allows users to write SQL queries and create data visualizations, leveraging DuckDB's performance for fast data processing.

4. Ibis:

Ibis is an open-source Python project that provides a unified way to work with various database backends. It has added support for DuckDB, allowing users to leverage DuckDB's capabilities through the Ibis API.

The open-source ecosystem around DuckDB is continually evolving, with new projects and integrations emerging regularly. Check the DuckDB GitHub repository or the official DuckDB website for the latest developments.

The Bottom Line

In a world of complex and expensive data warehouse solutions, DuckDB is freaking awesome. It's fast, it's simple, and it just works. Whether you're crunching numbers for a small business or analyzing terabytes of data from S3 buckets, DuckDB has got your back.

So next time you're drowning in data, remember: When it comes to analytics, it's duck season all year round. Give DuckDB a try and watch your productivity soar!

And lastly, if you want a cloud environment to work on DuckDB and don't want to manage and maintain infrastructure on your own, Paradime's got your back too. The Paradime Code IDE comes with native support for DuckDB and MotherDuck so you can do data transformation work without spending money on cloud compute.

Interested in learning more?
Try out the free 14-day trial.