DuckDB - a primer
A primer on DuckDB, with code examples, extensions, use cases, and how to use it with dbt™ for data transformation, including with Paradime.
Kaustav Mitra
Aug 1, 2024
7 min read
Ever wished you had a database that's as easy to use as a spreadsheet but as powerful as a full-fledged SQL engine? Meet DuckDB, the plucky underdog of the data world that's making waves faster than a duck on a water slide.
What the Duck Is It?
DuckDB is an embeddable, in-process SQL database that fits in your pocket (well, your computer's memory). It's like having a miniature data warehouse right on your laptop, ready to crunch numbers faster than you can say "quack."
Why Should You Give a Duck?
Speed: DuckDB flies through your data like a peregrine falcon. It's designed for analytical queries, so it'll slice and dice your numbers before you can blink.
Simplicity: No complicated setup, no server configuration, no fuss. It's as easy as importing a library in Python or R.
Portability: DuckDB travels light. Your entire database is just a single file, making it perfect for data scientists on the go.
Compatibility: Plays nice with others. DuckDB can read and write Parquet files, CSVs, and even connect directly to Pandas DataFrames.
Free as a Bird: Open-source and free to use. No need to break the piggy bank.
How Does It Stack Up?
SQLite: DuckDB's cousin in the embedded database world. But while SQLite is great for transactions, DuckDB soars when it comes to analytics.
Pandas: Love Pandas but hate memory issues? DuckDB can handle data larger than RAM without breaking a sweat.
Getting Started
Ready to take the plunge? Here's a quick quack-start guide:
Boom! You're now a DuckDB developer. It's that easy.
Advanced Duckery
Don't let its simplicity fool you. DuckDB packs a punch with features like:
Window functions
Complex joins
Vectorized query execution
Automatic indexing
Parallel query processing
Next, we'll dive into each of these with examples.
1. Window Functions
2. Complex Joins
3. Vectorized Query Execution
DuckDB automatically uses vectorized query execution. You don't need to explicitly enable it. Here's an example that would benefit from vectorization:
4. Automatic Indexing
DuckDB uses automatic indexing for certain operations. You don't need to manually create indexes. Here's an example where DuckDB might use automatic indexing:
5. Parallel Query Processing
DuckDB automatically uses parallel query processing when possible. You can control the number of threads used with the `threads` setting:
DuckDB Extensions
DuckDB's extension ecosystem is continuously growing, providing additional functionality to the core database. Here are some popular and useful extensions for DuckDB:
1. httpfs:
This extension allows DuckDB to read and write data directly from HTTP(S) sources. It's particularly useful for working with remote datasets without needing to download them first.
Example usage:
2. Spatial:
The spatial extension adds support for geospatial data types and functions, allowing DuckDB to handle geographic and geometric data.
Example usage:
3. JSON:
This extension enhances DuckDB's JSON handling capabilities, allowing for more complex operations on JSON data.
Example usage:
4. Iceberg:
The Iceberg extension allows DuckDB to read and write Apache Iceberg tables, which is particularly useful for data lakehouse architectures.
Example usage:
5. Parquet:
While DuckDB has built-in Parquet support, the Parquet extension adds additional functionality for working with Parquet files, including improved writing capabilities.
Example usage:
These extensions significantly expand DuckDB's capabilities, allowing it to handle a wider range of data sources and types. The extension landscape keeps evolving as new extensions are developed and existing ones mature. For the most current information, check out the DuckDB Extensions docs.
DuckDB and dbt™
Analysts can supercharge their data workflows by combining dbt's transformation capabilities with DuckDB's lightning-fast analytics engine. This powerful duo enables efficient local development and rapid data modeling. Here we illustrate two popular use cases for using dbt™ and DuckDB together.
Use Case 1: Local Development and Testing
Analysts can use dbt with DuckDB to develop and test data models locally before deploying to production. This setup allows for quick iterations and validation of complex transformations without the need for a full data warehouse. By using DuckDB's in-memory capabilities, analysts can work with substantial datasets right on their laptops.
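With the dbt-duckdb adapter, pointing dbt at a local DuckDB file takes only a few lines of profiles.yml; a minimal sketch, where the profile name and file path are placeholders:

```yaml
my_duckdb_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb   # local database file; use ':memory:' for in-memory
      threads: 4
```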
Use Case 2: Data Pipeline Orchestration
For smaller to medium-sized datasets, analysts can build entire data pipelines using dbt™ and DuckDB. This combination is particularly useful for ETL processes, where data can be extracted from various sources, transformed using dbt™ models, and loaded into DuckDB for analysis or further processing.
Popular DuckDB extensions enhance these workflows:
The 'httpfs' extension allows direct reading from remote sources, enabling analysts to pull data from APIs or cloud storage directly into their dbt models.
The 'parquet' extension optimizes reading and writing Parquet files, which is crucial for efficient data storage and retrieval in dbt pipelines.
By leveraging these tools together, analysts can create robust, efficient, and easily testable data transformation workflows that save money, too.
Top Uses for DuckDB
Here are the top 5 use cases for DuckDB:
1. Data Analysis and Exploration:
DuckDB excels at rapid data analysis, making it ideal for data scientists and analysts who need to quickly explore and query large datasets. Its ability to handle data larger than RAM and its compatibility with tools like Python and R make it perfect for interactive data exploration.
2. ETL (Extract, Transform, Load) Operations:
DuckDB's speed and ability to handle complex SQL operations make it excellent for ETL processes. It can efficiently read from various data sources (like CSV, Parquet, or JSON), perform transformations, and write the results to different formats or databases.
3. Local Data Warehousing:
For smaller to medium-sized datasets, DuckDB can serve as a lightweight, local data warehouse. It offers many features of larger data warehouse systems but without the need for complex setup or infrastructure, making it ideal for individual users or small teams.
4. Data Preprocessing for Machine Learning:
Machine learning workflows often require extensive data preprocessing. DuckDB's speed and SQL capabilities make it an excellent tool for cleaning, transforming, and preparing data for machine learning models, especially when working with structured data.
5. Embedded Analytics in Applications:
Due to its embeddable nature and small footprint, DuckDB is well-suited for adding analytical capabilities directly into applications. Developers can integrate powerful data processing features into their software without the need for a separate database server.
These use cases highlight DuckDB's strengths in handling analytical workloads efficiently, especially in scenarios where simplicity, portability, and performance are key factors.
Who's using DuckDB?
1. Motherduck:
While not entirely open-source, Motherduck is building a cloud service around DuckDB. They've contributed significantly to the DuckDB ecosystem and have open-sourced several tools and extensions for DuckDB, including connectors and utilities.
2. Apache Superset:
Apache Superset is a popular open-source data exploration and visualization platform. It has integrated DuckDB as one of its supported database engines, allowing users to leverage DuckDB's performance for data analysis and visualization directly within the Superset interface.
3. Evidence:
Evidence is an open-source business intelligence tool that uses DuckDB as its primary database engine. It allows users to write SQL queries and create data visualizations, leveraging DuckDB's performance for fast data processing.
4. Ibis:
Ibis is an open-source Python project that provides a unified way to work with various database backends. It has added support for DuckDB, allowing users to leverage DuckDB's capabilities through the Ibis API.
The open-source ecosystem around DuckDB is continually evolving, with new projects and integrations emerging regularly. Check the DuckDB GitHub repository or the official DuckDB website for the latest developments.
The Bottom Line
In a world of complex and expensive data warehouse solutions, DuckDB is freaking awesome. It's fast, it's simple, and it just works. Whether you're crunching numbers for a small business or analyzing terabytes of data in S3 buckets, DuckDB has got your back.
So next time you're drowning in data, remember: When it comes to analytics, it's duck season all year round. Give DuckDB a try and watch your productivity soar!
And lastly, if you want a cloud environment to work on DuckDB and don't want to manage and maintain infrastructure on your own, Paradime's got your back too. The Paradime Code IDE comes with native support for DuckDB and MotherDuck so you can do data transformation work without spending money on cloud compute.