DuckDB is an in-process SQL OLAP database management system
DuckDB is an embedded, columnar database for data science and analytical workloads. Data analysts usually load data locally into tools like pandas or data.table to quickly analyze patterns and form hypotheses before scaling the solution on a server. DuckDB now fits such use cases as well, because it makes larger-than-memory analysis possible. DuckDB supports range joins, vectorized execution, and multiversion concurrency control (MVCC) for large transactions.
- In-process, serverless
- C++11, no dependencies, single file build
- APIs for Java, C, C++, and others.
- Transactions, persistence
- Extensive SQL support
- Direct Parquet & CSV querying
Fast Analytical Queries
- Vectorized engine
- Optimized for analytics
- Parallel query processing
- Free & Open Source
- Permissive MIT License
DuckDB development started in July 2018. The main features of DuckDB are:
- Simple installation
- Embedded: no server management
- Single file storage format
- Fast analytical processing
- Fast transfer between R/Python and RDBMS
- Does not rely on any external state, for example separate config files or environment variables
- Composable interface. Programmatic Fluent SQL API
- Fully ACID through MVCC
Google Trends search data for “DuckDB”
As the Google Trends search data above shows, discussion of DuckDB in the data community has grown over the past 12 months, along with palpable hype.
But what’s this hype all about? Let’s scratch the surface a little bit.
DuckDB is an easy-to-use, open source, in-process OLAP database: it processes data in memory and does not require a dedicated server or service. Many describe it in simplified terms as the SQLite equivalent for analytical (OLAP) workloads.
As an in-process database, DuckDB is a storage and compute engine that enables developers, data scientists, data engineers and data analysts to power their code with extremely fast analyses using plain SQL. Further, DuckDB has the capability to analyze data where it might live, e.g. on the laptop or in the cloud. Additionally, DuckDB comes with a simple CLI for quick prototyping — without the need for setup, permissions, creating and managing tables, etc.
Its performance for analytical workloads on single-node machines seems to be impressive and the setup is pain-free (you can technically start exploring DuckDB within 5 minutes).
DuckDB is embeddable like SQLite and is optimized for analytics. The big deal here is the embeddable part: DuckDB runs like a library inside your application, without the separate server that a typical PostgreSQL setup requires, which eliminates the network latency you usually get when talking to a database.
DuckDB also has a very low deployment effort. It is also fast: compared to querying Postgres, DuckDB is 80X faster, and benchmarks against other systems show similarly impressive results.
These are some of the reasons DuckDB has witnessed impressive growth over the past 12 months.
In practice, any CPU can be used to do effective analytics with DuckDB. Furthermore, DuckDB has no external dependencies and is portable and modular. This means that you can run DuckDB on your laptop, a cloud virtual machine, or a cloud function.
DuckDB use cases
There are two important use cases for DuckDB:
Interactive data analysis: Many organisations rely on data scientists to make sense of their data so that they can make better business decisions. Today, the most popular way for data scientists to explore data in their local environments is to write Python or R code using libraries like Pandas, dplyr, etc. DuckDB gives data scientists who want to use SQL for their local development work another option. SQLite does not shine here because it is slow for OLAP workloads and does not have all the functions required for data analytics work.
Edge computing: This use case is becoming popular with the rise of edge computing in the last couple of years. Edge computing is a distributed computing paradigm that brings computation and data storage closer to the location where they are needed, to improve response times and save bandwidth. With embeddable databases like DuckDB, data can be analyzed on the edge, giving better results faster.
There are many database management systems out there. But as noted by the DuckDB creators, there is no one-size-fits-all database system. Each makes different trade-offs to better fit specific use cases. DuckDB is no different.
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single-user workloads. Enter DuckDB.
DuckDB allows an entire community of SQL enthusiasts to be instantly productive in Python without ever learning more than very basic Pandas. There’s a growing number of data community members who never use Pandas for anything complex anymore because they favor SQL.
For more details contact email@example.com