A Rebel in Database Industry: Big Data’s Dead, MotherDuck’s Born

“Big data” is dead – the most important thing we can do today is not to worry about the size of the data, but to focus on how we will use it to make better decisions.

The database industry has changed rapidly in recent years, most visibly in the explosion of cloud data warehousing. Cloud data warehouses have become the cornerstone of the enterprise data stack, and companies and organizations of all sizes now routinely use them to analyze business data; the rapid rise of Snowflake is typical of this trend.

However, if we break big data down into its three classic dimensions of velocity, volume, and variety, we find that the dimension of greatest concern is still velocity. When we revisit the definition of “big data” in terms of data assets, the most important requirement is low-latency consumption, by microservices, of data assets derived from OLTP [1] databases.

Many big data departments have bought all the new tools and migrated off their legacy systems, only to find that they still cannot make sense of their data; perhaps the size of the data was never the problem. Data is certainly getting bigger, but hardware is getting bigger at an even faster rate, and vendors keep pushing the limits of what a single machine can handle. Today we are going to look at a database startup that thinks a little differently, MotherDuck, and see how the product it builds on, DuckDB, makes sense of the big data era.

The story of MotherDuck begins with DuckDB, a purpose-built in-process online analytical processing database management system designed for efficient data analysis. In just two years, from the first open-source release of DuckDB in 2019 to 2021, its weekly downloads grew rapidly. At that point the project, originally created at the Dutch national research institute for mathematics and computer science (CWI), was spun off to operate independently, and project researchers Hannes Mühleisen and Mark Raasveldt founded DuckDB Labs.

At this point in the story, why hasn’t MotherDuck appeared yet? Don’t worry, we are still missing another lead: Jordan Tigani, a founding engineer of Google BigQuery, who had been following DuckDB and was looking to bring a lightweight database product to market. After talking to DuckDB Labs co-founder Mühleisen and winning his support, Tigani set out to commercialize the open-source DuckDB. The new company, MotherDuck, was born, and it received a $12.5 million angel round led by Redpoint and a $35 million Series A led by a16z, valuing the company at $175 million.

In retrospect, this is remarkable recognition from investors for such a young startup. And since DuckDB is not MotherDuck’s own open-source project, the support of the project’s founding team matters a great deal: building a long-term, stable service on top of someone else’s open-source product is hard to do without it.

The DuckDB team is closely tied into the partnership: MotherDuck is a member of the DuckDB Foundation, the non-profit organization that owns most of DuckDB’s intellectual property, and DuckDB Labs, the project’s own commercial arm, is a shareholder in MotherDuck. Tigani’s partnership with DuckDB Labs is a smart move, one that binds the interests of both parties.

To talk about DuckDB, we need to start with SQLite, arguably the most widely deployed relational database system in the world. It is ubiquitous: it ships on almost every phone, browser, and operating system, and it even runs on airplanes.

Because SQLite is embedded, it does not require an external server to manage it, and it has bindings for almost every programming language. These qualities make it remarkably easy to use, and we must give SQLite credit for its greatness. But its limitations stand out just as clearly: SQLite is designed for OLTP, stores data row by row, cannot take full advantage of memory to speed up computation, and has a very limited query optimizer, so it is quite unfriendly to analytics.

It is for this reason that DuckDB saw an opportunity. Simply put, DuckDB is SQLite for analytics (the OLAP domain [2]): as an in-process database, it lets developers, data scientists, data engineers, and data analysts embed very fast analytical capabilities in their own code using plain SQL. It can also analyze data wherever that data already lives, whether on a laptop or in the cloud.
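
To make the in-process idea concrete, here is a minimal sketch using DuckDB’s Python API; the file name events.csv and the column names are illustrative placeholders, not part of any real dataset.

```python
# A minimal sketch of using DuckDB in-process from Python.
import duckdb

# No server to install or manage: the database runs inside this process.
con = duckdb.connect()  # in-memory database; pass a file path to persist

# Query a local file directly with plain SQL, right where the data lives.
# "events.csv", "user_id", and "events" are placeholder names.
result = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.csv'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()

print(result)
```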

DuckDB uses a columnar, vectorized query engine. It still interprets queries, but each operation processes a large batch of values at once, which reduces the per-row overhead of traditional row-at-a-time systems such as PostgreSQL, MySQL, or SQLite and improves query performance.
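
As a rough illustration of why batching helps, here is a toy sketch in Python, using NumPy as a stand-in for a columnar engine; it is a simplification of the idea, not DuckDB’s actual C++ internals. The first function pays the interpretation overhead (the Python loop) once per row, the second pays it once per batch of values.

```python
# Toy illustration of row-at-a-time vs. vectorized execution.
import numpy as np

prices = np.random.rand(1_000_000)
quantities = np.random.randint(1, 10, size=1_000_000)

def total_row_at_a_time(prices, quantities):
    # Overhead is paid once per row, as in a classic tuple-at-a-time engine.
    total = 0.0
    for p, q in zip(prices, quantities):
        total += p * q
    return total

def total_vectorized(prices, quantities, batch_size=2048):
    # One operation processes a whole batch of values, so the
    # interpretation overhead is paid once per batch, not per row.
    total = 0.0
    for start in range(0, len(prices), batch_size):
        total += np.dot(prices[start:start + batch_size],
                        quantities[start:start + batch_size])
    return total
```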

[Figure: DuckDB in quadrant. SQLite is a small relational database for in-process deployments; DuckDB occupies the analogous in-process position for analytical workloads.]

MotherDuck’s beliefs about the industry differ from those of most companies in it.

First, Tigani believes that most customers and organizations have modest data stores rather than huge ones. Customer data sizes also follow a power-law distribution: the largest customer has twice as much storage as the second largest, the second largest has twice as much as the third, and so on. So although a handful of customers hold hundreds of petabytes of data, sizes fall off quickly down the ranks.
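
Taken literally, that rule of thumb halves the data size at each rank. A quick sketch with a hypothetical 100 PB largest customer (an assumed figure, purely for illustration) shows how fast the tail drops off.

```python
# Sketch of the "sizes halve with each rank" rule of thumb.
largest_pb = 100.0  # hypothetical largest customer, in petabytes

sizes = [largest_pb / 2 ** rank for rank in range(20)]  # ranks 1..20

for rank, size in enumerate(sizes[:12], start=1):
    print(f"customer #{rank}: {size:.3f} PB")

# The tail adds little: the total across all customers is bounded
# by roughly twice the largest customer's size.
print(f"sum of top 20: {sum(sizes):.1f} PB")
```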

Second, when storage and compute are separated, the balance tilts toward storage: data volume grows faster than compute demand. Even if a business is static, neither growing nor shrinking, its data still grows linearly over time, while compute requirements change little because most analysis is done on recent data. This bias toward accumulated data over computation means we may not need distributed processing at all. Besides, many users simply want fast answers to their questions; they do not want to wait on the cloud.

Finally, most data is rarely queried. A large percentage of the data that gets processed is less than 24 hours old, and by the time data has been stored for a week it is perhaps 20 times less likely to be queried than data from the most recent day. Because historical data is touched so rarely, the working set is far more manageable than the raw table size suggests: a petabyte-scale table holding 10 years of data may end up with a working set that compresses to less than 50 GB. Many cloud vendors focus on query performance at the 100 TB scale, which is not only irrelevant to most users but can also distract those vendors from delivering a great user experience.
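
A back-of-the-envelope sketch shows how a number in that ballpark can come about; the one-day working set and the compression ratio here are assumptions chosen for illustration, not measurements.

```python
# Back-of-the-envelope working-set estimate; all numbers are illustrative.
table_size_pb = 1.0            # ~1 PB of raw data covering 10 years
days_of_history = 10 * 365

# If most queries only touch the most recent day of data...
daily_slice_gb = table_size_pb * 1_000_000 / days_of_history  # ~274 GB

# ...and columnar compression shrinks that slice further (assume ~6x)...
compression_ratio = 6
working_set_gb = daily_slice_gb / compression_ratio            # ~46 GB

print(f"daily slice: {daily_slice_gb:.0f} GB, "
      f"compressed working set: ~{working_set_gb:.0f} GB")
```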

As a result, MotherDuck puts forward its argument: big data is real, but most people probably do not need to worry about it. “Big data” is dead; the most important thing we can do today is not to worry about the size of the data, but to focus on how we will use it to make better decisions. It is worth asking from time to time: do organizations really generate that much data? If they do, do they really need to use all of it at once? And if so, is it really too big to fit on one machine? Different organizations will give different answers.

We live in an era of rapid change that has given rise to many database management systems, and there is still no one-size-fits-all system. Each makes different trade-offs to fit particular use cases, and DuckDB is no exception: sometimes we need to serve many concurrent users, and sometimes we need an embedded database that is very fast for single-user workloads.

Will DuckDB and MotherDuck succeed? The answer is far from certain. But we do see a vibrant open-source community forming, and while no commercial offering has been disclosed yet, we should give this Series A company the patience it deserves; after all, the story is just beginning.

[Figure: DuckDB’s star count growth on GitHub]

Notes:

[1] OLTP: Online Transaction Processing, also known as transaction-oriented processing.

[2] OLAP: Online Analytical Processing, a software technology that enables analysts to quickly, consistently, and interactively examine information from many perspectives in order to gain a deeper understanding of the data.

Author: Harbour Zheng, a middle-aged infrastructure startup veteran and founder of the open-source community around CnosDB, a cloud-native time-series database.