Airbyte, The Future of Data Integration

Gartner has predicted that by 2025, 80% of organizations seeking to expand their digital businesses will fail because they have not adopted modern approaches to data and analytics governance.

The data ecosystem is the most important part of the infrastructure ecosystem, and processing, distributing, and computing on data as it circulates are all essential. Since data began to be centralized in data warehouses and data lakes, data integration has changed dramatically, producing what we now call the modern data stack. However, what is modern today may be outdated tomorrow.

Data governance is increasingly important. We often find that 80% of data-driven business is supported by 20% of the data, while 80% of data quality issues are caused by 20% of the systems and people.

One of the most critical issues in this regard is data integration, which brings us to the E (extract), T (transform), L (load) and reverse ETL problems at the bottom of the modern data stack. Enterprises are expected to keep increasing the number of internal connectors they must build and maintain. Today, we would like to introduce Airbyte, an open-source data integration platform that focuses on ELT pipelines.

In 2020, Michel Tricot (former Engineering Director and Integration Lead at LiveRamp and RideOS) and John Lafleur (a serial entrepreneur focused on developer tools and B2B services) co-founded Airbyte. Initially, the team wanted to focus on data connections for marketing companies and joined the YC accelerator with this idea. However, due to factors such as the pandemic, that idea did not succeed. The team then decided to go deeper into data integration, which led to Airbyte as we see it today: a data engineering solution that is not limited to any specific industry, and that offers a graphical UI for building connectors as well as APIs for developers to hook into.

The team believes that many companies start by building their own data connectors, and the initial results may be good. In the long run, however, they discover that the real complexity, and therefore the real cost, of data integration lies in maintenance. Even companies that specialize in building connectors struggle to keep up, because each connector must be updated as its source changes. Airbyte hopes to become the standard for replicating data.

Subsequently, from July to September 2020, the Airbyte team held 45 calls with leading customers of ETL/ELT tools. They found that even when customers paid for these solutions, they still had to build and maintain some connectors themselves, because the connectors they needed were either unsupported or supported in ways that did not meet their needs.

In addition, most ETL/ELT platforms are cloud-based and require data to be moved out of the customer's own infrastructure. This not only adds unnecessary cost but also exposes a growing number of companies to data privacy and security risks. In the end, engineers still have to develop and maintain connectors themselves. This research made the Airbyte team more convinced of the direction they had chosen.

Airbyte has been gaining more and more attention. According to information disclosed by Airbyte itself, its usage in November 2020 was twice that of October. Until February 2021, Airbyte grew 100% every month, with 500 deployments per month. These strong numbers attracted a $5.2 million seed round from Accel. Only three months later, in May of the same year, Airbyte completed a $26 million Series A round led by Benchmark.

By November 2021, Airbyte had reached 100k deployments, and the number of connectors had also grown rapidly. At this point, it completed a $150 million Series B funding round led by Altimeter Capital and Coatue Management, which valued the company at $1.5 billion. Founded in 2020, the company reached unicorn status in less than two years.

To talk about ELT, we still need to start with traditional ETL. Traditionally, when we start building a data warehouse, we need to first understand the business processes, clarify how the business operates, and how data is traced. By collecting relevant user requirements, we can then plan and design reports. Enterprises need to carry out a series of operations such as data warehousing, layering, and logical modeling to build tables in the data warehouse.

After this, the enterprise needs to perform ETL operations. Since most data warehouses only accept SQL-based relational data structures, enterprises need to convert non-compliant data into relational form before loading it. This approach is prevalent with on-premises databases that have limited memory and processing power. The main problem with ETL is that the process is long and cumbersome. If the business or the underlying data changes frequently, the ETL process must be adjusted each time, which wastes time, is limited by throughput, and is extremely expensive.

Therefore, ELT emerged. Engineers discovered that the complexity of ETL was mainly due to the strong coupling of T and L, so the core idea of ELT is decoupling. Unlike ETL, ELT does not require data transformation before the loading process. ELT directly loads raw data into the data warehouse. Using ELT data pipelines, processes such as data cleaning, enrichment, and data transformation are all completed within the data warehouse. The raw data is stored in the data warehouse indefinitely, allowing for multiple transformations.
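The decoupling described above can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for a data warehouse; the sample records, the table names `raw_events` and `customer_totals`, and the function names are all hypothetical and chosen only for illustration. This is not how Airbyte itself is implemented.

```python
import sqlite3

# Hypothetical raw records, exactly as a source system might emit them.
# Amounts arrive as text and currency codes are inconsistently cased.
RAW_EVENTS = [
    ("alice", "1990", "usd"),
    ("bob", "500", "USD"),
    ("alice", "10", "usd"),
]


def extract():
    """Extract: yield records from the source without changing them."""
    yield from RAW_EVENTS


def load(conn, records):
    """Load: write the raw records into the warehouse as plain text columns."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events "
        "(customer TEXT, amount_cents TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", records)


def transform(conn):
    """Transform: clean and aggregate inside the warehouse, using SQL only."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer,
               UPPER(currency) AS currency,
               SUM(CAST(amount_cents AS INTEGER)) AS total_cents
        FROM raw_events
        GROUP BY customer, UPPER(currency)
        """
    )


def run_elt():
    """Run E and L first, then T; the raw table is kept for later re-use."""
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    return conn


if __name__ == "__main__":
    warehouse = run_elt()
    query = "SELECT * FROM customer_totals ORDER BY customer"
    for row in warehouse.execute(query):
        print(row)
```

Because `raw_events` stays in the warehouse untouched, `transform` can be rewritten and re-run at any time without extracting or loading the data again, which is the practical benefit of separating T from L.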

The advantages of ELT include breaking through performance bottlenecks, simpler programs, replaceable components, and lower maintenance costs. In particular, once the steps are decoupled, the pipeline can adapt to fast-changing business needs, and both flexibility and efficiency improve greatly.

Airbyte's main products are still Extract and Load, which handle data extraction and loading, respectively. Simply put, the platform moves data between many systems using connectors. The more data sources the platform connects, the more stable its position becomes and the higher its barriers to entry. Airbyte focuses on and embraces the open-source ecosystem.
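To make the connector idea concrete, here is a minimal sketch in which every source and every destination implements one small interface, so any source can be paired with any destination. The `Source` and `Destination` protocols, the example classes, and the `sync` function are hypothetical names invented for this illustration; they are not Airbyte's actual connector API.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, object]


class Source(Protocol):
    """Anything that can produce records (hypothetical interface)."""

    def read(self) -> Iterator[Record]: ...


class Destination(Protocol):
    """Anything that can accept records (hypothetical interface)."""

    def write(self, records: Iterable[Record]) -> int: ...


class ListSource:
    """Toy source backed by an in-memory list."""

    def __init__(self, rows: List[Record]) -> None:
        self.rows = list(rows)

    def read(self) -> Iterator[Record]:
        yield from self.rows


class MemoryDestination:
    """Toy destination that collects records in memory."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of records written


def sync(source: Source, destination: Destination) -> int:
    """Copy every record from the source to the destination."""
    return destination.write(source.read())


if __name__ == "__main__":
    dest = MemoryDestination()
    written = sync(ListSource([{"id": 1}, {"id": 2}]), dest)
    print(written, dest.rows)
```

With a shared interface like this, supporting a new system means writing one connector rather than one integration per pair of systems, which is why connector count matters so much to a platform of this kind.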

Next, Airbyte also offers a Transform product, which integrates with an open-source tool called dbt (dbt Labs is also a unicorn, valued at $4.2 billion). Users can write SQL statements to transform data with Transform. Here we can also see how healthy the infrastructure ecosystem in the United States is: each company focuses on its own domain and integrates with others through the ecosystem, rather than building a product that tries to do everything.

Lastly, Airbyte also offers an Embed product, which mainly addresses the repeated building of BI tools and front-end pages. Normally, after a company migrates its data to the cloud, each customized reporting requirement means setting up a data warehouse and a BI tool. The Airbyte Embed product simplifies this process: once the data has been moved to the cloud data warehouse, analysis reports are generated automatically, saving time.

After so much discussion of ETL and ELT, what is the opportunity for Airbyte, a rising unicorn focused on the ELT track? I think everything starts with the cloud. With the rise of cloud computing, data warehouses have moved to the cloud at an accelerating pace. Cloud features such as on-demand usage and elastic scalability have deeply influenced the transformation of the entire infrastructure software industry. In the early days, many so-called “cloud data warehouses” simply moved the physical hardware environment onto the cloud, without separating storage from computation and without achieving elastic scalability. This kind of “cloudification” did not take advantage of the cloud environment’s characteristics.

The industry’s transformation came with Snowflake’s emergence as a cloud-native data warehouse in 2014. It integrated deeply with the cloud platform through a multi-cluster, shared-data architecture that separates storage from computation. For traditional enterprises relying on locally deployed resources, computation, storage, and network bandwidth are relatively expensive and constrained. In that setting it was understandable to put the T step between E and L, because hardware cost had to be balanced against computing efficiency.

However, Snowflake’s emergence as a cloud-native data warehouse has brought continuous cost reductions in enterprise computation and storage, meaning that enterprises can directly store untransformed data in the data warehouse. In fact, more and more data is being stored in the cloud, providing a fertile ground for the rise of ELT.

On the other hand, we have to talk about the explosion of enterprise data. Data has become an essential element of modern enterprise success. More and more enterprises need to aggregate data, whether structured, semi-structured, or unstructured, and they hope to collect and process it through a unified platform interface. It is precisely the growth of these data resources that has driven the digitalization of enterprises. They need more flexible and agile ways to process data, and traditional ETL clearly cannot meet these requirements.

Airbyte’s business model is a typical open source business model, with a free version, cloud version, and enterprise version.

The open-source version is a free, self-service solution. It gives users unlimited connectors, replication, monitoring, and community support. The cloud version provides all the features of the open-source version plus cloud hosting of the platform, and it charges based on credits. Credit consumption is tied to infrastructure compute time. The cloud version also comes with cloud data hosting, data management, multiple workspaces, and more.

The cloud version offers a 14-day free trial period, after which it is charged at a rate of $2.50 per credit per month. The enterprise version is for users with large data processing needs, and it is priced according to the customer's use case. Airbyte does not charge for failed customer use cases. Airbyte hopes to meet the industry’s demand for long-tail connectors through the open-source model and a paid contributors program, a demand that closed-source products mostly cannot meet. In addition, it hopes that open source will speed up adoption of its connectors across the industry, which in turn improves product reliability.
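As a rough illustration of credit-based pricing, the sketch below estimates a bill from compute time. Only the $2.50 per credit rate comes from the text above; the `credits_per_hour` conversion between compute time and credits is a hypothetical parameter, since the actual conversion is not stated here.

```python
from decimal import Decimal

# Rate quoted in the article.
PRICE_PER_CREDIT = Decimal("2.50")


def estimate_cost(compute_hours, credits_per_hour):
    """Estimate the charge in USD.

    credits_per_hour is a hypothetical conversion rate, used only to
    show the arithmetic: credits = hours * credits_per_hour.
    """
    credits = Decimal(str(compute_hours)) * Decimal(str(credits_per_hour))
    return (credits * PRICE_PER_CREDIT).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # 10 compute hours at an assumed 1 credit per hour.
    print(estimate_cost(10, 1))
```

`Decimal` is used instead of floating point so that currency amounts are rounded to cents predictably.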

In fact, open source has improved Airbyte’s business flywheel, accelerated its product improvement, and provided better competitive advantages. It allows active contributor communities to participate in releasing their own data connectors for the benefit of all, which is one of the important reasons for the rapid growth of their connectors.

At the product level, the open-source model helps connectors maintain a high level of reliability. Airbyte incentivizes open-source contributors to maintain the connectors they contribute. Individuals and companies can also browse and download connectors from the Airbyte Marketplace, much like an app store.

The open-source model also seems to have won recognition from investors. When Airbyte completed its Series B financing in December 2021, its ARR was less than $1 million, yet it received a valuation of $1.5 billion. According to Airbyte’s own website, it currently synchronizes more than 600 TB of data per month, has been used by more than 25,000 companies, and has over 10,000 community members. We have reason to keep watching for future financial information from Airbyte in order to track its commercialization.

A data integration platform that can quickly link data from different sources and build more connectors will establish barriers to entry. This market is likely to show the Matthew effect and a winner-takes-all dynamic. At the same time, Airbyte is not alone: both new and established players are entering the data integration market.

Among established players is Fivetran, the earliest ETL tool provider in the industry, founded in 2012 (a unicorn valued at $5.6 billion, currently shifting to ELT), which is also dedicated to building connectors for widely used platforms and data sources. Its advantage is that it is one of the most mature data integration platforms and is trusted by some of the largest companies in the world. However, its pricing is high, its support for long-tail data connectors is limited, and there is little room for customers to develop connectors themselves. It is, of course, closed source.

Among new players is Meltano, which was spun off from GitLab in 2021 and is also open source. Unlike Airbyte, however, it builds on the Singer protocol and does not yet offer no-code or low-code options, making it more suitable for data engineering teams with relatively strong technical skills.

In any case, Airbyte’s story and challenges will continue, and we will continue to follow it.

Zheng Bo (Harbour Zheng) is a well-known entrepreneur in the field of B2B infrastructure software. He is also the initiator of the CnosDB cloud-native time series database open-source community.