OLAP vs. OLTP
The universe of data jargon is pretty largeā today, we have data warehouses, lakes, and lakehouses. While I simply want to hike to Delta Lake, itās now a storage technology, too.
Data warehouses are the OG of data jargonā coined by the IBM Systems Journal in 1988, a data warehouse is described as āa system used for the analysis and reporting of structured and semi-structured data.ā
Weāll go even simpler than thatā itās just a relational database. Good, at least that part was easy.
Storage Wars
The biggest choice in data might just be database selection. If youāre starting a data team, whether to use a cloud-native data warehouse (BigQuery, Snowflake) or a cloud-hosted traditional database (Postgres, MySQL) will occupy a large portion of your time.
These are more commonly referred to as OLAP (Online Analytical Processing) and OLTP (Online Transactional Processing) databases, respectively. Weāll talk about why you might choose one or the other and the future of data warehousing.
Why OLTP?
OLTP systems are engineered to handle transactional data originating from multiple users. This usually takes the form of a row-oriented database. Many traditional database systems are OLTP: Postgres, MySQL, etc.
If you currently work at a company and you hear engineers talk about a āprodā database (or you work on the prod database), itās likely an OLTP store. Why? Because OLTP is great for inserting and retrieving transactional data (one row at a time).
To insert a transaction, you need to write an entire row. This is perfect, since OLTP systems operate by row. Data is written and retrieved one row at a time. Historically, databases were used to power production systems, so they were designed to effectively write and retrieve one thing at a time, really well. Cool, right?
Why OLAP?
As data became big, this started to present a problem. What happens when you need to query 1 million rows? 10 million rows? Since we just talked about how row-oriented databases work, letās assume you want to sum revenue by month over a table with 10 million rows. Since OLTP sources read data one row at a time, youāll have to process every column of every row in the table. Not good.
Furthermore, traditional OLTP databases require configuration. Youāll have to understand horizontal vs. vertical scaling, provisioning, and a whole bunch of DevOps stuff. Thatās not a bad thing per se, but itās not data. Itās DevOps. So now, as a data engineer, youāll have to go through a DevOps team, which can be a huge bottleneck, or learn DevOps, which isnāt part of your competitive advantage (Hint: itās data).
From this need, we saw the rise of āserverlessā OLAP systems. These systems, think BigQuery, Snowflake, and Redshift allow for elastic scale in both storage and transactional volume. That means they can scale up to process a demanding query, then return to baseline once executed. Running low on space? You can add more storage with a single click. Theyāre highly optimized for analytic workloads.
What do we mean by āanalytic workloads?ā Well, tasks that are common in analytics: aggregating data, joining data, etc. These correspond to GROUP BY, JOIN, and WINDOW in SQL.
By definition, analytic workloads are not evenly distributed. You might have a complex query that takes 10 minutes to run, then 5 hours of downtime, then another complex query and an ETL pipelineā serverless makes sense to scale up/down according to uneven loads throughout the day (and hopefully no load at night).
Now letās look back at our revenue query. Before, we were doing something like this:
No bueno. But now, weāre reading one column at a timeā so that ten million row aggregation? You only need to scan two columns, date and revenue. In a table with 10 columns, thatās 20% of the original data! Now it looks something like:
So now weāre crazy efficient⦠and our systems can automatically scale. Thatās the power of serverless OLAP systems.
OLAP systems are most commonly used by analytics and data science teams for their speed, stability, and low maintenance cost. To summarize:
The Future of Data
So, there are a lot of folks out there saying ādata warehouses are dead,ā but most of them have products that are built entirely on data lakes and competitors with data warehousing solutions.
āData warehouses are dead. Oh and we sell data lakehouses as a service.ā
Now, donāt get me wrong, data lakes are pretty cool, but most data teams still need a data warehouse. More than the architecture, they provide massive amounts of computing power with a SQL interface that anyone can understand. That means analysts and analytics engineers can write SQL workflows that parallelize and process relatively large amounts of data, with virtually zero upkeep.
Once ubiquitous, tech like Spark and Hadoop is now reserved for the largest data teamsā Facebook, Netflix, etc. So, unless youāre working with petabyte-scale data, data warehouses are still very much useful.
But data is a fast moving space and the future is bright. Tools like DuckDB and metadata management like Delta, Hudi, and Iceberg are paving the way for systems that can read Parquet directly into a relational system or create virtualized databases that can leverage in-memory, relational operations.
If weāre going to do data puns, they might as well be ducking good.
Personally, Iām excited for the day we can have āvirtualā (in quotes because, well, everything is virtual now) data warehouses that sit atop Parquet files and act nearly identically to current warehousing solutions. I welcome our new Duck overlords. I think weāre far from that point, but until then, a boy can dream.