The Data Game Has Changed

Buildium Life
7 min read · Oct 24, 2019

The concept of a data warehouse was first introduced in the late 1980s. Warehouses allowed organizations to bring their operational data to a central location for analysis. To take advantage of the data, it needed to be modelled (shaped) efficiently to meet the needs of the end users and systems. As the concept and the technology progressed, the architecture took a clear direction and became the bedrock of many reporting systems. This was also the path that Buildium first traveled not too long ago.

In this article, I want to highlight some of the “a-ha” moments we had on our path to a data solution at Buildium: from understanding the impact of ETL, to data and team flexibility, to actually understanding our requirements and priorities.

Where we began

You have probably seen the typical diagram before (OLTP feeding ETL feeding OLAP), but in case you haven’t, here is the acronym breakdown:

  • OLTP — OnLine Transaction Processing. Where your transactional data is stored, whether it is SQL or NoSQL.
  • ETL — Extract, Transform, Load. Take the data, clean the data, and make it look the way you want it to look in the warehouse. Typically built as complex scheduling and code in tools like Microsoft SSIS, Talend, and Pentaho (see the sketch after this list).
  • OLAP — OnLine Analytical Processing. Your data engine, tuned to support the consumers of the data. Redshift, Oracle, SAP, and Hadoop are common solutions.
  • Bonus: DataMart — A data baby of the warehouse. Uses aggregated data from the warehouse to support another line of the business (e.g. marketing automation).
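
To make the “T” concrete, here is a minimal sketch of what a single nightly ETL job boils down to. The table and column names are illustrative assumptions, not our actual pipeline, and the connections are generic DB-API style objects with qmark (“?”) placeholders.

```python
# Minimal nightly ETL sketch. Table and column names are hypothetical;
# oltp_conn and warehouse_conn are DB-API style connections (qmark paramstyle).

def run_nightly_etl(oltp_conn, warehouse_conn, since):
    # Extract: pull recent transactions from the OLTP system.
    src = oltp_conn.cursor()
    src.execute(
        "SELECT id, property_id, amount_cents, created_at "
        "FROM payments WHERE created_at >= ?",
        (since,),
    )
    rows = src.fetchall()

    # Transform: clean and reshape the rows to match the warehouse schema.
    cleaned = [
        (r[0], r[1], r[2] / 100.0, r[3])      # cents -> dollars
        for r in rows
        if r[2] is not None and r[2] >= 0     # drop bad records
    ]

    # Load: write the shaped rows into a fact table in the warehouse.
    dst = warehouse_conn.cursor()
    dst.executemany(
        "INSERT INTO fact_payments (id, property_id, amount, created_at) "
        "VALUES (?, ?, ?, ?)",
        cleaned,
    )
    warehouse_conn.commit()
```

Multiply this by every table and every downstream model, wrap it all in scheduling and retries, and you have the “ETL” box from the diagram.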

The typical architectural diagram for a warehouse solution looks simple, but it is always deceiving. Behind every box marked “ETL” are hundreds if not thousands of batch jobs or stream processing layers. The box marked “warehouse” hides the hours spent tuning the size of the cluster to perform at peak usage without bleeding money when idle. Determining how to orchestrate loading data into the warehouse while reading data out of it can be a monumental task.

At Buildium, we did a couple of proof-of-concept projects to demonstrate the performance of a warehouse serving reports to our customers. Our projects focused on modelling and properly partitioning the data, and we isolated the problem to tuning the system for the queries our customers were going to execute. With little effort, and little surprise, the results were clear: columnar storage could outperform our current reporting environment, which was built on our transactional system.

However, we had tackled the easy problem. The real challenge was figuring out how to get the data to the warehouse in a dependable and highly scalable way.

You said the game changed…

The emphasis on ETL was historically driven by the limitations of the warehouse. There was material cost overhead to an inefficient schema or to storing unneeded and unclean data. But that has changed. With data warehouse solutions like Snowflake, Azure, and BigQuery, the cost of compute is a fraction of what it once was.

The new data solutions have introduced a fundamental change. ETL is on its way out of the data architecture. Push your data to powerful and scalable cloud services and do the work there.

A-ha #1: ELT vs ETL

When folks weren’t looking, ETL shifted its acronym to ELT (Extract, Load, Transform) and data lakes were born. While a typical warehouse holds data refined for a purpose, a data lake holds data in the raw. Dump your data to a central location and model it on demand. Various architectural patterns exist to postpone the “T” as long as possible, or to remove it entirely if you are using the data for data science. Don’t get me wrong, a warehouse still has its purpose and still exists in the modern architecture. Buildium’s Lead Data Architect, Tammy Hamilton, says it like this:

“A data lake and a data warehouse are two different things with two different purposes and generally two different audiences. Data lakes are for data scientists and situations where the questions to be asked are not known and data is kept in its original transactional data structure. There is no filtering or aggregation or data governance or master data management or transformation applied. A data warehouse is the opposite. It typically has a dimensional structure using fact and dimension tables. It goes through master data management, data governance, quality and other business logic related transformations.”

So there is a need for both in the end, but there is power in getting the data into a centralized place in a simple way. What is the secret behind a highly scalable data lake?

Files.

Stop laughing.

Using cheap cloud storage like S3 or Azure Data Lake, you move data in its raw form to a central location. Then, focus highly scalable cloud computing resources on ingesting and exposing the data to the consumers in the form they require. Yes, this might mean a data warehouse, but it doesn’t have to.
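
As a simplified example, landing a raw extract in the lake can be as small as a single upload call. The bucket name, key layout, and file path below are assumptions for illustration, not our actual setup.

```python
# Land a raw extract in the data lake. Bucket name and key layout are hypothetical.
from datetime import date

import boto3

s3 = boto3.client("s3")

def land_raw_extract(local_path: str, table_name: str) -> None:
    # Partition by table and load date so downstream jobs can find the files later.
    key = f"raw/{table_name}/load_date={date.today().isoformat()}/{table_name}.csv"
    s3.upload_file(local_path, "example-data-lake", key)

land_raw_extract("/tmp/payments_extract.csv", "payments")
```

No modelling, no cleansing; the transformation work happens later, with compute that can scale independently of the source system.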

A-ha #2: Creating efficient transformations and flexible teams

Central to the “old way” was that in order to efficiently get the data out of the warehouse, data needed to be appropriately structured for the warehouse before it arrived. In ETL, long-running jobs would query data from the source system and stitch it together with what already existed in the warehouse. Typically, this was done overnight in order to avoid the production load. A side effect is that the jobs are tightly coupled to the source system. Which is bad. To prevent production failures, the data team and the owner of the source system needed a shared process for introducing new columns or fields and having that data replicated to the warehouse.

With ELT, your primary concern is moving data. The source team still needs to abide by the same rules for change data capture that they would in a typical ETL process. However, the data team has more autonomy and can own how the data is modelled and processed for downstream consumption. With this clear separation, the source teams can focus on high-performing transactional systems, choosing the data repository that fits the needs of the product. Meanwhile, the data team can focus on consuming the raw data and processing it with the tech and resources designed for that work.
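
As a hedged sketch of what that separation looks like in practice, here is roughly what the “L” and “T” become when the warehouse does the heavy lifting. Snowflake syntax is used here only because it is one of the products mentioned above; the stage, schema, and table names are made up, and credentials would come from your own configuration.

```python
# ELT sketch: load raw lake files into the warehouse, then transform in place.
# Stage, schema, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="elt_user",
    password="...",
    warehouse="TRANSFORM_WH",
    database="EXAMPLE_DB",
)
cur = conn.cursor()

# "L": copy the raw files from the data lake stage into a landing table, as-is.
cur.execute("""
    COPY INTO raw.payments
    FROM @data_lake_stage/raw/payments/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# "T": model the data for downstream consumers using the warehouse's own compute.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.fact_payments AS
    SELECT id, property_id, amount_cents / 100.0 AS amount, created_at
    FROM raw.payments
    WHERE amount_cents IS NOT NULL
""")
```

The data team owns everything after the COPY; the source team only has to keep dropping raw files into the lake.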

A-ha #3: Admit it. You probably don’t need “real time data” to start

At Buildium, as we set off to build out our data solutions, we thought that data in real time was key to any project we were undertaking. That decision set us back a couple of months. In the world of data movement, there are some complex and established solutions for consuming data in near real time. Using event streaming through services like Kafka, you can either process events in memory or stage data for loading into the warehouse.
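
For a sense of what that looks like, here is a hedged sketch of a consumer that stages Kafka messages as micro-batches for a later warehouse load, using the kafka-python client. The topic, broker address, and staging path are illustrative assumptions.

```python
# Near-real-time sketch: consume change events and stage them for warehouse loading.
# Topic, broker, and staging path are hypothetical.
import json

from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "payments-changes",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:
        # Flush a micro-batch to staging; a separate job loads staged files
        # into the warehouse on its own schedule.
        with open(f"/staging/payments-{message.offset}.json", "w") as out:
            json.dump(batch, out)
        batch = []
```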

But do you need to?

Moving data at speed comes with a cost when you push the data to your warehouse. Consider who or what is consuming the data and the expectation for updates. Concurrency issues, compute time, and re-indexing on the warehouse all increase as data loads become more frequent. Eventually, you will need to make trade-offs between data availability and the price and complexity of your solution.
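
The simpler alternative looks closer to this: a scheduled, incremental extract keyed off an updated-at watermark, written to the lake like any other raw file. Again, the table, columns, and watermark handling are hypothetical.

```python
# Simpler alternative: a scheduled incremental extract instead of an event stream.
# Table, columns, and watermark handling are hypothetical.

def extract_changes_since(oltp_conn, last_watermark):
    # Pull only the rows that changed since the previous run; the result gets
    # written to the data lake the same way as a full raw extract.
    cur = oltp_conn.cursor()
    cur.execute(
        "SELECT id, property_id, amount_cents, updated_at "
        "FROM payments WHERE updated_at > ?",
        (last_watermark,),
    )
    return cur.fetchall()

# A scheduler (cron, Airflow, etc.) runs this hourly or daily and advances the
# watermark to the latest updated_at it saw, e.g.:
# rows = extract_changes_since(oltp_conn, last_watermark="2019-10-23T00:00:00Z")
```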

When we were confronted with this decision, we decided that the power and benefits of moving large amounts of data to a data lake outweighed our need for real-time data. Once the data is in the lake, we have access to all of it in a single place to analyze, model, transform, and bring back into the product to provide more value to our customers. Will we need to solve the real-time problem? Probably. But with far less effort, we are providing a lot more value.

Where we are now

Only a short time ago, modelling the data before it reached the warehouse was the most important part of building a data warehouse. Now, your focus should be on getting your data to the data lake in the raw and using that compute power and cheap storage to enrich, refine, and create new data. Design your system to be flexible in order to empower your product teams and data teams alike. And finally, keep the value of the data as a whole in mind, rather than just the timeliness and complexity of the solution, so that you can provide the most for your customers and business.

About the author

Jordan Krey is a Director of Engineering at Buildium. Joining in 2013, he now manages our payments, leasing, and data teams. His work on Buildium’s business intelligence platform has allowed our business to be run through data. Now, he is working towards using our data to provide value back to our customers. As a manager, he is passionate about career development and writes regularly on his blog. You can follow him at kreyz.com or on Twitter @jordankrey.
