Is Big Data killing Data Warehouse?
There has been a lot of discussion about Big Data killing the Data Warehouse over the last few years. Of course, that was never really possible, because Big Data is a technology, while a Data Warehouse is a schema-on-write analytical architecture.
In the meantime, many new technologies and methodologies have emerged on the market and become mainstream. Many companies have invested in Data Lakes and collected enormous volumes of raw and unprocessed data, both structured and unstructured. All this data was supposed to be prepared, combined, and analyzed by super-smart data scientists using the schema-on-read approach. You can find many resources on the Internet that compare Data Warehouse and Data Lake architectures, and here we will use one of the most common images.
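To make the difference between schema-on-read and schema-on-write concrete, here is a minimal PySpark sketch; the file path and table name are hypothetical and serve only as an illustration, not a reference to any particular platform.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema_example").getOrCreate()

# Schema-on-read (Data Lake style): the raw files keep whatever structure they
# arrived with, and a schema is inferred only at the moment the data is read.
raw_events = spark.read.json("s3://my-lake/raw/clickstream/")  # hypothetical path
raw_events.printSchema()

# Schema-on-write (Data Warehouse style): the structure and types are enforced
# up front, when the data is written into a curated table.
curated = raw_events.select(
    F.col("user_id").cast("bigint"),
    F.col("event_ts").cast("timestamp"),
    F.col("event_type").cast("string"),
)
curated.write.mode("append").saveAsTable("dwh.fct_click_events")  # hypothetical table
```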
Eventually, in most cases, the investment in Data Lakes didn't pay off. Sometimes Data Lakes lacked data quality and governance and turned into data swamps. In other cases, there was no business case that would create additional business value from the data collected in the Data Lake. But in the majority of cases, there was simply no data science talent available to dig the golden data nuggets out of those huge data repositories.
Meanwhile, Data Warehouses in most large organizations were somewhat put aside, with only the necessary maintenance and minimal development investment. In some cases, their technology and architecture became obsolete: the technologies used ten years ago are now too expensive for the increasing data volumes, and their performance is far below that of modern massively parallel processing (MPP) databases.
Save investments in Data Lake – build a Data Lakehouse!
Fortunately, there is a way to save the huge investments in the Data Lake, and the answer is to build a Data Lakehouse. The definition is very simple: a Data Lakehouse is an architecture that includes a Data Lake for raw and unprocessed data in the staging and foundation layers, with a modern Data Warehouse architecture built on top of that Data Lake. With an architecture that combines the good characteristics of both the Data Lake and the Data Warehouse, the company has all raw, detailed data available for schema-on-read data science queries, and a consistent, schema-on-write model for standard business reporting and analytics.
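As a rough illustration of what "a Data Warehouse built on top of the Data Lake" can look like in practice, the PySpark sketch below exposes a foundation-layer folder of Parquet files for schema-on-read access and then materializes a curated, schema-on-write Data Warehouse table from it. The database names, table names, and storage location are assumptions made for this example only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse_layers").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS foundation")
spark.sql("CREATE DATABASE IF NOT EXISTS dwh")

# Data Lake side: the foundation layer stays as raw Parquet files and is only
# exposed as an external table for schema-on-read queries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS foundation.customer_raw
    USING PARQUET
    LOCATION 's3://my-lake/foundation/customer/'
""")

# Data Warehouse side: a consistent, schema-on-write base-layer table built
# on top of the same lake data, with types and basic cleansing applied.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dwh.dim_customer AS
    SELECT CAST(customer_id AS BIGINT) AS customer_id,
           TRIM(customer_name)         AS customer_name,
           CAST(valid_from AS DATE)    AS valid_from
    FROM foundation.customer_raw
    WHERE customer_id IS NOT NULL
""")
```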
Data Lakehouse architecture and processes are shown in the following picture:
The extraction and ingestion layer brings structured and unstructured data into the Data Lake using streaming or batch processing. The staging layer is used for temporary data storage and processing, while the foundation layer keeps the history of changes to the loaded data. The manual entry area is used for reference data and for additional analytical attributes and hierarchies that don't exist in the source data. The Data Warehouse is based on a modern, industry-standard data warehouse data model and consists of a detailed base layer and an aggregated performance and analytical layer. A common Process Control Framework manages all end-to-end data pipelines.
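A minimal sketch of one batch pipeline flowing through these layers is shown below, again in PySpark; all paths and table names are hypothetical, and a simple logging function stands in for the Process Control Framework.

```python
from datetime import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse_pipeline").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS dwh")

def log_step(pipeline: str, step: str) -> None:
    # Stand-in for the Process Control Framework; a real framework would also
    # persist run ids, row counts, and statuses for every step.
    print(f"{datetime.now().isoformat()} | {pipeline} | {step}")

PIPELINE = "orders_daily"

# 1. Extraction and ingestion: land the source extract in the staging layer as-is.
log_step(PIPELINE, "ingest to staging")
source = spark.read.option("header", True).csv("s3://my-lake/landing/orders/")
source.write.mode("overwrite").parquet("s3://my-lake/staging/orders/")

# 2. Foundation layer: append with a load timestamp so historical changes are kept.
log_step(PIPELINE, "load foundation")
staged = spark.read.parquet("s3://my-lake/staging/orders/")
(staged.withColumn("load_ts", F.current_timestamp())
       .write.mode("append").parquet("s3://my-lake/foundation/orders/"))

# 3. Data Warehouse base layer: typed, cleansed, schema-on-write table.
log_step(PIPELINE, "load dwh base layer")
foundation = spark.read.parquet("s3://my-lake/foundation/orders/")
(foundation.select(F.col("order_id").cast("bigint"),
                   F.col("customer_id").cast("bigint"),
                   F.col("order_amount").cast("decimal(18,2)"),
                   F.col("order_date").cast("date"))
           .write.mode("overwrite").saveAsTable("dwh.fct_orders"))

log_step(PIPELINE, "finished")
```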
Business users can access the data in the Data Warehouse layers using standard BI and analytical platforms, while data scientists can take raw data from the Data Lake and processed data from the Data Warehouse and combine both inputs for more complex and powerful models, either through data preparation tools or through a data virtualization layer.
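The sketch below shows, with hypothetical paths and table names, how a data scientist might combine raw, schema-on-read events from the Data Lake with a curated dimension from the Data Warehouse into a single input for a model; the same combination could equally be done through a data preparation tool or a data virtualization layer.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("combine_lake_and_dwh").getOrCreate()

# Raw, schema-on-read clickstream events straight from the Data Lake
# (hypothetical path; the schema is inferred at read time).
clicks = spark.read.json("s3://my-lake/raw/clickstream/")

# Curated, schema-on-write customer dimension from the Data Warehouse layer
# (hypothetical table name).
customers = spark.table("dwh.dim_customer")

# Combine both worlds into one feature set for a more powerful model.
features = (clicks.groupBy("customer_id")
                  .agg(F.count("*").alias("click_count"))
                  .join(customers, "customer_id", "left"))

features.show()
```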
Data Lakehouse architecture can be implemented on-premises, in the cloud, or using a hybrid approach. Technologies for each approach are available and mature, and they don't affect the architecture itself; it is only a matter of implementation.
I strongly believe that this kind of architecture is the foundation of future analytical systems.