
Understanding Data Lakes and Data Warehouses

Data Warehouse

A data warehouse is a central repository of integrated data from one or more disparate sources. It lets a business keep historical data from the software applications used by different departments in one place. Data is selected and extracted according to a target need and, before being loaded into the data warehouse, cleansed to ensure data quality.

The data is then transformed, structured, and categorized before being stored in the output zone of the data warehouse, called a data mart. Managers and other business professionals can use the data in the data mart to create performance reports, conduct online analytical processing, and support decision-making.
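The cleanse, transform, and load steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample rows, the `cleanse` and `load` helpers, and the `sales_mart` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from departmental systems.
raw_orders = [
    {"order_id": "1001", "region": " north ", "amount": "250.00"},
    {"order_id": "1002", "region": "South", "amount": "n/a"},       # unusable amount
    {"order_id": "1003", "region": "NORTH", "amount": "100.50"},
    {"order_id": "1001", "region": " north ", "amount": "250.00"},  # duplicate
]


def cleanse(rows):
    """Drop duplicate orders and rows with a non-numeric amount; normalise types."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            "amount": amount,
        })
    return clean


def load(rows):
    """Load cleansed rows into a structured table (the assumed data mart)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_mart ("
        "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sales_mart VALUES (:order_id, :region, :amount)", rows
    )
    conn.commit()
    return conn


conn = load(cleanse(raw_orders))
# A simple performance report: total sales per region.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales_mart GROUP BY region ORDER BY region"
).fetchall()
print(report)
```

The structure is fixed before any data is loaded, which is why the report query at the end can be so short: the work of deciding what the data means has already been done.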

Data Lake

Data lakes differ from data warehouses in that they allow data to be stored in its native form, such as CSV, logs, XML, JSON, email, PDF, image, audio, and video. Businesses can use data lakes to hold raw data without first defining what it will be used for. Instead, the data is stored in case it proves useful in the future.
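As a rough sketch of what storing data in its native form can look like, the snippet below writes incoming files byte-for-byte into a folder per source system, with no parsing or schema. The `land_raw` helper, the folder layout, and the sample payloads are assumptions made for illustration; a temporary directory stands in for real lake storage.

```python
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Write the payload unchanged under <lake_root>/<source>/<name>."""
    target = lake_root / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


# A temporary directory stands in for the lake in this sketch.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
csv_path = land_raw(lake, "crm", "contacts.csv", b"id,name\n1,Ada\n")
log_path = land_raw(lake, "web", "access.log", b"GET /index.html 200\n")

# List what the lake now holds, relative to its root.
stored = sorted(
    p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
)
print(stored)
```

Nothing here inspects the content of the files, so a CSV export and a web log are treated identically at write time.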

One big advantage is that a business may not know the value of the trends or patterns hidden in its raw data until it brings in data scientists to conduct data mining. Data lakes therefore facilitate the discovery of such trends and patterns, which can lead, for example, to more accurate forecasts or newly identified sales opportunities.

Comparison

Raw data cannot organise itself, so constructing and maintaining a data warehouse requires additional resources. These costs are usually offset by the benefits brought to users. This pay-and-get relationship is less clear-cut for data lakes. Because the data is collected without a precise goal, it is usually left in its raw format to save costs. Schemas are written to extract the useful information only when a specific need is identified or when the business embarks on data mining, an approach often called schema-on-read.
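The idea of writing a schema only when a need arises can be illustrated with a small Python sketch. The raw events, the field names, the `audio_uri` paths, and the `read_with_schema` helper are all hypothetical. The point is that the raw text is never modified and the schema is applied at read time.

```python
import json

# Hypothetical raw events kept exactly as received, one JSON document per line.
RAW_EVENTS = """\
{"type": "call", "customer": "C17", "duration_s": 312, "audio_uri": "calls/0001.wav"}
{"type": "email", "customer": "C42", "subject": "Refund request"}
{"type": "call", "customer": "C42", "duration_s": 95, "audio_uri": "calls/0002.wav", "agent": "A3"}
"""

# The schema is written only once a question comes up, here:
# "how long do customers spend on calls?"
CALL_SCHEMA = {"customer": str, "duration_s": int}


def read_with_schema(raw_text, record_type, schema):
    """Yield records of one type, projected onto the fields named in the schema."""
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") != record_type:
            continue
        # Skip records that lack a field the schema requires.
        if not all(name in record for name in schema):
            continue
        yield {name: cast(record[name]) for name, cast in schema.items()}


calls = list(read_with_schema(RAW_EVENTS, "call", CALL_SCHEMA))
total_seconds = sum(call["duration_s"] for call in calls)
print(calls, total_seconds)
```

A later question, for example about email subjects, would need only a new schema and another pass over the same raw text, since the fields ignored by `CALL_SCHEMA` remain available.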

The result is that data is kept in a more intact form. For example, customer service data may include recordings of customer calls, which preserve the voice and tone of the dialogue. A data lake would capture these recordings in their entirety.

In contrast, the same data stored in a data warehouse may capture only the key points of the exchange. The completeness of the raw data in a data lake gives the business richer opportunities for analysis than the data stored in a data warehouse.

Conclusion

Contrary to popular belief, data lakes are not a successor to data warehouses: they are distinct approaches to storing data that respond to different business needs and complement each other. While data lakes are indispensable for data mining, data warehouses support daily operations and ongoing monitoring.

Successful strategies will seek to incorporate both forms of storage to optimise use of one of the most valuable business assets: data.