Intro to Data Engineering — Part 2: Database, Data Warehouse and Data Lake

3 min readFeb 14, 2023

In the first part of this series, we have got the basic understanding of OLAP and OLTP systems. In this article we will try to debunk the differences between a database, datawarehouse and data lake.

A database is a collection of data or information. Databases are typically accessed electronically and are used to support Online Transaction Processing (OLTP). A variety of database types have emerged over the last several decades. Broadly, you can classify databases in two segments.

Relational databases store data in tables with fixed rows and columns.
Non-relational databases (also known as NoSQL databases) store data in a variety of models including JSON (JavaScript Object Notation), BSON (Binary JSON), key-value pairs, tables with rows and dynamic columns, and nodes and edges.

Popular databases include PostgreSQL, MongoDB etc.

However, databases are not optimized for handling large amounts of data all at once. This is where data warehouses come into play. Data warehouses are systems designed to store highly structured information from various sources, providing current and historical data in one place. The purpose of a data warehouse is to combine disparate data sources and create business intelligence (BI) reports and dashboards through data analysis and insights.

The question you may have is whether a data warehouse is a database. The answer is yes, a data warehouse is essentially a giant database optimized for analytics. The difference between a database and a data warehouse lies in their focus, as databases prioritize the most frequently used and recent data while data warehouses prioritize all current data from the database and historical data. Popular data warehouses include BigQuery, Redshift, Snowflake, and Azure DW.

While data warehouses provide a streamlined way to access data for analytics and predictive models, there’s another solution that has become the holy grail for analysts and data scientists: Data Lake. Before storing data in a data warehouse, data engineers usually perform some processing and cleaning to optimize storage efficiency. However, this process can sometimes result in the removal of valuable information that only an analyst or data scientist can identify.

To mitigate this, data lakes are used to store all kinds of data without preprocessing, regardless of size, structure, format, or occurrence. This allows analysts and data scientists to access the data lake for analytics and predictive modeling without missing any crucial information. Some popular data lake technologies include Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake.

Now you do understand, with proper coordination between the analytics and data engineering teams, there may be no need for a data lake as both teams can agree on what data to keep and what not to keep. However, data lakes provide a useful solution for those looking to store all kinds of data without preprocessing for the purpose of data-driven analysis and predictive modeling.

In the next part, we will learn about ETL vs ELT vs Data Engineering.

Intro to Data Engineering — Part 2: Database, Data Warehouse and Data Lake

Written by Hazzlenut

No responses yet