
What is a Data Lakehouse: A Guide

Jayesh Patil, Data Solutions Architect, Aug 15, 2023

This post serves as a guide to help understand what a data lakehouse is and what it can do. Read more about emerging trends in the lakehouse space here. Search Discovery is a data lakehouse consultant and solution provider.

What is a Data Lakehouse?

A lakehouse is a single, open-data platform that allows you to unify all your data in open-data formats with a single catalog, governance, and security controls. A lakehouse allows you to create a foundation for all data analysis, Business Intelligence (BI), and AI workloads.

Right now, data engineers and data scientists are leaning forward in their seats to hear what companies are doing to capitalize on AI. Because lakehouses enable BI and AI on all data, these platforms are helping to bring AI and ML technologies to nontechnical users, data engineers, and data scientists alike.

Data Warehouse vs. Data Lake vs. Data Lakehouse

Data Warehouse

A data warehouse pulls ONLY structured data from a data lake for analysis. Unlike a data lake, it usually doesn’t support open standard formats; however, its closed, proprietary data format (SQL only) means it can provide high-quality data with fine-grained security and governance. A data warehouse prepares data for analysis within BI tools, SQL applications, and decision support. However, scaling a data warehouse becomes expensive. Further, each separate data warehouse requires its own governance architecture.

Data Lake

A data lake is a “fluid,” open-format data storage container for all types of data (structured, semi-structured, textual, and unstructured raw data). But data lakes are difficult to govern, and the quality of data stored this way is low. It’s basically a dumping place for all the data in your business. You can connect machine learning to your data lake, but it is not built to serve BI or SQL analysis directly.

Data Lakehouse

A data lakehouse adds data warehouse features to your data lake. In other words, you still get your open-format data storage, and all your data types can live in it (structured, semi-structured, textual, and unstructured), AND you get high-quality, reliable data with fine-grained security and governance that’s usable for BI tools, SQL applications, and machine learning. Unlike a data warehouse, a data lakehouse scales to hold any amount of data at a low cost, regardless of the data type.

Data warehouse vs. data lake vs. data lakehouse

Key Features of a Data Lakehouse

  • Openness
  • Scalability
  • Support for diverse data types
  • Choice of languages for processing
  • Separation of storage and compute
  • Schema enforcement and governance
  • High performance and concurrency support for diverse workloads (data science, ML, SQL, analytics)
  • ACID transactions
  • BI support with direct access to source data
  • Version history and streaming support

Background on the Evolution of the Data Lakehouse

Data warehouse technology has been around since the 1980s and has a long history of supporting decision support and business intelligence applications. With the development of Massively Parallel Processing (MPP) architecture, warehouses were able to hold more data, but they still could not handle unstructured or semi-structured data, or data with high variety, velocity, and volume.

About ten years ago, companies began building data lakes as storehouses for raw data. But data lakes didn't quite fit the vision architects had for building a single system to house data for different analytics products and workloads, because they didn't support transactions, didn't enforce data quality, and couldn't mix appends and reads or batch and streaming jobs.

One common approach is a two-tier data architecture that stitches a data lake and a data warehouse together to enable BI and ML across the data in both systems. This approach gets organizations closer to automated data initiatives while maintaining legacy analytics and BI workflows, but it results in duplicate data, extra infrastructure costs, regular maintenance, security challenges, and high operational costs.

In another approach, brands employ multiple systems, for example, a data lake, several data warehouses, and other systems like streaming, time series, graph, and image databases. This brings the same benefits as the two-tier data architecture described above, but the combined systems also introduce complexity, silos, and delay.

What Technology Makes a Data Lakehouse Work?

New technology that enables the data lakehouse includes the following:

  • Metadata layers: These layers act as a go-between for unstructured data and the data user in order to categorize and classify the data. They sit on top of open file formats, like Parquet files, and track which files are part of different table versions. Metadata layers enable rich management features like ACID-compliant transactions, and they support streaming I/O, time travel to old table versions, schema enforcement and evolution, and data validation.
  • Query engine designs for high-performance SQL execution on data lakes: These can cache hot data in RAM/SSDs and provide data layout optimizations to cluster co-accessed data. They have auxiliary data structures including statistics and indexes and allow for vectorized execution on modern CPUs.
  • Access for data science and machine learning tools: The open data formats (e.g., Parquet and ORC) make it easy for data scientists and machine learning engineers to access the data in the data lakehouse and use popular DS/ML ecosystem tools like pandas, TensorFlow, and PyTorch.
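
To make the metadata-layer idea above more concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not how Delta Lake or any other product implements its log: the names DataFile, TableLog, commit, snapshot, and files_for_id are hypothetical, the file paths are made up, and the only statistic tracked is the minimum and maximum of an assumed "id" column. It shows three of the ideas listed above: tracking which files belong to each table version, time travel to an older version, and schema enforcement, plus a simple use of file statistics to skip files during a query.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """One immutable data file (e.g. a Parquet file) plus simple statistics."""
    path: str    # hypothetical file name
    min_id: int  # smallest "id" value stored in the file
    max_id: int  # largest "id" value stored in the file


@dataclass
class TableLog:
    """Toy metadata layer: an append-only log of commits over immutable files."""
    schema: dict                                  # column name -> Python type
    commits: list = field(default_factory=list)   # each entry: (added, removed)

    def validate(self, row: dict) -> None:
        """Schema enforcement: reject rows whose columns or types do not match."""
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")

    def commit(self, added=(), removed=()) -> int:
        """Record one change as a single log entry and return the new version."""
        self.commits.append((tuple(added), tuple(removed)))
        return len(self.commits)

    def snapshot(self, version=None) -> set:
        """Replay the log up to `version` to list its files (time travel)."""
        if version is None:
            version = len(self.commits)
        files = set()
        for added, removed in self.commits[:version]:
            files -= set(removed)
            files |= set(added)
        return files

    def files_for_id(self, wanted_id: int, version=None) -> list:
        """Data skipping: keep only files whose id range can hold wanted_id."""
        return sorted(
            f.path for f in self.snapshot(version)
            if f.min_id <= wanted_id <= f.max_id
        )


log = TableLog(schema={"id": int, "name": str})
log.validate({"id": 1, "name": "a"})  # passes: columns and types match

f1 = DataFile("part-0001.parquet", 1, 100)
f2 = DataFile("part-0002.parquet", 101, 200)
f3 = DataFile("part-0003.parquet", 1, 200)

v1 = log.commit(added=[f1, f2])                # version 1: two small files
v2 = log.commit(added=[f3], removed=[f1, f2])  # version 2: files compacted

print(log.files_for_id(150, version=v1))  # reads the older table version
print(log.files_for_id(150))              # reads the latest table version
```

Because readers only ever see the files named by a complete log entry, a half-finished write is invisible to them, which is the intuition behind the transactional guarantees mentioned above. Real metadata layers add durable storage, conflict detection between concurrent writers, and much richer statistics.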

What are Data Lakehouse Examples?

  • Databricks is the fastest growing lakehouse solution that is supported across all the major cloud platforms. Databricks, the company behind Apache Spark, MLflow, and Delta Lake, provides a single unified data analytics platform for BI and AI use cases. Its lakehouse solution uses Delta Lake for data reliability and performance and Unity Catalog for fine-grained governance. It is based on open source standards and adds transactional processing guarantees with performance benefits in the data lake.

    *Search Discovery is a Databricks partner.

  • Google BigLake is a storage engine built on years of innovations in BigQuery storage. It allows uniform and consistent access through open source query engines to multi-cloud object stores like Amazon S3 and Google Cloud Storage (see all our Google Cloud Solutions here). BigLake removes the need to duplicate data between data lakes and warehouses and allows interoperability across multi-cloud platforms. Google's Dataplex provides a single, centralized data governance solution for managing access policies and classification. BigLake with Dataplex provides a robust lakehouse solution built on open source technologies supporting business intelligence and data science workloads.

    *Search Discovery is a Google Premier Partner, offering Google Marketing Platform and Google Cloud Platform services and solutions.

Do You Need a Data Lakehouse?

You may have a data lake and a data warehouse (or several!). But these data storage solutions may not solve all your business needs, depending on the type of data you collect. You might need a data lakehouse if:

  • You want to analyze unstructured data (from text, IoT, images, audio, drones, etc.)
  • You want to run AI on your data warehouse
  • Your SQL analysts need an easy way to query your data lake
  • A data warehouse/data lake approach isn't meeting your company's data demands

For these tasks and more, a data lakehouse is a powerful answer. However, each company should weigh its own business needs when deciding whether a data lakehouse is the right option. Lakehouses can be complicated to build from scratch, and you'll want to have a good idea of platform options and their capabilities before you decide to buy anything.

What are the Benefits of a Data Lakehouse?

A data lakehouse combines the best of both former data storage approaches: cheaper storage, more performant queries, simplified data governance, automatic addition of new data, and direct access to raw data. The separation of storage and compute allows users of the lakehouse to right-size the resources they need for each workload. A lakehouse is a single platform with the ability to query across structured and unstructured data to meet both business intelligence and data science needs.

How Can Search Discovery Help You With Data Lakehouse Solutions?

When you work with our data engineering experts, we deliver more value than other partners because of our experience and deep expertise in analytics and data science. Get data lakehouse solutions that deliver: 

  • A finely-tuned, mission-purposed data platform
  • Reduced cost and data redundancy by simplifying data sources
  • Faster turnaround time for data science projects
  • Expert data science consulting services to take your insights to the next level

We help clients with the assessment, migration, implementation, and activation phases of data lakehouse consideration. Read more here.

Contact us today, and we'll help you build the data lakehouse of your dreams.
