Data Lakehouse Architecture | Pillai Infotech LLP

Q: Does a lakehouse replace my data warehouse?

It can. Modern lakehouse query engines match warehouse performance at a fraction of the cost. But don't migrate a working setup just for the trend.

Q: Delta Lake or Iceberg — which should I choose?

Iceberg for multi-engine environments. Delta Lake if all-in on Databricks. Iceberg has stronger open-source momentum in 2026.

Q: Is a lakehouse just a data lake with extra steps?

Yes — ACID transactions, schema enforcement, and indexing. These extras make data lakes reliable for analytics.

Q: How does lakehouse performance compare to Snowflake?

Snowflake is 20-50% faster out of the box for BI queries, but the gap narrows to 5-15% with optimization. Lakehouse wins for ML workloads.

Q: What is a data catalog and do I need one?

A centralized registry for lakehouse tables. You need one once you have more than a few tables to prevent conflicting data views.

Data Lakehouse Architecture: The Best of Data Lakes and Warehouses

Data warehouses are expensive and rigid. Data lakes are cheap but chaotic. The lakehouse gives you both — ACID transactions on object storage, schema enforcement on raw files, and BI queries on your data lake.

📈 Database & Data February 6, 2026 13 min read

In This Guide

1. The Problem: Data Lakes vs Data Warehouses
2. What Is a Data Lakehouse?
3. Open Table Formats — Delta Lake, Iceberg, Hudi
4. Lakehouse Architecture in Practice
5. Query Engines — How You Actually Query a Lakehouse
6. When to Use (and Not Use) a Lakehouse
7. Frequently Asked Questions

For the last decade, data teams have been forced to pick between two imperfect options: data warehouses (reliable but expensive and siloed) or data lakes (cheap but unreliable and chaotic). The data lakehouse is the answer to that false choice — and it's rapidly becoming the default architecture for modern data platforms.

1. The Problem: Data Lakes vs Data Warehouses

Factor	Data Warehouse	Data Lake	Lakehouse
Storage	Proprietary format	Open (Parquet, ORC)	Open (Parquet + metadata layer)
ACID transactions	Yes	No	Yes
Schema enforcement	Strong	None (schema-on-read)	Configurable (enforce or evolve)
Cost per TB/month	$23-40 (Snowflake, BigQuery)	$2-5 (S3, GCS)	$2-5 (same object storage)
BI query speed	Fast (optimized)	Slow (full scans)	Fast (data skipping, Z-ordering)
ML workloads	Poor (export data first)	Native (Spark, PyTorch)	Native
Data types	Structured only	Any (images, logs, JSON)	Any
Vendor lock-in	High	Low	Low (open formats)

2. What Is a Data Lakehouse?

A data lakehouse adds a metadata and transaction layer on top of cheap object storage (S3, GCS, ADLS). The data stays as Parquet files — you just get warehouse-like features (ACID, schema, indexing) via a table format.

Traditional Data Lake:
    S3 bucket → Parquet files (no guarantees, partial writes, schema drift)

Lakehouse:
    S3 bucket → Parquet files + Transaction Log (Delta Lake / Iceberg / Hudi)
                                    │
                                    ├── ACID commits (atomic writes)
                                    ├── Schema enforcement & evolution
                                    ├── Time travel (query past snapshots)
                                    ├── Partition & file-level statistics
                                    └── UPDATE, DELETE, MERGE operations

Same storage. Same files. Just smarter metadata.

The key insight: you don't need a separate, expensive data warehouse for reliable analytics. You can get warehouse reliability on lake storage — at lake prices.

3. Open Table Formats — Delta Lake, Iceberg, Hudi

Three open table formats compete to be the lakehouse standard. Each takes a different approach to the same problem.

Feature	Delta Lake	Apache Iceberg	Apache Hudi
Creator	Databricks	Netflix → Apache	Uber → Apache
Transaction log	JSON log files on storage	Manifest + metadata files	Timeline + metadata
Best ecosystem	Databricks, Spark	Multi-engine (widest)	Spark, Flink
Schema evolution	Add/rename columns	Full (add, rename, reorder, promote)	Add/rename columns
Partition evolution	No (rewrite required)	Yes (hidden partitioning)	No
Time travel	Yes (30-day default)	Yes (snapshot-based)	Yes
Incremental reads	Change Data Feed	Incremental scan	Native (designed for incremental)

Delta Lake — PySpark Example

# Write data as a Delta table (ACID-guaranteed)
df = spark.read.json("s3://raw-data/orders/2026-02-06/")

df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("order_date") \
    .save("s3://lakehouse/orders/")

# MERGE (upsert) — update existing, insert new
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
source = spark.read.json("s3://raw-data/orders/latest/")

target.alias("t") \
    .merge(source.alias("s"), "t.order_id = s.order_id") \
    .whenMatchedUpdate(set={"status": "s.status", "updated_at": "s.updated_at"}) \
    .whenNotMatchedInsertAll() \
    .execute()

# Time travel — query yesterday's version
df_yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2026-02-05") \
    .load("s3://lakehouse/orders/")

# Schema enforcement — this fails if schema doesn't match
df_bad = spark.createDataFrame([{"wrong_column": "value"}])
df_bad.write.format("delta").mode("append").save("s3://lakehouse/orders/")
# AnalysisException: schema mismatch detected

Apache Iceberg — SQL Example

-- Create an Iceberg table (works with Spark, Trino, Flink, Dremio)
CREATE TABLE lakehouse.orders (
    order_id    STRING,
    customer_id STRING,
    amount      DECIMAL(10,2),
    status      STRING,
    order_date  DATE
) USING iceberg
PARTITIONED BY (days(order_date));  -- Hidden partitioning!

-- Users don't need to know about partitions
SELECT * FROM lakehouse.orders WHERE order_date = '2026-02-06';
-- Iceberg automatically prunes to the right partition

-- Partition evolution — change partitioning without rewriting data
ALTER TABLE lakehouse.orders ADD PARTITION FIELD months(order_date);

-- Time travel
SELECT * FROM lakehouse.orders VERSION AS OF 12345;  -- snapshot ID
SELECT * FROM lakehouse.orders TIMESTAMP AS OF '2026-02-05T10:00:00';

-- Expire old snapshots (cleanup)
CALL lakehouse.system.expire_snapshots('orders', TIMESTAMP '2026-01-01');

Which Format to Pick: Databricks shop? Delta Lake. Multi-engine environment (Spark + Trino + Flink)? Iceberg — its engine-agnostic design is winning the ecosystem war. Heavy incremental/streaming workloads? Hudi. If unsure, Iceberg is the safest bet in 2026 — it has the broadest engine support and the most active open-source community.

4. Lakehouse Architecture in Practice

Medallion Architecture (Bronze → Silver → Gold)

Data Sources (APIs, DBs, Events)
    │
    ▼
┌─────────────────────────────────────────────────────┐
│  BRONZE (Raw)                                        │
│  s3://lakehouse/bronze/orders/                       │
│  • Raw data as-is from source                        │
│  • Append-only, no transformations                   │
│  • Schema-on-read, keep everything                   │
│  • Retention: forever (cheap storage)                │
└─────────────────────┬───────────────────────────────┘
                      │  dbt / Spark ETL
                      ▼
┌─────────────────────────────────────────────────────┐
│  SILVER (Cleaned)                                    │
│  s3://lakehouse/silver/orders/                       │
│  • Deduplicated, validated, typed                    │
│  • Schema enforced, nulls handled                    │
│  • Slowly changing dimensions applied                │
│  • Retention: 2-5 years                              │
└─────────────────────┬───────────────────────────────┘
                      │  dbt models / aggregations
                      ▼
┌─────────────────────────────────────────────────────┐
│  GOLD (Business-Ready)                               │
│  s3://lakehouse/gold/monthly_revenue/                │
│  • Aggregated, business-logic applied                │
│  • Ready for BI dashboards and reports               │
│  • Optimized for query performance (Z-ordered)       │
│  • Retention: as needed                              │
└─────────────────────────────────────────────────────┘

5. Query Engines — How You Actually Query a Lakehouse

Engine	Type	Best For	Supports
Spark SQL	Batch + streaming	ETL, ML, large-scale analytics	Delta, Iceberg, Hudi
Trino (Presto)	Interactive SQL	Ad-hoc queries, BI tools	Delta, Iceberg, Hudi
Dremio	Lakehouse query engine	Self-service BI on lake data	Iceberg (native)
StarRocks / DuckDB	OLAP / embedded	Low-latency dashboards / local analysis	Iceberg, Delta
Databricks SQL	Managed SQL warehouse	Databricks customers, BI	Delta (native), Iceberg

6. When to Use (and Not Use) a Lakehouse

Scenario	Recommendation	Why
Both analytics AND ML workloads	Lakehouse	Single source of truth for both
Data warehouse costs growing fast	Lakehouse	10x cheaper storage
Need multi-engine access to same data	Lakehouse (Iceberg)	Open format, no lock-in
Small data (< 100GB), pure BI	Data warehouse	Simpler, faster setup
No data engineering team	Managed warehouse	Lakehouse requires more ops
Streaming + batch in one platform	Lakehouse	Native streaming table support

Our Perspective: The lakehouse is the right default architecture for new data platforms in 2026. But don't migrate a working Snowflake/BigQuery setup just because lakehouses are trendy. Migrate when you hit a genuine pain point — cost, ML workload friction, or vendor lock-in. The best architecture is the one your team can actually operate.

Frequently Asked Questions

Does a lakehouse replace my data warehouse?

It can. Databricks, Dremio, and StarRocks deliver BI-grade query performance on lakehouse tables. For most workloads, a lakehouse with a fast query engine matches warehouse performance at a fraction of the cost. However, if your team is deeply invested in Snowflake/BigQuery and it works well, the migration cost may not justify the savings.

Delta Lake or Iceberg — which should I choose?

Iceberg if you use multiple engines (Spark, Trino, Flink, Dremio). Delta Lake if you're all-in on Databricks. Iceberg's momentum is stronger in the open-source ecosystem — AWS, Google, Snowflake, and Databricks all support it now. Delta Lake has the best Spark integration but is less portable.

Is a lakehouse just a data lake with extra steps?

In a sense, yes — and that's the point. The "extra steps" (ACID transactions, schema enforcement, indexing) are what make a data lake actually usable for reliable analytics. A raw data lake becomes a "data swamp" fast. The lakehouse is what a data lake should have been all along.

How does lakehouse performance compare to Snowflake?

For typical BI queries, Snowflake is 20-50% faster out of the box due to decades of query optimization. But with Z-ordering, data compaction, and a fast engine like StarRocks or Databricks SQL, the gap narrows to 5-15%. For large-scale ML and unstructured data workloads, the lakehouse wins — Snowflake wasn't designed for those.

What is a data catalog and do I need one?

A data catalog (Unity Catalog, AWS Glue Catalog, Nessie, Polaris) is a centralized registry of your lakehouse tables — it tracks schemas, permissions, lineage, and allows multiple engines to discover tables. You need one once you have more than a few tables. Without it, different teams create conflicting views of the same data.

📈

Pillai Infotech LLP

We design data lakehouse architectures for organizations ready to unify their analytics and ML platforms. Let's build your lakehouse.

Data Engineering Fundamentals: Building Modern Data Pipelines → Real-Time Data Processing: Kafka, Flink, and Stream Architecture → PostgreSQL vs MySQL: Database Comparison for 2026 →