In This Guide
For the last decade, data teams have been forced to pick between two imperfect options: data warehouses (reliable but expensive and siloed) or data lakes (cheap but unreliable and chaotic). The data lakehouse is the answer to that false choice — and it's rapidly becoming the default architecture for modern data platforms.
1. The Problem: Data Lakes vs Data Warehouses
| Factor | Data Warehouse | Data Lake | Lakehouse |
|---|---|---|---|
| Storage | Proprietary format | Open (Parquet, ORC) | Open (Parquet + metadata layer) |
| ACID transactions | Yes | No | Yes |
| Schema enforcement | Strong | None (schema-on-read) | Configurable (enforce or evolve) |
| Cost per TB/month | $23-40 (Snowflake, BigQuery) | $2-5 (S3, GCS) | $2-5 (same object storage) |
| BI query speed | Fast (optimized) | Slow (full scans) | Fast (data skipping, Z-ordering) |
| ML workloads | Poor (export data first) | Native (Spark, PyTorch) | Native |
| Data types | Structured only | Any (images, logs, JSON) | Any |
| Vendor lock-in | High | Low | Low (open formats) |
2. What Is a Data Lakehouse?
A data lakehouse adds a metadata and transaction layer on top of cheap object storage (S3, GCS, ADLS). The data stays as Parquet files — you just get warehouse-like features (ACID, schema, indexing) via a table format.
Traditional Data Lake:
S3 bucket → Parquet files (no guarantees, partial writes, schema drift)
Lakehouse:
S3 bucket → Parquet files + Transaction Log (Delta Lake / Iceberg / Hudi)
│
├── ACID commits (atomic writes)
├── Schema enforcement & evolution
├── Time travel (query past snapshots)
├── Partition & file-level statistics
└── UPDATE, DELETE, MERGE operations
Same storage. Same files. Just smarter metadata.
The key insight: you don't need a separate, expensive data warehouse for reliable analytics. You can get warehouse reliability on lake storage — at lake prices.
3. Open Table Formats — Delta Lake, Iceberg, Hudi
Three open table formats compete to be the lakehouse standard. Each takes a different approach to the same problem.
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Creator | Databricks | Netflix → Apache | Uber → Apache |
| Transaction log | JSON log files on storage | Manifest + metadata files | Timeline + metadata |
| Best ecosystem | Databricks, Spark | Multi-engine (widest) | Spark, Flink |
| Schema evolution | Add/rename columns | Full (add, rename, reorder, promote) | Add/rename columns |
| Partition evolution | No (rewrite required) | Yes (hidden partitioning) | No |
| Time travel | Yes (30-day default) | Yes (snapshot-based) | Yes |
| Incremental reads | Change Data Feed | Incremental scan | Native (designed for incremental) |
Delta Lake — PySpark Example
# Write data as a Delta table (ACID-guaranteed)
df = spark.read.json("s3://raw-data/orders/2026-02-06/")
df.write \
.format("delta") \
.mode("append") \
.partitionBy("order_date") \
.save("s3://lakehouse/orders/")
# MERGE (upsert) — update existing, insert new
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "s3://lakehouse/orders/")
source = spark.read.json("s3://raw-data/orders/latest/")
target.alias("t") \
.merge(source.alias("s"), "t.order_id = s.order_id") \
.whenMatchedUpdate(set={"status": "s.status", "updated_at": "s.updated_at"}) \
.whenNotMatchedInsertAll() \
.execute()
# Time travel — query yesterday's version
df_yesterday = spark.read \
.format("delta") \
.option("timestampAsOf", "2026-02-05") \
.load("s3://lakehouse/orders/")
# Schema enforcement — this fails if schema doesn't match
df_bad = spark.createDataFrame([{"wrong_column": "value"}])
df_bad.write.format("delta").mode("append").save("s3://lakehouse/orders/")
# AnalysisException: schema mismatch detected
Apache Iceberg — SQL Example
-- Create an Iceberg table (works with Spark, Trino, Flink, Dremio)
CREATE TABLE lakehouse.orders (
order_id STRING,
customer_id STRING,
amount DECIMAL(10,2),
status STRING,
order_date DATE
) USING iceberg
PARTITIONED BY (days(order_date)); -- Hidden partitioning!
-- Users don't need to know about partitions
SELECT * FROM lakehouse.orders WHERE order_date = '2026-02-06';
-- Iceberg automatically prunes to the right partition
-- Partition evolution — change partitioning without rewriting data
ALTER TABLE lakehouse.orders ADD PARTITION FIELD months(order_date);
-- Time travel
SELECT * FROM lakehouse.orders VERSION AS OF 12345; -- snapshot ID
SELECT * FROM lakehouse.orders TIMESTAMP AS OF '2026-02-05T10:00:00';
-- Expire old snapshots (cleanup)
CALL lakehouse.system.expire_snapshots('orders', TIMESTAMP '2026-01-01');
4. Lakehouse Architecture in Practice
Medallion Architecture (Bronze → Silver → Gold)
Data Sources (APIs, DBs, Events)
│
▼
┌─────────────────────────────────────────────────────┐
│ BRONZE (Raw) │
│ s3://lakehouse/bronze/orders/ │
│ • Raw data as-is from source │
│ • Append-only, no transformations │
│ • Schema-on-read, keep everything │
│ • Retention: forever (cheap storage) │
└─────────────────────┬───────────────────────────────┘
│ dbt / Spark ETL
▼
┌─────────────────────────────────────────────────────┐
│ SILVER (Cleaned) │
│ s3://lakehouse/silver/orders/ │
│ • Deduplicated, validated, typed │
│ • Schema enforced, nulls handled │
│ • Slowly changing dimensions applied │
│ • Retention: 2-5 years │
└─────────────────────┬───────────────────────────────┘
│ dbt models / aggregations
▼
┌─────────────────────────────────────────────────────┐
│ GOLD (Business-Ready) │
│ s3://lakehouse/gold/monthly_revenue/ │
│ • Aggregated, business-logic applied │
│ • Ready for BI dashboards and reports │
│ • Optimized for query performance (Z-ordered) │
│ • Retention: as needed │
└─────────────────────────────────────────────────────┘
5. Query Engines — How You Actually Query a Lakehouse
| Engine | Type | Best For | Supports |
|---|---|---|---|
| Spark SQL | Batch + streaming | ETL, ML, large-scale analytics | Delta, Iceberg, Hudi |
| Trino (Presto) | Interactive SQL | Ad-hoc queries, BI tools | Delta, Iceberg, Hudi |
| Dremio | Lakehouse query engine | Self-service BI on lake data | Iceberg (native) |
| StarRocks / DuckDB | OLAP / embedded | Low-latency dashboards / local analysis | Iceberg, Delta |
| Databricks SQL | Managed SQL warehouse | Databricks customers, BI | Delta (native), Iceberg |
6. When to Use (and Not Use) a Lakehouse
| Scenario | Recommendation | Why |
|---|---|---|
| Both analytics AND ML workloads | Lakehouse | Single source of truth for both |
| Data warehouse costs growing fast | Lakehouse | 10x cheaper storage |
| Need multi-engine access to same data | Lakehouse (Iceberg) | Open format, no lock-in |
| Small data (< 100GB), pure BI | Data warehouse | Simpler, faster setup |
| No data engineering team | Managed warehouse | Lakehouse requires more ops |
| Streaming + batch in one platform | Lakehouse | Native streaming table support |
Frequently Asked Questions
Does a lakehouse replace my data warehouse?
It can. Databricks, Dremio, and StarRocks deliver BI-grade query performance on lakehouse tables. For most workloads, a lakehouse with a fast query engine matches warehouse performance at a fraction of the cost. However, if your team is deeply invested in Snowflake/BigQuery and it works well, the migration cost may not justify the savings.
Delta Lake or Iceberg — which should I choose?
Iceberg if you use multiple engines (Spark, Trino, Flink, Dremio). Delta Lake if you're all-in on Databricks. Iceberg's momentum is stronger in the open-source ecosystem — AWS, Google, Snowflake, and Databricks all support it now. Delta Lake has the best Spark integration but is less portable.
Is a lakehouse just a data lake with extra steps?
In a sense, yes — and that's the point. The "extra steps" (ACID transactions, schema enforcement, indexing) are what make a data lake actually usable for reliable analytics. A raw data lake becomes a "data swamp" fast. The lakehouse is what a data lake should have been all along.
How does lakehouse performance compare to Snowflake?
For typical BI queries, Snowflake is 20-50% faster out of the box due to decades of query optimization. But with Z-ordering, data compaction, and a fast engine like StarRocks or Databricks SQL, the gap narrows to 5-15%. For large-scale ML and unstructured data workloads, the lakehouse wins — Snowflake wasn't designed for those.
What is a data catalog and do I need one?
A data catalog (Unity Catalog, AWS Glue Catalog, Nessie, Polaris) is a centralized registry of your lakehouse tables — it tracks schemas, permissions, lineage, and allows multiple engines to discover tables. You need one once you have more than a few tables. Without it, different teams create conflicting views of the same data.
Pillai Infotech LLP
We design data lakehouse architectures for organizations ready to unify their analytics and ML platforms. Let's build your lakehouse.