Data governance is the unsexy foundation that makes everything else work — analytics, AI, compliance, and trust. Without it, you're making decisions on data nobody can verify, training models on datasets nobody understands, and hoping you're compliant with regulations nobody has mapped. This guide makes it practical.
1. What Data Governance Actually Means
Data governance is the system of policies, processes, and tools that ensure data is accurate, secure, accessible, and compliant. It answers four questions for every dataset:
- What data do we have? — Data catalog and discovery
- Is it accurate? — Data quality and validation
- Where did it come from? — Data lineage and provenance
- Who can access it? — Access controls, privacy, and compliance
2. The Five Pillars of Data Governance
| Pillar | What It Covers | Key Tools | Priority |
|---|---|---|---|
| Data Catalog | Discovery, documentation, search | DataHub, Atlan, Alation | Start here |
| Data Quality | Validation, profiling, anomaly detection | Great Expectations, dbt tests, Soda | High |
| Data Lineage | Source-to-destination tracking | OpenLineage, Marquez, DataHub | Medium |
| Access Control | Who can see/modify what data | Unity Catalog, Apache Ranger, IAM | High (if regulated) |
| Data Privacy | PII detection, masking, compliance | Presidio, Privacera, column masking | Critical (if handling PII) |
3. Data Catalog — Know What You Have
A data catalog is the "Google for your data" — it lets anyone in the organization discover, understand, and trust datasets without asking the data team.
| Tool | Type | Best For | Pricing |
|---|---|---|---|
| DataHub | Open source (LinkedIn) | Technical teams, extensible | Free / Acryl Cloud |
| Atlan | SaaS | Modern data teams, great UX | Custom pricing |
| Unity Catalog | Open source (Databricks) | Lakehouse environments | Free / Databricks |
| OpenMetadata | Open source | Teams wanting full control | Free / SaaS |
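To make the idea concrete, here is a minimal sketch of what a catalog stores and how search over it might work. The `DatasetEntry` fields and the `search` helper are hypothetical and only illustrate the concept; they are not the data model of DataHub, Atlan, or any other tool in the table.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record: enough metadata to find and trust a dataset."""
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)


def search(catalog, term):
    """Return entries whose name, description, or tags contain `term` (case-insensitive)."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in tag.lower() for tag in entry.tags)
    ]


# Illustrative entries only; the names and owners are made up.
catalog = [
    DatasetEntry("orders", "Order data from the e-commerce platform", "sales-eng", ["revenue"]),
    DatasetEntry("customers", "Customer master data", "crm-team", ["pii"]),
]

pii_datasets = [entry.name for entry in search(catalog, "pii")]
```

A real catalog adds schema details, usage statistics, and lineage on top of this, but the core value is the same: every dataset has a description and a named owner that anyone can look up.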
4. Data Quality — Trust Your Data
| Dimension | Question It Answers | Check |
|---|---|---|
| Completeness | Are there missing values? | NULL rate per column < threshold |
| Uniqueness | Are there duplicates? | Unique constraint on IDs |
| Validity | Is data in expected format/range? | Email regex, price > 0, status in enum |
| Freshness | Is data up to date? | Max timestamp within expected window |
| Consistency | Do related datasets agree? | Row counts match, referential integrity |
| Volume | Did we get the expected amount? | Row count within expected range |
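The checks in the table are simple enough to sketch in plain Python. The example below is an illustration under assumptions, not a replacement for a tool like Great Expectations or Soda: rows are plain dictionaries, and the function names, sample data, and thresholds are all hypothetical. Consistency is left out because it needs a second dataset to compare against.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Completeness: fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def duplicate_count(rows, column):
    """Uniqueness: number of rows whose `column` value was already seen."""
    seen, dupes = set(), 0
    for row in rows:
        value = row.get(column)
        if value in seen:
            dupes += 1
        seen.add(value)
    return dupes


def invalid_count(rows, column, predicate):
    """Validity: number of non-null values that fail `predicate`."""
    return sum(
        1 for row in rows
        if row.get(column) is not None and not predicate(row[column])
    )


def is_fresh(rows, column, max_age, now):
    """Freshness: the newest timestamp is no older than `max_age`."""
    timestamps = [row[column] for row in rows if row.get(column) is not None]
    return bool(timestamps) and now - max(timestamps) <= max_age


def volume_ok(rows, low, high):
    """Volume: row count falls inside the expected range."""
    return low <= len(rows) <= high


# Fixed "now" so the example is deterministic.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": 1, "amount": 25.0, "created_at": NOW - timedelta(hours=3)},
    {"order_id": 2, "amount": -5.0, "created_at": NOW - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "created_at": NOW - timedelta(hours=1)},
]

report = {
    "amount_null_rate": null_rate(orders, "amount"),
    "order_id_duplicates": duplicate_count(orders, "order_id"),
    "amount_invalid": invalid_count(orders, "amount", lambda v: v > 0),
    "fresh_within_24h": is_fresh(orders, "created_at", timedelta(hours=24), NOW),
    "volume_ok": volume_ok(orders, 1, 1000),
}
```

In this sample, one of three amounts is null, one `order_id` is duplicated, and one amount is negative, so three of the five checks would flag a problem.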
dbt Data Quality Tests
```yaml
# schema.yml: declarative data quality checks
# Tests prefixed with dbt_utils. require the dbt_utils package.
version: 2

models:
  - name: orders
    description: "Order data from the e-commerce platform"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
          - dbt_utils.at_least_one  # Fail if the table has no rows
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
    # Table-level tests
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 24  # Fail if no data in 24 hours
```
5. Data Lineage — Know Where It Came From
Data lineage tracks how data flows from source to destination — which systems produced it, which transformations modified it, and which reports consume it.
```text
Stripe API ──→ Raw Payments (Bronze) ──→ Clean Payments (Silver) ──→ Revenue Report
    │                                                                      │
    │    Shopify API ──→ Raw Orders ──→ Clean Orders ──────────────────────┘
    │
    └── When Stripe changes their API schema, lineage tells you:
          • Which downstream tables are affected
          • Which dashboards will break
          • Who owns each step in the pipeline
          • What the blast radius is
```
| Lineage Tool | Approach | Integrations |
|---|---|---|
| OpenLineage | Open standard — emits lineage events | Spark, Airflow, dbt, Flink |
| dbt (built-in) | Model-level lineage (DAG) derived from ref() and source(); column-level lineage is a dbt Cloud feature | Any SQL warehouse |
| DataHub / Atlan | Catalog + lineage in one platform | Broad ecosystem |
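Once lineage is captured as a graph, computing the blast radius is a graph traversal. The sketch below assumes lineage is available as a simple mapping from each node to its direct consumers; the node names mirror the pipeline described above and are illustrative. Real tools such as OpenLineage or DataHub expose lineage through their own APIs, which are not shown here.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to its direct downstream consumers.
LINEAGE = {
    "stripe_api": ["raw_payments"],
    "raw_payments": ["clean_payments"],
    "clean_payments": ["revenue_report"],
    "shopify_api": ["raw_orders"],
    "raw_orders": ["clean_orders"],
    "clean_orders": ["revenue_report"],
}


def downstream(graph, source):
    """Return every node reachable from `source`, in breadth-first order."""
    visited = {source}
    ordered = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


# Everything affected if the Stripe API schema changes.
blast_radius = downstream(LINEAGE, "stripe_api")
```

Attaching an owner to each node would turn the same traversal into a list of people to notify before a schema change ships.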
6. Access Control and Privacy
| Regulation / Framework | Region | Key Requirements | Max Penalty |
|---|---|---|---|
| GDPR | EU | Consent, right to delete, data portability | Up to €20M or 4% of global annual turnover, whichever is higher |
| DPDPA | India | Consent, purpose limitation, restrictions on cross-border transfers | ₹250 crore (~$30M) |
| HIPAA | US (Healthcare) | PHI encryption, access logs, breach notification | $1.5M per year per violation type (statutory cap, adjusted for inflation) |
| SOC 2 | Global (SaaS) | Access controls, encryption, monitoring | N/A (audit framework; the cost is lost customer trust) |
Data Classification and Access Tiers
```text
Classification Tiers:

🔴 RESTRICTED (Level 4)
   PII: SSN, Aadhaar, credit card numbers, health records
   Access: Named individuals only, audit logged, encrypted at rest + transit
   Masking: Always masked in non-production environments

🟠 CONFIDENTIAL (Level 3)
   Business: Revenue data, employee salaries, contracts
   Access: Department heads + approved analysts
   Masking: Masked in dev/staging

🟡 INTERNAL (Level 2)
   Operations: Product catalog, user activity (non-PII), internal metrics
   Access: All employees
   Masking: None needed

🟢 PUBLIC (Level 1)
   Marketing: Published content, public APIs, documentation
   Access: Anyone
   Masking: None
```
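The tiers above can be encoded so that access and masking decisions are made by code rather than by convention. This is a minimal sketch under assumptions: the column names, the `COLUMN_TIERS` mapping, and the `can_read` and `must_mask` helpers are hypothetical, and a real deployment would enforce the same rules in the warehouse or catalog rather than in application code.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical classification tags, one per column.
COLUMN_TIERS = {
    "customers.ssn": Tier.RESTRICTED,
    "finance.revenue": Tier.CONFIDENTIAL,
    "products.name": Tier.INTERNAL,
    "docs.body": Tier.PUBLIC,
}


def can_read(column, clearance, named_grants=frozenset()):
    """Decide read access from the user's clearance tier.

    RESTRICTED columns ignore clearance: only users with an explicit,
    per-column grant (named individuals) may read them.
    """
    tier = COLUMN_TIERS[column]
    if tier == Tier.RESTRICTED:
        return column in named_grants
    return clearance >= tier


def must_mask(column, environment):
    """CONFIDENTIAL and RESTRICTED data is masked outside production."""
    return environment != "production" and COLUMN_TIERS[column] >= Tier.CONFIDENTIAL
```

For example, `can_read("finance.revenue", Tier.INTERNAL)` is false, while `must_mask("customers.ssn", "staging")` is true.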
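```sql
-- SQL implementation (PostgreSQL): row-level security
-- A policy has no effect until RLS is enabled on the table.
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY customer_access ON customers
    USING (region = current_setting('app.user_region'));

-- Column masking for PII: grant access to the view, not the base table
CREATE VIEW safe_customers AS
SELECT
    id,
    name,
    regexp_replace(email, '(.)(.*)(@.*)', '\1***\3') AS email_masked,
    'XXX-XXX-' || right(phone, 4) AS phone_masked
FROM customers;
```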
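When masking has to happen outside the database, for example in an export job, the same rules can be applied in application code. The sketch below mirrors the view's logic using Python's standard `re` module; the function names are hypothetical, and the fallback for values without an `@` is an added assumption so that malformed emails are not returned unmasked.

```python
import re


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    if "@" not in email:
        # Assumption: anything that is not a well-formed email is fully masked.
        return "***"
    return re.sub(r"(.)(.*)(@.*)", r"\1***\3", email, count=1)


def mask_phone(phone):
    """Expose only the last four characters of the phone number."""
    return "XXX-XXX-" + phone[-4:]


masked = {
    "email": mask_email("alice@example.com"),
    "phone": mask_phone("555-123-4567"),
}
```

Here `masked` holds `a***@example.com` and `XXX-XXX-4567`, matching what the `safe_customers` view would return for the same inputs.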
7. Implementation Roadmap
| Phase | Timeline | Actions | Outcome |
|---|---|---|---|
| 1. Audit | Week 1-2 | Inventory all data sources, classify PII, map data flows | Data inventory document |
| 2. Catalog | Week 3-4 | Deploy data catalog, connect sources, assign owners | Searchable catalog |
| 3. Quality | Week 5-8 | Add dbt tests, set up anomaly detection, define SLAs | Quality dashboard + alerts |
| 4. Access | Week 9-12 | Implement classification, RLS, PII masking, audit logs | Compliant access controls |
| 5. Lineage | Week 13-16 | Enable OpenLineage in pipelines, build lineage views | Full data lineage graph |
Frequently Asked Questions
Do small companies need data governance?
Yes — at minimum, data quality tests and PII tracking. You don't need a full governance program, but knowing where customer PII lives and having basic quality checks prevents problems that are expensive to fix later. Start with dbt tests on your critical tables. That's governance.
Who should own data governance?
The data engineering team owns the tooling and automation. Business domain experts own the data definitions and quality rules. A data governance lead (or committee) sets policies and resolves disputes. In smaller orgs, the data team lead wears all three hats. The key: governance is a shared responsibility, not a single person's job.
How do I handle GDPR's right to deletion?
You need to know every place a user's PII exists — this is where a data catalog and lineage are critical. When a deletion request comes in: delete from the primary database, propagate to all downstream systems (analytics, backups, ML features, third-party tools), and log the deletion. Automate this — manual deletion processes don't scale and miss edge cases.
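The delete, propagate, and log steps can be sketched as a single function. This is an illustration under assumptions: `InMemoryStore` stands in for real systems (warehouse, feature store, third-party tools), and `process_deletion_request` is a hypothetical name. In practice each store would wrap that system's own deletion API, and backups usually need a separate, documented retention policy.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for a real system that holds user data."""

    def __init__(self, name, records):
        self.name = name
        self.records = dict(records)

    def delete_user(self, user_id):
        """Remove the user's record; return True if something was deleted."""
        return self.records.pop(user_id, None) is not None


def process_deletion_request(user_id, stores, audit_log, now=None):
    """Delete `user_id` from every store, then record the outcome.

    `stores` should list the primary database first, followed by downstream
    systems. A failure in one store is recorded and does not stop the rest.
    """
    now = now or datetime.now(timezone.utc)
    results = {}
    for store in stores:
        try:
            results[store.name] = "deleted" if store.delete_user(user_id) else "not_found"
        except Exception as exc:
            results[store.name] = f"failed: {exc}"
    audit_log.append({"user_id": user_id, "at": now.isoformat(), "results": results})
    return results


primary = InMemoryStore("primary_db", {"u1": {"email": "a@example.com"}, "u2": {}})
analytics = InMemoryStore("analytics", {"u1": {"orders": 3}})
features = InMemoryStore("ml_features", {})
audit_log = []

outcome = process_deletion_request("u1", [primary, analytics, features], audit_log)
```

The list of stores to visit is exactly what the catalog and lineage graph should provide; hard-coding it, as done here, is what tends to miss edge cases.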
What's the difference between data governance and data management?
Data management is the operational work — building pipelines, managing databases, maintaining infrastructure. Data governance is the policies and controls that guide how that work is done — who can access what, what quality standards apply, how changes are approved. Management is the "how," governance is the "rules of how."
How do I get buy-in for data governance?
Don't pitch "governance." Pitch solutions to existing pain: "analysts spend 30% of their time finding and validating data" (→ data catalog). "We can't answer the auditor's question about data lineage" (→ lineage tool). "Bad data caused a wrong report last month" (→ quality checks). Tie governance to business problems, not compliance checkboxes.
Pillai Infotech LLP
We implement data governance frameworks that enable rather than restrict — catalogs, quality automation, and compliance controls. Let's govern your data right.