Data Governance Framework Guide | Pillai Infotech LLP

Data Governance Framework: Policies, Tools, and Implementation

Nobody gets excited about data governance — until a GDPR fine lands, an analyst uses the wrong dataset, or nobody knows where customer PII actually lives. Here's how to govern data without slowing everyone down.

Database & Data | September 26, 2025 | 13 min read

In This Guide

1. What Data Governance Actually Means
2. The Five Pillars of Data Governance
3. Data Catalog — Know What You Have
4. Data Quality — Trust Your Data
5. Data Lineage — Know Where It Came From
6. Access Control and Privacy
7. Implementation Roadmap
8. Frequently Asked Questions

Data governance is the unsexy foundation that makes everything else work — analytics, AI, compliance, and trust. Without it, you're making decisions on data nobody can verify, training models on datasets nobody understands, and hoping you're compliant with regulations nobody has mapped. This guide makes it practical.

1. What Data Governance Actually Means

Data governance is the system of policies, processes, and tools that ensure data is accurate, secure, accessible, and compliant. It answers four questions for every dataset:

What data do we have? — Data catalog and discovery
Is it accurate? — Data quality and validation
Where did it come from? — Data lineage and provenance
Who can access it? — Access controls, privacy, and compliance

2. The Five Pillars of Data Governance

Pillar	What It Covers	Key Tools	Priority
Data Catalog	Discovery, documentation, search	DataHub, Atlan, Alation	Start here
Data Quality	Validation, profiling, anomaly detection	Great Expectations, dbt tests, Soda	High
Data Lineage	Source-to-destination tracking	OpenLineage, Marquez, DataHub	Medium
Access Control	Who can see/modify what data	Unity Catalog, Apache Ranger, IAM	High (if regulated)
Data Privacy	PII detection, masking, compliance	Presidio, Privacera, column masking	Critical (if handling PII)

3. Data Catalog — Know What You Have

A data catalog is the "Google for your data" — it lets anyone in the organization discover, understand, and trust datasets without asking the data team.

Tool	Type	Best For	Pricing
DataHub	Open source (LinkedIn)	Technical teams, extensible	Free / Acryl Cloud
Atlan	SaaS	Modern data teams, great UX	Custom pricing
Unity Catalog	Open source (Databricks)	Lakehouse environments	Free / Databricks
OpenMetadata	Open source	Teams wanting full control	Free / SaaS
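
To make the idea concrete, here is a minimal Python sketch of what a catalog entry might store and how keyword search over it could work. The `CatalogEntry` fields and the `search` helper are hypothetical names for illustration; they are not the data model or API of DataHub, Atlan, Unity Catalog, or OpenMetadata.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset."""
    name: str
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)


def search(entries: list[CatalogEntry], keyword: str) -> list[str]:
    """Return names of datasets whose name, description, or tags mention the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in entries:
        haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
        if needle in haystack:
            hits.append(entry.name)
    return hits


catalog = [
    CatalogEntry("orders", "data-eng", "Order data from the e-commerce platform", ["sales"]),
    CatalogEntry("customers", "crm-team", "Customer master records", ["pii"]),
]

print(search(catalog, "pii"))  # ['customers']
```

A real catalog adds schema details, usage statistics, and lineage links, but the core value is the same: every dataset has an owner, a description, and tags that someone can search.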

4. Data Quality — Trust Your Data

Dimension	Question It Answers	Check
Completeness	Are there missing values?	NULL rate per column < threshold
Uniqueness	Are there duplicates?	Unique constraint on IDs
Validity	Is data in expected format/range?	Email regex, price > 0, status in enum
Freshness	Is data up to date?	Max timestamp within expected window
Consistency	Do related datasets agree?	Row counts match, referential integrity
Volume	Did we get the expected amount?	Row count within expected range
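
The dimensions in the table can be sketched as small Python functions over rows held in memory. This is only an illustration under assumed names (`null_rate`, `has_duplicates`, `invalid_rows`, `is_fresh`, `volume_ok`, and the sample `orders` rows are all hypothetical); in practice these checks usually run inside the warehouse.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Completeness: share of rows where the column is missing."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def has_duplicates(rows, column):
    """Uniqueness: True if any value appears more than once."""
    values = [r[column] for r in rows]
    return len(values) != len(set(values))


def invalid_rows(rows, column, allowed):
    """Validity: rows whose value is outside the allowed set."""
    return [r for r in rows if r[column] not in allowed]


def is_fresh(rows, column, max_age, now):
    """Freshness: the newest timestamp falls within the expected window."""
    return bool(rows) and now - max(r[column] for r in rows) <= max_age


def volume_ok(rows, low, high):
    """Volume: row count falls within the expected range."""
    return low <= len(rows) <= high


# Hypothetical sample data with a fixed "now" so results are reproducible
now = datetime(2025, 9, 26, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_id": 1, "amount": 40.0, "status": "shipped", "created_at": now - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "status": "pending", "created_at": now - timedelta(hours=5)},
    {"order_id": 2, "amount": 15.5, "status": "refunded", "created_at": now - timedelta(hours=30)},
]
allowed_statuses = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

print(null_rate(orders, "amount") < 0.05)                        # False: 1 of 3 amounts is missing
print(has_duplicates(orders, "order_id"))                        # True: order_id 2 appears twice
print(len(invalid_rows(orders, "status", allowed_statuses)))     # 1: 'refunded' is not allowed
print(is_fresh(orders, "created_at", timedelta(hours=24), now))  # True
print(volume_ok(orders, 1, 1000))                                # True
```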

dbt Data Quality Tests

# schema.yml — declarative data quality checks
version: 2

models:
  - name: orders
    description: "Order data from the e-commerce platform"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']

    # Table-level tests
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 24    # Fail if no data in 24 hours
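      # Volume: dbt_utils.expression_is_true is evaluated row by row, so it
      # cannot wrap an aggregate like count(*). This test needs the
      # dbt_expectations package installed.
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1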

5. Data Lineage — Know Where It Came From

Data lineage tracks how data flows from source to destination — which systems produced it, which transformations modified it, and which reports consume it.

Stripe API  ──→  Raw Payments (Bronze)  ──→  Clean Payments (Silver)  ──→  Revenue Report
    │                                              │                              │
    │         Shopify API  ──→  Raw Orders  ──→  Clean Orders  ──→─────────────────┘
    │
    └── When Stripe changes their API schema, lineage tells you:
        • Which downstream tables are affected
        • Which dashboards will break
        • Who owns each step in the pipeline
        • What the blast radius is
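
The blast-radius question reduces to a graph traversal. Below is a minimal Python sketch, assuming the lineage from the diagram is stored as an adjacency map; the `LINEAGE` and `OWNERS` dictionaries and the team names are hypothetical, and real lineage tools collect this graph from pipeline metadata rather than from a hand-written dict.

```python
from collections import deque

# Edges copied from the diagram: each node maps to its direct downstream nodes
LINEAGE = {
    "Stripe API": ["Raw Payments (Bronze)"],
    "Raw Payments (Bronze)": ["Clean Payments (Silver)"],
    "Clean Payments (Silver)": ["Revenue Report"],
    "Shopify API": ["Raw Orders"],
    "Raw Orders": ["Clean Orders"],
    "Clean Orders": ["Revenue Report"],
}

# Hypothetical owners, for illustration only
OWNERS = {
    "Raw Payments (Bronze)": "data-eng",
    "Clean Payments (Silver)": "data-eng",
    "Revenue Report": "finance-analytics",
}


def downstream(graph, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Everything affected when the Stripe API schema changes, with the owner to notify
for node in downstream(LINEAGE, "Stripe API"):
    print(node, "->", OWNERS.get(node, "unowned"))
```

Running the same traversal from "Shopify API" shows that both sources converge on the Revenue Report, which is why a change in either one needs that report's owner in the loop.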

Lineage Tool	Approach	Integrations
OpenLineage	Open standard — emits lineage events	Spark, Airflow, dbt, Flink
dbt (built-in)	Model-level lineage (DAG) from ref() and source(); column-level lineage in dbt Cloud	Any SQL warehouse
DataHub / Atlan	Catalog + lineage in one platform	Broad ecosystem

6. Access Control and Privacy

Regulation	Region	Key Requirements	Max Fine
GDPR	EU	Consent, right to delete, data portability	€20M or 4% of global annual turnover, whichever is higher
DPDPA	India	Consent, purpose limitation, cross-border transfer restrictions	₹250 crore (~$30M)
HIPAA	US (Healthcare)	PHI encryption, access logs, breach notification	Up to $1.5M per year per violation category (adjusted annually for inflation)
SOC 2	Global (SaaS)	Access controls, encryption, monitoring	N/A (customer trust)
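
The GDPR ceiling is the higher of a fixed amount and a share of turnover: €20M or 4% of worldwide annual turnover. A small Python sketch of that arithmetic follows; the function name `gdpr_max_fine_eur` is hypothetical, and this computes only the statutory upper bound, not what a regulator would actually impose.

```python
def gdpr_max_fine_eur(annual_turnover_eur: int) -> int:
    """Upper bound of a GDPR fine: the higher of EUR 20M or 4% of turnover."""
    four_percent = annual_turnover_eur * 4 // 100  # integer math avoids float rounding
    return max(20_000_000, four_percent)


print(gdpr_max_fine_eur(100_000_000))    # 20000000: 4% is only 4M, so the fixed cap applies
print(gdpr_max_fine_eur(1_000_000_000))  # 40000000: 4% of turnover exceeds 20M
```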

Data Classification and Access Tiers

Classification Tiers:

🔴 RESTRICTED (Level 4)
   PII: SSN, Aadhaar, credit card numbers, health records
   Access: Named individuals only, audit logged, encrypted at rest + transit
   Masking: Always masked in non-production environments

🟠 CONFIDENTIAL (Level 3)
   Business: Revenue data, employee salaries, contracts
   Access: Department heads + approved analysts
   Masking: Masked in dev/staging

🟡 INTERNAL (Level 2)
   Operations: Product catalog, user activity (non-PII), internal metrics
   Access: All employees
   Masking: None needed

🟢 PUBLIC (Level 1)
   Marketing: Published content, public APIs, documentation
   Access: Anyone
   Masking: None
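
One way to keep these tiers enforceable is to encode them once and reuse them in every access and masking decision. The Python sketch below is an assumption-laden illustration: the `Tier` enum mirrors the four levels above, while the `User` fields, `can_access`, and `must_mask` are hypothetical names, and real systems would delegate these checks to the warehouse or an IAM layer.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass
class User:
    """Hypothetical user attributes needed for the tier rules."""
    is_employee: bool = False
    is_department_head: bool = False
    is_approved_analyst: bool = False
    named_datasets: frozenset = frozenset()  # datasets this person is named on


def can_access(user: User, tier: Tier, dataset: str) -> bool:
    """Apply the access rule of each tier."""
    if tier == Tier.PUBLIC:
        return True
    if tier == Tier.INTERNAL:
        return user.is_employee
    if tier == Tier.CONFIDENTIAL:
        return user.is_department_head or user.is_approved_analyst
    return dataset in user.named_datasets  # RESTRICTED: named individuals only


def must_mask(tier: Tier, environment: str) -> bool:
    """Levels 3 and 4 are masked everywhere except production."""
    return tier >= Tier.CONFIDENTIAL and environment != "production"


analyst = User(is_employee=True, is_approved_analyst=True)
print(can_access(analyst, Tier.CONFIDENTIAL, "salaries"))      # True
print(can_access(analyst, Tier.RESTRICTED, "health_records"))  # False
print(must_mask(Tier.RESTRICTED, "staging"))                   # True
```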

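-- SQL implementation (PostgreSQL syntax): Row-level security
-- A policy has no effect until RLS is enabled on the table
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY customer_access ON customers
    USING (region = current_setting('app.user_region'));

-- Column masking for PII: grant analysts SELECT on the view, not on the base table
CREATE VIEW safe_customers AS
SELECT id, name,
       regexp_replace(email, '(.)(.*)(@.*)', '\1***\3') AS email_masked,
       'XXX-XXX-' || right(phone, 4) AS phone_masked
FROM customers;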
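
The same masking rules can be applied in application or pipeline code, for example when copying production data into a dev environment. This Python sketch mirrors the two SQL expressions above; the function names `mask_email` and `mask_phone` are hypothetical.

```python
import re


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    # Same pattern as the SQL view; input without an '@' is returned unchanged
    return re.sub(r"(.)(.*)(@.*)", r"\1***\3", email, count=1)


def mask_phone(phone: str) -> str:
    """Keep only the last four characters of the phone number."""
    return "XXX-XXX-" + phone[-4:]


print(mask_email("alice@example.com"))  # a***@example.com
print(mask_phone("555-123-4567"))       # XXX-XXX-4567
```

Note that the mask always emits three asterisks regardless of the original length, so the masked value does not leak how long the local part was.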

7. Implementation Roadmap

Phase	Timeline	Actions	Outcome
1. Audit	Week 1-2	Inventory all data sources, classify PII, map data flows	Data inventory document
2. Catalog	Week 3-4	Deploy data catalog, connect sources, assign owners	Searchable catalog
3. Quality	Week 5-8	Add dbt tests, set up anomaly detection, define SLAs	Quality dashboard + alerts
4. Access	Week 9-12	Implement classification, RLS, PII masking, audit logs	Compliant access controls
5. Lineage	Week 13-16	Enable OpenLineage in pipelines, build lineage views	Full data lineage graph

Our Advice: Start with data quality (dbt tests) and a basic catalog. These give the highest ROI. Don't try to boil the ocean with a full governance program on day one — that's how governance initiatives die. Start with the 5 most critical datasets, get the framework working, then expand. Governance that enables (makes data easier to find and trust) succeeds. Governance that only restricts (adds approval gates) fails.

8. Frequently Asked Questions

Do small companies need data governance?

Yes — at minimum, data quality tests and PII tracking. You don't need a full governance program, but knowing where customer PII lives and having basic quality checks prevents problems that are expensive to fix later. Start with dbt tests on your critical tables. That's governance.

Who should own data governance?

The data engineering team owns the tooling and automation. Business domain experts own the data definitions and quality rules. A data governance lead (or committee) sets policies and resolves disputes. In smaller orgs, the data team lead wears all three hats. The key: governance is a shared responsibility, not a single person's job.

How do I handle GDPR's right to deletion?

You need to know every place a user's PII exists — this is where a data catalog and lineage are critical. When a deletion request comes in: delete from the primary database, propagate to all downstream systems (analytics, backups, ML features, third-party tools), and log the deletion. Automate this — manual deletion processes don't scale and miss edge cases.
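
The deletion flow described above (delete from every system, then log the outcome) can be sketched in Python. Everything here is a stand-in: `InMemoryStore` plays the role of a real system such as a database or a third-party tool, and `process_deletion_request` and the store names are hypothetical.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for one system that holds user PII (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = dict(records)

    def delete_user(self, user_id):
        """Remove the user's record; return True if one existed."""
        existed = user_id in self.records
        self.records.pop(user_id, None)
        return existed


def process_deletion_request(user_id, stores, audit_log):
    """Delete the user from every store and append one audit entry."""
    results = {}
    for store in stores:
        try:
            results[store.name] = "deleted" if store.delete_user(user_id) else "not_found"
        except Exception as exc:  # keep going so one failing system does not block the rest
            results[store.name] = f"failed: {exc}"
    audit_log.append({
        "user_id": user_id,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    })
    return results


stores = [
    InMemoryStore("primary_db", {42: {"email": "alice@example.com"}}),
    InMemoryStore("analytics", {42: {"orders": 3}, 7: {"orders": 1}}),
    InMemoryStore("ml_features", {7: {"churn_score": 0.2}}),
]
audit_log = []
print(process_deletion_request(42, stores, audit_log))
```

The list of stores is exactly what the catalog and lineage graph should supply; any entry marked "failed" in the audit log needs a retry before the request can be considered complete.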

What's the difference between data governance and data management?

Data management is the operational work — building pipelines, managing databases, maintaining infrastructure. Data governance is the policies and controls that guide how that work is done — who can access what, what quality standards apply, how changes are approved. Management is the "how," governance is the "rules of how."

How do I get buy-in for data governance?

Don't pitch "governance." Pitch solutions to existing pain: "analysts spend 30% of their time finding and validating data" (→ data catalog). "We can't answer the auditor's question about data lineage" (→ lineage tool). "Bad data caused a wrong report last month" (→ quality checks). Tie governance to business problems, not compliance checkboxes.


Pillai Infotech LLP

We implement data governance frameworks that enable rather than restrict — catalogs, quality automation, and compliance controls. Let's govern your data right.
