
Data Governance Framework: Policies, Tools, and Implementation

Nobody gets excited about data governance — until a GDPR fine lands, an analyst uses the wrong dataset, or nobody knows where customer PII actually lives. Here's how to govern data without slowing everyone down.

Database & Data | September 26, 2025 | 13 min read

Data governance is the unsexy foundation that makes everything else work — analytics, AI, compliance, and trust. Without it, you're making decisions on data nobody can verify, training models on datasets nobody understands, and hoping you're compliant with regulations nobody has mapped. This guide makes it practical.

1. What Data Governance Actually Means

Data governance is the system of policies, processes, and tools that ensure data is accurate, secure, accessible, and compliant. It answers four questions for every dataset:

2. The Five Pillars of Data Governance

Pillar | What It Covers | Key Tools | Priority
Data Catalog | Discovery, documentation, search | DataHub, Atlan, Alation | Start here
Data Quality | Validation, profiling, anomaly detection | Great Expectations, dbt tests, Soda | High
Data Lineage | Source-to-destination tracking | OpenLineage, Marquez, DataHub | Medium
Access Control | Who can see/modify what data | Unity Catalog, Apache Ranger, IAM | High (if regulated)
Data Privacy | PII detection, masking, compliance | Presidio, Privacera, column masking | Critical (if handling PII)

3. Data Catalog — Know What You Have

A data catalog is the "Google for your data" — it lets anyone in the organization discover, understand, and trust datasets without asking the data team.
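
To make the idea concrete, here is a minimal sketch of what a catalog stores and how search over it works. The `CatalogEntry` fields, the `Catalog` class, and the example owners are all hypothetical; real tools such as DataHub or Atlan have their own, much richer metadata models.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata for one dataset. Field names are illustrative, not any tool's schema."""
    name: str
    description: str
    owner: str
    classification: str  # e.g. "RESTRICTED" or "INTERNAL"
    tags: list[str] = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a data catalog."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Case-insensitive substring match on name, description, or tags."""
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]


catalog = Catalog()
catalog.register(CatalogEntry(
    name="orders",
    description="Order data from the e-commerce platform",
    owner="commerce-data@example.com",  # hypothetical owner
    classification="INTERNAL",
    tags=["sales", "e-commerce"],
))
catalog.register(CatalogEntry(
    name="customers",
    description="Customer profiles including contact details",
    owner="crm-team@example.com",  # hypothetical owner
    classification="RESTRICTED",
    tags=["pii"],
))

# An analyst looking for PII datasets finds the dataset and who to ask about it.
for entry in catalog.search("pii"):
    print(entry.name, entry.owner, entry.classification)
```

The point of the sketch is the shape of the record: every dataset carries a description, an owner, and a classification, so discovery does not depend on asking the data team.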

Tool | Type | Best For | Pricing
DataHub | Open source (LinkedIn) | Technical teams, extensible | Free / Acryl Cloud
Atlan | SaaS | Modern data teams, great UX | Custom pricing
Unity Catalog | Open source (Databricks) | Lakehouse environments | Free / Databricks
OpenMetadata | Open source | Teams wanting full control | Free / SaaS

4. Data Quality — Trust Your Data

Dimension | Question It Answers | Check
Completeness | Are there missing values? | NULL rate per column < threshold
Uniqueness | Are there duplicates? | Unique constraint on IDs
Validity | Is data in expected format/range? | Email regex, price > 0, status in enum
Freshness | Is data up to date? | Max timestamp within expected window
Consistency | Do related datasets agree? | Row counts match, referential integrity
Volume | Did we get the expected amount? | Row count within expected range
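
The checks in the table are simple enough to sketch in plain Python. The version below runs against a list of dictionaries; the function names, the sample rows, and the thresholds are assumptions for illustration, and a real pipeline would more likely express the same checks in dbt, Great Expectations, or Soda.

```python
import re
from datetime import datetime, timedelta, timezone

# Deliberately loose email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def null_rate(rows, column):
    """Completeness: fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def duplicate_values(rows, column):
    """Uniqueness: the set of values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def invalid_rows(rows, column, predicate):
    """Validity: rows whose non-null value fails the predicate."""
    return [r for r in rows if r.get(column) is not None and not predicate(r[column])]


def is_fresh(rows, column, max_age, now):
    """Freshness: the newest timestamp is within max_age of now."""
    stamps = [r[column] for r in rows if r.get(column) is not None]
    return bool(stamps) and now - max(stamps) <= max_age


def volume_ok(rows, low, high):
    """Volume: row count falls inside the expected range."""
    return low <= len(rows) <= high


# Hypothetical sample with one null email, one duplicate ID, and one bad amount.
now = datetime(2025, 9, 26, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "email": "a@example.com", "amount": 25.0, "created_at": now - timedelta(hours=2)},
    {"order_id": 2, "email": None, "amount": -5.0, "created_at": now - timedelta(hours=30)},
    {"order_id": 2, "email": "not-an-email", "amount": 40.0, "created_at": now - timedelta(hours=5)},
]

report = {
    "email_null_rate": null_rate(rows, "email"),
    "duplicate_order_ids": duplicate_values(rows, "order_id"),
    "invalid_emails": len(invalid_rows(rows, "email", EMAIL_RE.match)),
    "non_positive_amounts": len(invalid_rows(rows, "amount", lambda a: a > 0)),
    "fresh_within_24h": is_fresh(rows, "created_at", timedelta(hours=24), now),
    "volume_ok": volume_ok(rows, 1, 1000),
}
print(report)
```

Consistency is left out because it needs a second dataset to compare against; the `relationships` test in the dbt example covers that case.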

dbt Data Quality Tests

# schema.yml: declarative data quality checks
# Requires the dbt_utils package (declare it in packages.yml, then run `dbt deps`).
version: 2

models:
  - name: orders
    description: "Order data from the e-commerce platform"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
          - dbt_utils.at_least_one    # Fail if the table is empty
      - name: customer_id
        tests:
          - not_null
          - relationships:            # Every order must point at a real customer
              to: ref('customers')
              field: customer_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']

    # Table-level tests
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 24    # Fail if the newest created_at is older than 24 hours
      # Note: dbt_utils.expression_is_true is evaluated per row, so an aggregate
      # such as "count(*) > 0" does not work there; at_least_one above covers it.

5. Data Lineage — Know Where It Came From

Data lineage tracks how data flows from source to destination — which systems produced it, which transformations modified it, and which reports consume it.

Stripe API  ──→  Raw Payments (Bronze)  ──→  Clean Payments (Silver)  ──→  Revenue Report
    │                                              │                              │
    │         Shopify API  ──→  Raw Orders  ──→  Clean Orders  ──→─────────────────┘
    │
    └── When Stripe changes their API schema, lineage tells you:
        • Which downstream tables are affected
        • Which dashboards will break
        • Who owns each step in the pipeline
        • What the blast radius is

Lineage Tool | Approach | Integrations
OpenLineage | Open standard that emits lineage events | Spark, Airflow, dbt, Flink
dbt (built-in) | Model-level lineage from the project DAG (column-level lineage is a dbt Cloud feature) | Any SQL warehouse
DataHub / Atlan | Catalog + lineage in one platform | Broad ecosystem
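
The "blast radius" question from the diagram is a graph traversal. A minimal sketch, assuming lineage is already available as a producer-to-consumer mapping: the node names mirror the diagram, while the `OWNERS` mapping and the `blast_radius` helper are hypothetical and not part of any lineage tool's API.

```python
from collections import deque

# Edges point downstream: producer -> consumers. Names mirror the diagram.
LINEAGE = {
    "Stripe API": ["Raw Payments"],
    "Raw Payments": ["Clean Payments"],
    "Clean Payments": ["Revenue Report"],
    "Shopify API": ["Raw Orders"],
    "Raw Orders": ["Clean Orders"],
    "Clean Orders": ["Revenue Report"],
}

# Hypothetical owners, used to show who to notify.
OWNERS = {
    "Raw Payments": "ingestion-team",
    "Clean Payments": "analytics-engineering",
    "Revenue Report": "finance-bi",
}


def blast_radius(graph, source):
    """Return every node downstream of source, in breadth-first order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Stripe changes its API schema: what is affected, and who owns it?
for node in blast_radius(LINEAGE, "Stripe API"):
    print(node, "owned by", OWNERS.get(node, "unassigned"))
```

Tools that consume OpenLineage events build this graph automatically from pipeline runs; the traversal on top of it stays this simple.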

6. Access Control and Privacy

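Regulation / Standard | Region | Key Requirements | Max Fine
GDPR | EU | Consent, right to delete, data portability | Up to €20M or 4% of global annual turnover, whichever is higher
DPDPA | India | Consent, purpose limitation, restrictions on cross-border transfers | Up to ₹250 crore (~$30M)
HIPAA | US (Healthcare) | PHI safeguards (encryption, access logs), breach notification | About $1.5M per year per violation category (adjusted for inflation)
SOC 2 | Global (SaaS) | Access controls, encryption, monitoring | N/A (an attestation standard, not a law; the cost is customer trust)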

Data Classification and Access Tiers

Classification Tiers:

🔴 RESTRICTED (Level 4)
   PII: SSN, Aadhaar, credit card numbers, health records
   Access: Named individuals only, audit logged, encrypted at rest + transit
   Masking: Always masked in non-production environments

🟠 CONFIDENTIAL (Level 3)
   Business: Revenue data, employee salaries, contracts
   Access: Department heads + approved analysts
   Masking: Masked in dev/staging

🟡 INTERNAL (Level 2)
   Operations: Product catalog, user activity (non-PII), internal metrics
   Access: All employees
   Masking: None needed

🟢 PUBLIC (Level 1)
   Marketing: Published content, public APIs, documentation
   Access: Anyone
   Masking: None

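-- SQL implementation (PostgreSQL syntax): Row-level security
-- A policy has no effect until RLS is enabled on the table.
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

-- Each session sees only its own region's rows,
-- e.g. after: SET app.user_region = 'EU';
CREATE POLICY customer_access ON customers
    USING (region = current_setting('app.user_region'));

-- Column masking for PII: grant analysts this view, not the base table
CREATE VIEW safe_customers AS
SELECT id, name,
       regexp_replace(email, '(.)(.*)(@.*)', '\1***\3') AS email_masked,
       'XXX-XXX-' || right(phone, 4) AS phone_masked
FROM customers;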
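
The same tier rules can also be applied in application or pipeline code, for example when copying production data into staging. The sketch below is one possible shape, not a standard: the `Tier` enum follows the four levels above, while the environment names (`"prod"`, `"staging"`), the column-to-tier mapping, and the helper functions are assumptions.

```python
import re
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


def must_mask(tier: Tier, environment: str) -> bool:
    """CONFIDENTIAL and RESTRICTED data is masked everywhere except production."""
    return tier >= Tier.CONFIDENTIAL and environment != "prod"


def mask_email(email: str) -> str:
    """Keep the first character and the domain, like the SQL view does."""
    match = re.fullmatch(r"(.)[^@]*(@.+)", email)
    # Anything that does not look like an email is fully masked rather than leaked.
    return f"{match.group(1)}***{match.group(2)}" if match else "***"


def mask_phone(phone: str) -> str:
    """Keep only the last four digits."""
    digits = re.sub(r"\D", "", phone)
    return "XXX-XXX-" + digits[-4:]


MASKERS = {"email": mask_email, "phone": mask_phone}


def mask_row(row: dict, column_tiers: dict, environment: str) -> dict:
    """Return a copy of row with sensitive columns masked for the environment."""
    masked = dict(row)
    for column, tier in column_tiers.items():
        if column in masked and must_mask(tier, environment):
            masker = MASKERS.get(column, lambda _value: "***")
            masked[column] = masker(masked[column])
    return masked


# Hypothetical classification of the customers table.
COLUMN_TIERS = {"id": Tier.INTERNAL, "email": Tier.RESTRICTED, "phone": Tier.RESTRICTED}
row = {"id": 7, "email": "jane.doe@example.com", "phone": "555-867-5309"}

print(mask_row(row, COLUMN_TIERS, "staging"))
print(mask_row(row, COLUMN_TIERS, "prod"))
```

Keeping the classification in one mapping means a newly classified column gets masked without touching the copy job itself.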

7. Implementation Roadmap

Phase | Timeline | Actions | Outcome
1. Audit | Week 1-2 | Inventory all data sources, classify PII, map data flows | Data inventory document
2. Catalog | Week 3-4 | Deploy data catalog, connect sources, assign owners | Searchable catalog
3. Quality | Week 5-8 | Add dbt tests, set up anomaly detection, define SLAs | Quality dashboard + alerts
4. Access | Week 9-12 | Implement classification, RLS, PII masking, audit logs | Compliant access controls
5. Lineage | Week 13-16 | Enable OpenLineage in pipelines, build lineage views | Full data lineage graph

Our Advice: Start with data quality (dbt tests) and a basic catalog. These give the highest ROI. Don't try to boil the ocean with a full governance program on day one; that's how governance initiatives die. Start with the 5 most critical datasets, get the framework working, then expand. Governance that enables (makes data easier to find and trust) succeeds. Governance that only restricts (adds approval gates) fails.

Frequently Asked Questions

Do small companies need data governance?

Yes — at minimum, data quality tests and PII tracking. You don't need a full governance program, but knowing where customer PII lives and having basic quality checks prevents problems that are expensive to fix later. Start with dbt tests on your critical tables. That's governance.

Who should own data governance?

The data engineering team owns the tooling and automation. Business domain experts own the data definitions and quality rules. A data governance lead (or committee) sets policies and resolves disputes. In smaller orgs, the data team lead wears all three hats. The key: governance is a shared responsibility, not a single person's job.

How do I handle GDPR's right to deletion?

You need to know every place a user's PII exists — this is where a data catalog and lineage are critical. When a deletion request comes in: delete from the primary database, propagate to all downstream systems (analytics, backups, ML features, third-party tools), and log the deletion. Automate this — manual deletion processes don't scale and miss edge cases.
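
A minimal sketch of that automation, assuming every system that holds PII is registered behind a common `delete_user` interface. The `InMemoryStore` class, the store names, and the audit-log shape are hypothetical; in practice each store would wrap a real database, warehouse, or vendor API, and the audit log itself should hold a pseudonymous reference rather than raw PII.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for a real system (primary database, warehouse, feature store)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # {user_id: data}

    def delete_user(self, user_id):
        """Delete the user's data; return True if anything was removed."""
        return self.records.pop(user_id, None) is not None


def process_deletion_request(user_id, stores, audit_log):
    """Delete user_id from every registered store and log each outcome.

    A failure in one store is recorded and does not stop the others,
    so the request can be retried for just the stores that failed.
    """
    results = {}
    for store in stores:
        try:
            removed = store.delete_user(user_id)
            status = "deleted" if removed else "not_found"
        except Exception as exc:  # keep going; record the failure for retry
            status = f"failed: {exc}"
        results[store.name] = status
        audit_log.append({
            "user_id": user_id,
            "system": store.name,
            "status": status,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return results


# Hypothetical registry; in practice it would be derived from the catalog and lineage.
stores = [
    InMemoryStore("primary_db", {"u1": {"email": "a@example.com"}, "u2": {"email": "b@example.com"}}),
    InMemoryStore("analytics", {"u1": {"orders": 3}}),
    InMemoryStore("ml_features", {}),
]
audit_log = []

results = process_deletion_request("u1", stores, audit_log)
print(results)
print(len(audit_log), "audit entries written")
```

The registry of stores is the hard part, which is why the catalog and lineage matter: a system missing from the list is a system where the user's data survives the request.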

What's the difference between data governance and data management?

Data management is the operational work — building pipelines, managing databases, maintaining infrastructure. Data governance is the policies and controls that guide how that work is done — who can access what, what quality standards apply, how changes are approved. Management is the "how," governance is the "rules of how."

How do I get buy-in for data governance?

Don't pitch "governance." Pitch solutions to existing pain: "analysts spend 30% of their time finding and validating data" (→ data catalog). "We can't answer the auditor's question about data lineage" (→ lineage tool). "Bad data caused a wrong report last month" (→ quality checks). Tie governance to business problems, not compliance checkboxes.

Pillai Infotech LLP