How would you build a data catalog with lineage and governance?
A data catalog with lineage and governance is your organization-wide index of datasets, columns, dashboards, jobs, and policies. It tells you what data you have, where it came from, who owns it, who can access it, and how it moves across pipelines. Think of it as a search engine plus a graph of flows plus a policy brain. In a system design interview, this problem tests your ability to design a metadata-first platform that scales, keeps facts fresh, and enforces rules without blocking delivery.
Why It Matters
A strong catalog raises confidence and speed across distributed systems and scalable architectures.
- Faster discovery so engineers and analysts spend less time hunting for tables
- Trust through visible lineage, quality signals, and owners on call
- Safer access with fine grained policies, masking, and audits that satisfy compliance
- Lower incident time by tracing broken data to the upstream change
- Better change management with schema versioning and deprecation workflows
How It Works, Step by Step
Below is a practical blueprint you can adapt during an interview or in production.
Step one: define the metadata model
Start with an explicit model. Core entities usually include dataset, field, job, dashboard, user, group, domain, tag, policy, and incident. Relationships capture produces, consumes, owns, documents, and depends on. Add version and time fields everywhere.
- Table level and column level metadata
- Operational stats such as freshness, row count, and cost hints
- Business context such as domain, owner, purpose, and classification for privacy
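To make the model concrete, here is a minimal sketch in Python. The entity names (Dataset, Field, LineageEdge) and fields are illustrative assumptions, not any specific catalog's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"


@dataclass
class Field:
    name: str
    data_type: str
    description: str = ""
    classification: Classification = Classification.INTERNAL


@dataclass
class Dataset:
    urn: str                                   # stable identifier, e.g. "warehouse.sales.orders"
    domain: str
    owner: str
    fields: list[Field] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    version: int = 1
    updated_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class LineageEdge:
    upstream_urn: str                          # "produces" and "consumes" become edges in the graph
    downstream_urn: str
    job_urn: str
    observed_at: datetime = field(default_factory=datetime.utcnow)
```

Every entity carries a version and a timestamp so the catalog can answer point-in-time questions later.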
Step two: collect metadata from sources
Use connectors and crawlers to ingest schemas and stats from warehouses, lakes, orchestration tools, query engines, and BI tools. Two common patterns exist.
- Pull based crawlers that scan systems on a schedule
- Push based emitters in jobs that send events to the catalog during runs
Aim for both. Crawlers backfill. Emitters keep things fresh within minutes.
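A push-based emitter can be a thin wrapper around each job that posts a run event when the job finishes. The endpoint URL and payload shape below are assumptions for illustration, not a specific catalog's API:

```python
import json
import time
import urllib.request

# Hypothetical ingestion endpoint; real catalogs expose their own APIs.
CATALOG_INGEST_URL = "https://catalog.internal.example.com/api/v1/events"


def emit_run_event(job_urn: str, inputs: list[str], outputs: list[str]) -> None:
    """Push a job-run event so lineage and freshness stay current within minutes."""
    event = {
        "event_type": "job_run_completed",
        "job_urn": job_urn,
        "inputs": inputs,        # dataset URNs the job read
        "outputs": outputs,      # dataset URNs the job wrote
        "emitted_at": time.time(),
    }
    req = urllib.request.Request(
        CATALOG_INGEST_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)
```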
Step three: build lineage
Lineage links sources, transforms, and sinks. Capture at both table level and column level.
- Parse queries and job plans to infer read and write sets
- Instrument pipelines to emit events with inputs and outputs
- Reconcile runs into a directed acyclic graph per dataset version
- Keep change history so you can answer questions at a past point in time
Column level lineage is gold for impact analysis and selective masking.
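As a rough sketch of the reconciliation step, run events carrying input and output URNs can be folded into an adjacency map, which already supports downstream impact queries. The event field names are assumed:

```python
from collections import defaultdict


def build_lineage_graph(run_events: list[dict]) -> dict[str, set[str]]:
    """Reconcile job-run events into upstream -> downstream edges."""
    downstream_of: dict[str, set[str]] = defaultdict(set)
    for event in run_events:
        for src in event["inputs"]:
            for dst in event["outputs"]:
                downstream_of[src].add(dst)
    return downstream_of


def impacted_datasets(graph: dict[str, set[str]], changed: str) -> set[str]:
    """Walk the graph to find everything downstream of a changed dataset."""
    seen, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```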
Step four: store metadata in the right engines
Use a graph store for lineage and relationships. Use a document or relational store for entity detail and versions. Index key fields in a search engine for fast discovery.
- Graph store for nodes and edges at scale
- Document or relational store for entity snapshots and history
- Search index for full text with facets such as domain, owner, freshness, and tags
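One way to picture the write path is a single logical upsert fanned out to all three stores. The graph_db, doc_db, and search_index clients below are placeholders for whichever engines you pick, so treat this as a sketch of responsibilities rather than real client calls:

```python
def upsert_dataset(dataset: dict, graph_db, doc_db, search_index) -> None:
    """Fan one logical upsert out to three stores, each serving a different access pattern."""
    # Document store: the full entity snapshot plus version history.
    doc_db.save(key=dataset["urn"], value=dataset)

    # Graph store: nodes and edges for lineage traversal and impact analysis.
    graph_db.upsert_node(dataset["urn"], label="dataset")
    for upstream in dataset.get("upstream", []):
        graph_db.upsert_edge(upstream, dataset["urn"], label="produces")

    # Search index: a slim document carrying only the facets people filter on.
    search_index.index(
        doc_id=dataset["urn"],
        body={
            "name": dataset["name"],
            "domain": dataset["domain"],
            "owner": dataset["owner"],
            "tags": dataset.get("tags", []),
        },
    )
```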
Step five: add a policy engine for governance
Governance is not only access control. It includes classification, retention, and usage guardrails.
- Classify data with automated scanners for names, emails, and other sensitive classes
- Model policies with RBAC and ABAC so rules can reference roles, attributes, tags, and lineage
- Enforce policies in query engines with row level filters and column masking
- Log every decision with who, what, when, and why for audits
Policies should be human readable and testable like code.
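A toy decision function shows the shape of such a policy: roles and column tags go in, an allow, mask, or deny decision plus an audit record come out. The specific roles and tags are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    user: str
    user_roles: set[str]
    column: str
    column_tags: set[str]        # e.g. {"pii.email"}


def decide(request: AccessRequest) -> str:
    """Return 'allow', 'mask', or 'deny' for one column access.

    Rules are hard-coded here for illustration; in practice they would be
    declarative, versioned, and tested like code.
    """
    if "pii.email" in request.column_tags:
        if "finance" in request.user_roles:
            return "allow"
        if "analyst" in request.user_roles:
            return "mask"        # the query engine rewrites the column to a masked value
        return "deny"
    return "allow"


def audit_log(request: AccessRequest, decision: str, reason: str) -> None:
    """Record who, what, when, and why for every decision."""
    print(f"audit: user={request.user} column={request.column} decision={decision} reason={reason}")
```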
Step six: surface discovery and context in a portal
The portal is where people search, browse, and contribute.
- Global search with synonyms, ranking by usage, and quick filters
- Dataset pages with schema, samples, owners, quality checks, downstream dashboards, and run history
- Lineage graph with time slider, column level detail, and impact analysis
- Edit flows for docs, owners, and tags with reviews and notifications
Documentation with examples and data contracts turns the portal into a living playbook.
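A data contract can start as small as a checked-in dictionary that the catalog renders on the dataset page. The fields and thresholds below are illustrative assumptions:

```python
# A lightweight, illustrative data contract; names, thresholds, and checks are assumptions.
orders_contract = {
    "dataset": "warehouse.sales.orders",
    "owner": "team-commerce",
    "schema": [
        {"name": "order_id", "type": "string", "nullable": False},
        {"name": "amount", "type": "decimal(12,2)", "nullable": False},
        {"name": "customer_email", "type": "string", "classification": "pii"},
    ],
    "freshness_slo_minutes": 60,
    "quality_checks": ["row_count > 0", "null_rate(order_id) == 0"],
    "deprecation_policy": "30 days notice to downstream owners",
}
```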
Step seven: integrate data quality signals
Quality tells you if data is fresh and correct enough to trust.
- Profile columns for distributions, nulls, and outliers
- Run tests in pipelines and collect pass or fail as metadata
- Compute data health scores that roll up to tables, domains, and business units
Show quality banners on dataset pages and block risky changes when needed.
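A simple rollup might weight critical checks more heavily and normalize to a 0-100 score the portal can render as a banner. The weighting scheme here is just one possible choice:

```python
def health_score(checks: list[dict]) -> float:
    """Roll test results into a 0-100 health score, weighting critical checks higher."""
    if not checks:
        return 0.0
    total = sum(3 if c.get("critical") else 1 for c in checks)
    passed = sum((3 if c.get("critical") else 1) for c in checks if c["passed"])
    return round(100 * passed / total, 1)


table_checks = [
    {"name": "freshness_under_1h", "passed": True, "critical": True},
    {"name": "no_null_order_ids", "passed": True, "critical": True},
    {"name": "amount_within_range", "passed": False, "critical": False},
]
print(health_score(table_checks))  # 85.7 -> shown as a banner on the dataset page
```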
Step eight: support change management
You need to manage evolution in distributed data systems.
- Detect schema changes and classify them as compatible or breaking
- Gate production writes when an unapproved breaking change appears
- Provide a deprecation flow with owners, migration steps, and sunset dates
- Notify downstream owners automatically based on lineage
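A minimal classifier over the old and new field maps captures the common rule of thumb that additive changes are compatible while removals and type changes are breaking. Real compatibility rules depend on your formats and readers, so this is only a sketch:

```python
def classify_schema_change(old_fields: dict[str, str], new_fields: dict[str, str]) -> str:
    """Label a schema diff as 'compatible' or 'breaking'.

    Removed columns and changed types break downstream readers;
    purely additive columns do not.
    """
    removed = old_fields.keys() - new_fields.keys()
    retyped = {f for f in old_fields.keys() & new_fields.keys()
               if old_fields[f] != new_fields[f]}
    return "breaking" if removed or retyped else "compatible"


old = {"order_id": "string", "amount": "decimal"}
new = {"order_id": "string", "amount": "float", "coupon_code": "string"}
print(classify_schema_change(old, new))  # "breaking" -> gate the write and notify downstream owners
```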
Step nine: power programmatic access with APIs and events
Everything in the catalog should be accessible as APIs.
- Read and write APIs for entities, lineage, and policies
- Webhooks and streams for change events so other systems react in near real time
- Batch export and import for migration, backup, and offline analysis
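Downstream systems can then subscribe and react. Here is a sketch of a webhook or stream consumer, with event types and payload fields assumed for illustration:

```python
def handle_catalog_event(event: dict) -> None:
    """React to catalog change events pushed over a webhook or stream (shape is illustrative)."""
    if event["type"] == "schema.breaking_change":
        notify_owners(event["dataset_urn"], event["downstream_owners"])
    elif event["type"] == "dataset.deprecated":
        open_migration_ticket(event["dataset_urn"], event["sunset_date"])


def notify_owners(dataset_urn: str, owners: list[str]) -> None:
    print(f"notify {owners}: breaking change on {dataset_urn}")


def open_migration_ticket(dataset_urn: str, sunset_date: str) -> None:
    print(f"ticket: migrate off {dataset_urn} before {sunset_date}")
```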
Step ten: plan for scale and reliability
Treat the catalog as a product with SLOs.
- Throughput targets for event ingestion and search queries
- Indexing queues with backpressure and retry logic
- Idempotent upserts so repeated events do not duplicate edges
- Caching hot metadata for the portal and policy checks
- Blue green upgrades for the portal and services to avoid downtime
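Idempotency usually comes from keying writes on a stable identity so redelivered events overwrite rather than duplicate. A minimal in-memory sketch:

```python
def upsert_edge(edges: dict, src: str, dst: str, run_id: str) -> None:
    """Idempotent upsert keyed on (src, dst): replaying an event overwrites instead of duplicating."""
    edges[(src, dst)] = {"run_id": run_id}


edges: dict = {}
for _ in range(3):  # simulate at-least-once delivery retrying the same lineage event
    upsert_edge(edges, "raw.orders", "mart.daily_sales", run_id="run-42")
print(len(edges))  # 1 edge, no matter how many times the event is replayed
```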
Real World Example
Picture a streaming company with a lake on object storage, a warehouse for analytics, and a lakehouse engine for batch and streaming. Engineers run jobs through an orchestrator and analysts query through a SQL platform and build dashboards in a BI tool.
- Crawlers scan the warehouse daily to ingest schemas and stats
- Job wrappers emit run events with input and output datasets to an event bus
- A lineage processor reconciles events into a graph so you can click from a dashboard back to every upstream table and job
- A policy engine masks sensitive columns automatically for analysts outside the finance role
- The portal shows a table page with owners, quality status, a last refreshed timestamp, and a lineage graph
- When a team changes a column name, the catalog classifies it as a breaking change, opens a change request, and pings every downstream owner listed in lineage
Engineers find trusted data faster. Compliance teams get auditable controls. Leaders get a view of adoption by domain.
Common Pitfalls and Trade-offs
- Only table level lineage and no column level detail leads to noisy impact analysis and blocked releases
- Stale metadata due to crawl only ingestion breaks trust and kills adoption
- Tag sprawl without governance makes policies hard to reason about and often contradictory
- Centralized approval for every change slows teams and drives shadow data pipelines
- A shiny portal without APIs or events becomes a dead wiki that no system integrates with
- Policies written in one engine but enforced in another lead to gaps and surprise leaks
- No version history means you cannot answer what changed last week when a dashboard broke
Interview Tip
Interviewers often push on freshness and enforcement. A strong move is to describe a two-path ingestion model (push from jobs plus pull crawlers) and to explain how the policy engine intercepts requests in query engines and pipelines. Be ready to walk through a breaking schema change and the exact notifications and gates your design triggers.
Key Takeaways
- Model metadata as first class entities and edges with version history and time
- Capture lineage from both query parsing and job events to reach column level fidelity
- Enforce governance with RBAC and ABAC backed by classification, masking, and full audit logs
- Make discovery delightful with search, docs, examples, and a time aware lineage graph
- Expose APIs and streams so the catalog becomes a platform, not just a portal
Comparison Table
| Choice | When to Pick | Strengths | Risks / Trade-offs |
|---|---|---|---|
| Build vs Buy | Build if you have data engineering maturity and need custom lineage; buy if speed and compliance are top priorities | Full control and flexibility (build); faster deployment and support (buy) | Higher maintenance cost (build); vendor lock-in and limited customization (buy) |
| Graph Database vs Relational DB for Lineage | Graph for deep lineage and dependency traversal; relational for small-scale metadata | Fast graph traversal and impact analysis | Complexity of scaling and managing graph databases |
| Push-based vs Pull-based Metadata Ingestion | Push for near-real-time pipelines; pull for batch or legacy systems | High freshness and event-driven updates (push) | Requires pipeline instrumentation (push); stale data between crawls (pull) |
| Centralized vs Federated Governance | Centralized for regulated enterprises; federated for domain-oriented data mesh setups | Consistent policies and audit trail (centralized) | Bottlenecks and slower agility (centralized); policy drift across teams (federated) |
| Static vs Dynamic Policy Enforcement | Static for controlled ETL validation; dynamic for ad-hoc query enforcement | Catch issues before deployment (static); flexible runtime access control (dynamic) | Less flexibility (static); increased latency in enforcement (dynamic) |
FAQs
Q1. What is the difference between a data catalog and a data dictionary?
A dictionary lists fields and definitions. A catalog adds ownership, lineage, usage, quality, policies, and a portal for discovery and collaboration.
Q2. How fresh should lineage be for daily analytics?
Within minutes is ideal. Achieve this with push events from jobs plus light crawls for backfill.
Q3. Do I need column level lineage from day one?
Start with table level to launch quickly but plan the model and processors for column level. Many impact and masking use cases require it.
Q4. Where should policies be enforced in the stack?
As close to the access point as possible. That usually means the query engine and the pipeline runtime. Always log the decision.
Q5. What is the right store for metadata at scale?
Use a graph store for lineage and relationships, a document or relational store for entity detail and history, and a search index for discovery. Each is good at a different access pattern.
Q6. How do I measure success of a catalog program?
Track search volume, active users, coverage of critical tables, lineage depth, time to owner, policy violations prevented, and incident time reduction.
Further Learning
To deepen your understanding of data catalogs, metadata services, and scalable governance design, explore these DesignGurus.io resources:
- Grokking System Design Fundamentals: Learn the building blocks of distributed systems and how to model metadata, relationships, and access controls effectively.
- Grokking Scalable Systems for Interviews: Master advanced topics like event-driven pipelines, data lineage graphs, and multi-region consistency, all key for designing production-grade data catalogs.
- Grokking the System Design Interview: Practice interview-focused breakdowns of real-world systems like search, metadata services, and data discovery portals.
These courses collectively guide you from foundational understanding to scalable architecture design, ensuring you can confidently handle data lineage and governance questions in top-tier system design interviews.