How would you build a data catalog with lineage and governance?
A data catalog with lineage and governance is your organization-wide index of datasets, columns, dashboards, jobs, and policies. It tells you what data you have, where it came from, who owns it, who can access it, and how it moves across pipelines. Think of it as a search engine plus a graph of flows plus a policy brain. In a system design interview, this problem tests your ability to design a metadata-first platform that scales, keeps facts fresh, and enforces rules without blocking delivery.
Why It Matters
A strong catalog raises confidence and speed across distributed systems and scalable architectures.
- Faster discovery so engineers and analysts spend less time hunting for tables
- Trust through visible lineage, quality signals, and owners on call
- Safer access with fine grained policies, masking, and audits that satisfy compliance
- Lower incident time by tracing broken data to the upstream change
- Better change management with schema versioning and deprecation workflows
How It Works, Step by Step
Below is a practical blueprint you can adapt during an interview or in production.
Step one: define the metadata model
Start with an explicit model. Core entities usually include dataset, field, job, dashboard, user, group, domain, tag, policy, and incident. Relationships capture produces, consumes, owns, documents, and depends on. Add version and time fields everywhere.
- Table level and column level metadata
- Operational stats such as freshness, row count, and cost hints
- Business context such as domain, owner, purpose, and classification for privacy
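To make the model concrete, here is a minimal sketch in Python. The entity names (Dataset, Field, LineageEdge) and fields are illustrative assumptions, not any specific catalog's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"


@dataclass
class Field:
    name: str
    data_type: str
    description: str = ""
    classification: Classification = Classification.INTERNAL


@dataclass
class Dataset:
    urn: str                                   # stable identifier, e.g. "warehouse.sales.orders"
    domain: str
    owner: str
    fields: list[Field] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    version: int = 1
    updated_at: datetime = field(default_factory=datetime.utcnow)


@dataclass
class LineageEdge:
    upstream_urn: str                          # "produces" and "consumes" become edges in the graph
    downstream_urn: str
    job_urn: str
    observed_at: datetime = field(default_factory=datetime.utcnow)
```

Every entity carries a version and a timestamp so the catalog can answer point-in-time questions later.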
Step two: collect metadata from sources
Use connectors and crawlers to ingest schemas and stats from warehouses, lakes, orchestration tools, query engines, and BI tools. Two common patterns exist.
- Pull based crawlers that scan systems on a schedule
- Push based emitters in jobs that send events to the catalog during runs
Aim for both. Crawlers backfill. Emitters keep things fresh within minutes.
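A push-based emitter can be a thin wrapper around each job that posts a run event when the job finishes. The endpoint URL and payload shape below are assumptions for illustration, not a specific catalog's API:

```python
import json
import time
import urllib.request

# Hypothetical ingestion endpoint; real catalogs expose their own APIs.
CATALOG_INGEST_URL = "https://catalog.internal.example.com/api/v1/events"


def emit_run_event(job_urn: str, inputs: list[str], outputs: list[str]) -> None:
    """Push a job-run event so lineage and freshness stay current within minutes."""
    event = {
        "event_type": "job_run_completed",
        "job_urn": job_urn,
        "inputs": inputs,        # dataset URNs the job read
        "outputs": outputs,      # dataset URNs the job wrote
        "emitted_at": time.time(),
    }
    req = urllib.request.Request(
        CATALOG_INGEST_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)
```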
Step three: build lineage
Lineage links sources, transforms, and sinks. Capture at both table level and column level.
- Parse queries and job plans to infer read and write sets
- Instrument pipelines to emit events with inputs and outputs
- Reconcile runs into a directed acyclic graph per dataset version
- Keep change history so you can answer questions at a past point in time
Column level lineage is gold for impact analysis and selective masking.
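As a rough sketch of the reconciliation step, run events carrying input and output URNs can be folded into an adjacency map, which already supports downstream impact queries. The event field names are assumed:

```python
from collections import defaultdict


def build_lineage_graph(run_events: list[dict]) -> dict[str, set[str]]:
    """Reconcile job-run events into upstream -> downstream edges."""
    downstream_of: dict[str, set[str]] = defaultdict(set)
    for event in run_events:
        for src in event["inputs"]:
            for dst in event["outputs"]:
                downstream_of[src].add(dst)
    return downstream_of


def impacted_datasets(graph: dict[str, set[str]], changed: str) -> set[str]:
    """Walk the graph to find everything downstream of a changed dataset."""
    seen, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```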
Step four: store metadata in the right engines
Use a graph store for lineage and relationships. Use a document or relational store for entity detail and versions. Index key fields in a search engine for fast discovery.
- Graph store for nodes and edges at scale
- Document or relational store for entity snapshots and history
- Search index for full text with facets such as domain, owner, freshness, and tags
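One way to picture the write path is a single logical upsert fanned out to all three stores. The graph_db, doc_db, and search_index clients below are placeholders for whichever engines you pick, so treat this as a sketch of responsibilities rather than real client calls:

```python
def upsert_dataset(dataset: dict, graph_db, doc_db, search_index) -> None:
    """Fan one logical upsert out to three stores, each serving a different access pattern."""
    # Document store: the full entity snapshot plus version history.
    doc_db.save(key=dataset["urn"], value=dataset)

    # Graph store: nodes and edges for lineage traversal and impact analysis.
    graph_db.upsert_node(dataset["urn"], label="dataset")
    for upstream in dataset.get("upstream", []):
        graph_db.upsert_edge(upstream, dataset["urn"], label="produces")

    # Search index: a slim document carrying only the facets people filter on.
    search_index.index(
        doc_id=dataset["urn"],
        body={
            "name": dataset["name"],
            "domain": dataset["domain"],
            "owner": dataset["owner"],
            "tags": dataset.get("tags", []),
        },
    )
```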
Step five: add a policy engine for governance
Governance is not only access control. It includes classification, retention, and usage guardrails.
- Classify data with automated scanners for names, emails, and other sensitive classes
- Model policies with RBAC and ABAC so rules can reference roles, attributes, tags, and lineage
- Enforce policies in query engines with row level filters and column masking
- Log every decision with who, what, when, and why for audits
Policies should be human readable and testable like code.
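A toy decision function shows the shape of such a policy: roles and column tags go in, an allow, mask, or deny decision plus an audit record come out. The specific roles and tags are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    user: str
    user_roles: set[str]
    column: str
    column_tags: set[str]        # e.g. {"pii.email"}


def decide(request: AccessRequest) -> str:
    """Return 'allow', 'mask', or 'deny' for one column access.

    Rules are hard-coded here for illustration; in practice they would be
    declarative, versioned, and tested like code.
    """
    if "pii.email" in request.column_tags:
        if "finance" in request.user_roles:
            return "allow"
        if "analyst" in request.user_roles:
            return "mask"        # the query engine rewrites the column to a masked value
        return "deny"
    return "allow"


def audit_log(request: AccessRequest, decision: str, reason: str) -> None:
    """Record who, what, when, and why for every decision."""
    print(f"audit: user={request.user} column={request.column} decision={decision} reason={reason}")
```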
Step six: surface discovery and context in a portal
The portal is where people search, browse, and contribute.
- Global search with synonyms, ranking by usage, and quick filters
- Dataset pages with schema, samples, owners, quality checks, downstream dashboards, and run history
- Lineage graph with time slider, column level detail, and impact analysis
- Edit flows for docs, owners, and tags with reviews and notifications
Documentation with examples and data contracts turns the portal into a living playbook.
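A data contract can start as small as a checked-in dictionary that the catalog renders on the dataset page. The fields and thresholds below are illustrative assumptions:

```python
# A lightweight, illustrative data contract; names, thresholds, and checks are assumptions.
orders_contract = {
    "dataset": "warehouse.sales.orders",
    "owner": "team-commerce",
    "schema": [
        {"name": "order_id", "type": "string", "nullable": False},
        {"name": "amount", "type": "decimal(12,2)", "nullable": False},
        {"name": "customer_email", "type": "string", "classification": "pii"},
    ],
    "freshness_slo_minutes": 60,
    "quality_checks": ["row_count > 0", "null_rate(order_id) == 0"],
    "deprecation_policy": "30 days notice to downstream owners",
}
```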
Step seven: integrate data quality signals
Quality tells you if data is fresh and correct enough to trust.
- Profile columns for distributions, nulls, and outliers
- Run tests in pipelines and collect pass or fail as metadata
- Compute data health scores that roll up to tables, domains, and business units
Show quality banners on dataset pages and block risky changes when needed.
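A simple rollup might weight critical checks more heavily and normalize to a 0-100 score the portal can render as a banner. The weighting scheme here is just one possible choice:

```python
def health_score(checks: list[dict]) -> float:
    """Roll test results into a 0-100 health score, weighting critical checks higher."""
    if not checks:
        return 0.0
    total = sum(3 if c.get("critical") else 1 for c in checks)
    passed = sum((3 if c.get("critical") else 1) for c in checks if c["passed"])
    return round(100 * passed / total, 1)


table_checks = [
    {"name": "freshness_under_1h", "passed": True, "critical": True},
    {"name": "no_null_order_ids", "passed": True, "critical": True},
    {"name": "amount_within_range", "passed": False, "critical": False},
]
print(health_score(table_checks))  # 85.7 -> shown as a banner on the dataset page
```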
Step eight: support change management
You need to manage evolution in distributed data systems.
- Detect schema changes and classify them as compatible or breaking
- Gate production writes when an unapproved breaking change appears
- Provide a deprecation flow with owners, migration steps, and sunset dates
- Notify downstream owners automatically based on lineage
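A minimal classifier over the old and new field maps captures the common rule of thumb that additive changes are compatible while removals and type changes are breaking. Real compatibility rules depend on your formats and readers, so this is only a sketch:

```python
def classify_schema_change(old_fields: dict[str, str], new_fields: dict[str, str]) -> str:
    """Label a schema diff as 'compatible' or 'breaking'.

    Removed columns and changed types break downstream readers;
    purely additive columns do not.
    """
    removed = old_fields.keys() - new_fields.keys()
    retyped = {f for f in old_fields.keys() & new_fields.keys()
               if old_fields[f] != new_fields[f]}
    return "breaking" if removed or retyped else "compatible"


old = {"order_id": "string", "amount": "decimal"}
new = {"order_id": "string", "amount": "float", "coupon_code": "string"}
print(classify_schema_change(old, new))  # "breaking" -> gate the write and notify downstream owners
```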
Step nine: power programmatic access with APIs and events
Everything in the catalog should be accessible as APIs.
- Read and write APIs for entities, lineage, and policies
- Webhooks and streams for change events so other systems react in near real time
- Batch export and import for migration, backup, and offline analysis
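Downstream systems can then subscribe and react. Here is a sketch of a webhook or stream consumer, with event types and payload fields assumed for illustration:

```python
def handle_catalog_event(event: dict) -> None:
    """React to catalog change events pushed over a webhook or stream (shape is illustrative)."""
    if event["type"] == "schema.breaking_change":
        notify_owners(event["dataset_urn"], event["downstream_owners"])
    elif event["type"] == "dataset.deprecated":
        open_migration_ticket(event["dataset_urn"], event["sunset_date"])


def notify_owners(dataset_urn: str, owners: list[str]) -> None:
    print(f"notify {owners}: breaking change on {dataset_urn}")


def open_migration_ticket(dataset_urn: str, sunset_date: str) -> None:
    print(f"ticket: migrate off {dataset_urn} before {sunset_date}")
```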
Step ten: plan for scale and reliability
Treat the catalog as a product with SLOs.
- Throughput targets for event ingestion and search queries
- Indexing queues with backpressure and retry logic
- Idempotent upserts so repeated events do not duplicate edges
- Caching hot metadata for the portal and policy checks
- Blue green upgrades for the portal and services to avoid downtime
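Idempotency usually comes from keying writes on a stable identity so redelivered events overwrite rather than duplicate. A minimal in-memory sketch:

```python
def upsert_edge(edges: dict, src: str, dst: str, run_id: str) -> None:
    """Idempotent upsert keyed on (src, dst): replaying an event overwrites instead of duplicating."""
    edges[(src, dst)] = {"run_id": run_id}


edges: dict = {}
for _ in range(3):  # simulate at-least-once delivery retrying the same lineage event
    upsert_edge(edges, "raw.orders", "mart.daily_sales", run_id="run-42")
print(len(edges))  # 1 edge, no matter how many times the event is replayed
```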
Real World Example
Picture a streaming company with a lake on object storage, a warehouse for analytics, and a lakehouse engine for batch and streaming. Engineers run jobs through an orchestrator and analysts query through a SQL platform and build dashboards in a BI tool.
- Crawlers scan the warehouse daily to ingest schemas and stats
- Job wrappers emit run events with input and output datasets to an event bus
- A lineage processor reconciles events into a graph so you can click from a dashboard back to every upstream table and job
- A policy engine masks sensitive columns automatically for analysts outside the finance role
- The portal shows a table page with owners, quality status, a last refreshed timestamp, and a lineage graph
- When a team changes a column name, the catalog classifies it as a breaking change, opens a change request, and pings every downstream owner listed in lineage
Engineers find trusted data faster. Compliance teams get auditable controls. Leaders get a view of adoption by domain.
Common Pitfalls and Trade-offs
- Only table level lineage and no column level detail leads to noisy impact analysis and blocked releases
- Stale metadata due to crawl only ingestion breaks trust and kills adoption
- Tag sprawl without governance makes policies hard to reason about and often contradictory
- Centralized approval for every change slows teams and drives shadow data pipelines
- A shiny portal without APIs or events becomes a dead wiki that no system integrates with
- Policies written in one engine but enforced in another lead to gaps and surprise leaks
- No version history means you cannot answer what changed last week when a dashboard broke
Interview Tip
Interviewers often push on freshness and enforcement. A strong move is to describe a two-path ingestion model (push from jobs plus pull crawlers) and to explain how the policy engine intercepts requests in query engines and pipelines. Be ready to walk through a breaking schema change and the exact notifications and gates your design triggers.
Key Takeaways
- Model metadata as first class entities and edges with version history and time
- Capture lineage from both query parsing and job events to reach column level fidelity
- Enforce governance with RBAC and ABAC backed by classification, masking, and full audit logs
- Make discovery delightful with search, docs, examples, and a time aware lineage graph
- Expose APIs and streams so the catalog becomes a platform, not just a portal
Comparison Table
| Choice | When to Pick | Strengths | Risks / Trade-offs |
|---|---|---|---|
| Build vs Buy | Build if you have data engineering maturity and need custom lineage; buy if speed and compliance are top priorities | Full control and flexibility (build); faster deployment and support (buy) | Higher maintenance cost (build); vendor lock-in and limited customization (buy) |
| Graph Database vs Relational DB for Lineage | Graph for deep lineage and dependency traversal; relational for small-scale metadata | Fast graph traversal and impact analysis | Complexity of scaling and managing graph databases |
| Push-based vs Pull-based Metadata Ingestion | Push for near-real-time pipelines; pull for batch or legacy systems | High freshness and event-driven updates (push) | Requires pipeline instrumentation (push); stale data between crawls (pull) |
| Centralized vs Federated Governance | Centralized for regulated enterprises; federated for domain-oriented data mesh setups | Consistent policies and audit trail (centralized) | Bottlenecks and slower agility (centralized); policy drift across teams (federated) |
| Static vs Dynamic Policy Enforcement | Static for controlled ETL validation; dynamic for ad-hoc query enforcement | Catch issues before deployment (static); flexible runtime access control (dynamic) | Less flexibility (static); increased latency in enforcement (dynamic) |
FAQs
Q1. What is the difference between a data catalog and a data dictionary?
A dictionary lists fields and definitions. A catalog adds ownership, lineage, usage, quality, policies, and a portal for discovery and collaboration.
Q2. How fresh should lineage be for daily analytics?
Within minutes is ideal. Achieve this with push events from jobs plus light crawls for backfill.
Q3. Do I need column level lineage from day one?
Start with table level to launch quickly but plan the model and processors for column level. Many impact and masking use cases require it.
Q4. Where should policies be enforced in the stack?
As close to the access point as possible. That usually means the query engine and the pipeline runtime. Always log the decision.
Q5. What is the right store for metadata at scale?
Use a graph store for lineage and relationships, a document or relational store for entity detail and history, and a search index for discovery. Each is good at a different access pattern.
Q6. How do I measure success of a catalog program?
Track search volume, active users, coverage of critical tables, lineage depth, time to owner, policy violations prevented, and incident time reduction.
Further Learning
To deepen your understanding of data catalogs, metadata services, and scalable governance design, explore these DesignGurus.io resources:
- Grokking System Design Fundamentals: Learn the building blocks of distributed systems and how to model metadata, relationships, and access controls effectively.
- Grokking Scalable Systems for Interviews: Master advanced topics like event-driven pipelines, data lineage graphs, and multi-region consistency, all key for designing production-grade data catalogs.
- Grokking the System Design Interview: Practice interview-focused breakdowns of real-world systems like search, metadata services, and data discovery portals.
These courses collectively guide you from foundational understanding to scalable architecture design, ensuring you can confidently handle data lineage and governance questions in top-tier system design interviews.