How would you build configuration‑as‑data (typed, validated, auditable)?

Configuration as data means you treat configuration like structured data that lives in a reliable store with a clear schema, strict validation, and a full audit trail. Instead of sprinkling constants through code or relying on ad hoc environment variables, you define types, validate changes before rollout, and deliver configuration through a control plane with safety checks. This gives teams a consistent and testable way to ship changes without redeploys while keeping reliability high.

Why It Matters

Interviewers love this topic because it reveals how you balance safety and speed in distributed systems. Typed and validated configuration reduces outages from bad values, supports gradual rollouts, and enables fast incident response through instant rollback. An audit trail that records who changed what, when, and why also satisfies compliance needs in regulated environments and improves on-call debugging.

How It Works Step by Step

Step 1. Start with a schema. Model each config object with a schema language such as Protocol Buffers, JSON Schema, Avro, or an API schema. Capture field types, required fields, default values, ranges, and enums. Add cross-field invariants, for example that traffic fractions must sum to one or that a feature can be enabled only when its dependencies are enabled. Version the schema so older services can read newer data safely.
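As a rough illustration of Step 1, the sketch below models one config object in plain Python rather than in a real schema language. The `PrefetchConfig` class, its field names, and the 1 to 16 range are hypothetical; the aim is only to show types, defaults, an enum, a range check, and cross-field invariants living next to the data.

```python
from dataclasses import dataclass, field
from enum import Enum


class BackoffPolicy(Enum):
    FIXED = "fixed"
    EXPONENTIAL = "exponential"


@dataclass(frozen=True)
class PrefetchConfig:
    """Hypothetical config object; every field name here is illustrative."""
    schema_version: int                      # required, no default
    enabled: bool = False
    max_parallel_fetches: int = 2            # assumed allowed range: 1..16
    backoff: BackoffPolicy = BackoffPolicy.EXPONENTIAL
    cohort_weights: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Range check on a single field.
        if not 1 <= self.max_parallel_fetches <= 16:
            raise ValueError("max_parallel_fetches must be in 1..16")
        # Cross-field invariant: weights, when present, must sum to one.
        if self.cohort_weights:
            total = sum(self.cohort_weights.values())
            if abs(total - 1.0) > 1e-9:
                raise ValueError("cohort_weights must sum to 1.0")
        # Cross-field invariant: enabling the feature requires a traffic split.
        if self.enabled and not self.cohort_weights:
            raise ValueError("enabled=True requires cohort_weights")
```

Constructing `PrefetchConfig(schema_version=1, max_parallel_fetches=99)` raises `ValueError`, so an invalid object can never exist in memory. A schema language adds what this sketch lacks: cross-language code generation and wire compatibility rules.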

Step 2. Author through a product-grade workflow. Treat changes like code: use pull requests, peer review, and automated checks. Keep base values in a global layer, then add environment overlays for staging and production. For multi-tenant products, allow scoped overrides per tenant, region, device, or user cohort. Never embed secrets in config; store references to a secrets manager instead.
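One way to picture the layering in Step 2 is a recursive merge in which later, more specific layers win. The layer contents, the `secret_ref` key, and the `secrets://` reference format below are assumptions made for illustration, not the convention of any particular secrets manager.

```python
from copy import deepcopy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict; overlay values win and nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def effective_config(base: dict, overlays: list) -> dict:
    """Apply overlays in order, from least to most specific."""
    result = deepcopy(base)
    for overlay in overlays:
        result = deep_merge(result, overlay)
    return result


# Hypothetical layers: global base, environment overlay, tenant override.
BASE = {
    "cache": {"ttl_seconds": 60, "size_mb": 256},
    # Only a reference to the secret is stored, never the secret value.
    "db_password": {"secret_ref": "secrets://payments/db-password"},
}
PRODUCTION = {"cache": {"size_mb": 1024}}
TENANT_ACME = {"cache": {"ttl_seconds": 30}}

effective = effective_config(BASE, [PRODUCTION, TENANT_ACME])
```

Here `effective["cache"]` combines the production size with the tenant TTL, while `BASE` itself is left unmodified. The same function doubles as a debugging tool that answers "what does this tenant actually see in production".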

Step 3. Validate before rollout. Run static checks: schema validation, linting, and policy-as-code rules with Open Policy Agent style constraints. Run dynamic checks: a dry run against synthetic traffic or a shadow environment. Enforce referential integrity: a feature flag reference must exist, a region name must be valid, and a percentage rollout must be within allowed bounds. Reject any change that violates invariants.
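The static part of Step 3 might look like the sketch below. It is written in plain Python rather than in a policy language, and the registries (`KNOWN_FLAGS`, `KNOWN_REGIONS`), the 25 percent cap, and the shape of the `change` dict are all hypothetical.

```python
# Hypothetical registries that a real system would load from its source of truth.
KNOWN_FLAGS = {"video_prefetch", "new_checkout"}
KNOWN_REGIONS = {"us-east", "eu-west", "ap-south"}
MAX_ROLLOUT_PERCENT = 25  # assumed policy: one change may not exceed this


def validate_change(change: dict) -> list:
    """Return a list of human-readable errors; an empty list means the change passes."""
    errors = []

    # Referential integrity: the flag must exist.
    flag = change.get("flag")
    if not isinstance(flag, str) or flag not in KNOWN_FLAGS:
        errors.append(f"unknown flag: {flag!r}")

    # Referential integrity: every region name must be valid.
    for region in change.get("regions", []):
        if not isinstance(region, str) or region not in KNOWN_REGIONS:
            errors.append(f"unknown region: {region!r}")

    # Policy bound: the rollout percentage must be an integer within limits.
    percent = change.get("rollout_percent")
    if isinstance(percent, bool) or not isinstance(percent, int):
        errors.append("rollout_percent must be an integer")
    elif not 0 <= percent <= MAX_ROLLOUT_PERCENT:
        errors.append(f"rollout_percent must be in 0..{MAX_ROLLOUT_PERCENT}")

    return errors
```

Collecting every error instead of stopping at the first one gives the author a complete report in a single review cycle. A CI job would reject the change whenever the returned list is non-empty.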

Step 4. Store for consistency and auditability. Use a transactional store with versioned snapshots. Git-based flows are common because you get history for free, but a database with append-only audit tables also works. Record actor identity, timestamp, diff, and a human-written reason. Tag each release with a changelog entry and keep a pointer to the last known good version.
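To make the audit requirements in Step 4 concrete, here is a minimal in-memory sketch of an append-only log with a last-known-good pointer. The `ConfigStore` and `AuditRecord` names are hypothetical, the diff is reduced to a list of changed top-level keys, and a real store would add durability and transactions.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

_MISSING = object()  # distinguishes "key absent" from "key set to None"


@dataclass(frozen=True)
class AuditRecord:
    version: int
    actor: str
    reason: str
    timestamp: str
    changed_keys: tuple   # simplified diff: top-level keys that changed
    snapshot: str         # canonical JSON of the full config
    checksum: str


class ConfigStore:
    def __init__(self) -> None:
        self._log = []                       # append-only; records are immutable
        self.last_known_good: Optional[int] = None

    def commit(self, config: dict, actor: str, reason: str) -> AuditRecord:
        if not reason.strip():
            raise ValueError("a human-written reason is required")
        previous = json.loads(self._log[-1].snapshot) if self._log else {}
        changed = tuple(sorted(
            key for key in previous.keys() | config.keys()
            if previous.get(key, _MISSING) != config.get(key, _MISSING)
        ))
        snapshot = json.dumps(config, sort_keys=True)
        record = AuditRecord(
            version=len(self._log) + 1,
            actor=actor,
            reason=reason,
            timestamp=datetime.now(timezone.utc).isoformat(),
            changed_keys=changed,
            snapshot=snapshot,
            checksum=hashlib.sha256(snapshot.encode()).hexdigest(),
        )
        self._log.append(record)
        return record

    def get(self, version: int) -> dict:
        if not 1 <= version <= len(self._log):
            raise KeyError(version)
        return json.loads(self._log[version - 1].snapshot)

    def mark_good(self, version: int) -> None:
        self.get(version)                    # raises KeyError for unknown versions
        self.last_known_good = version
```

Rolling back is then just `store.get(store.last_known_good)` followed by a new commit, which keeps the rollback itself in the audit trail.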

Step 5. Publish through a control plane. A controller watches the source of truth and publishes to a distribution layer. Clients subscribe via streaming, long polling with ETags, or short polling with backoff. The control plane applies policy, computes effective values per scope, and signs payloads to prevent tampering. It can also pause rollouts or automatically roll back when health checks degrade.
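The signing and ETag ideas from Step 5 can be sketched with the standard library alone. This version assumes a shared HMAC key for brevity; a real control plane might prefer asymmetric signatures so that clients cannot forge payloads. The payload layout and function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_payload(config: dict, key: bytes) -> dict:
    """Serialize canonically, then attach an HMAC-SHA256 signature and an ETag."""
    body = json.dumps(config, sort_keys=True, separators=(",", ":"))
    raw = body.encode()
    return {
        "body": body,
        "signature": hmac.new(key, raw, hashlib.sha256).hexdigest(),
        "etag": hashlib.sha256(raw).hexdigest()[:16],
    }


def verify_payload(payload: dict, key: bytes) -> dict:
    """Return the parsed config, or raise ValueError if the signature does not match."""
    expected = hmac.new(key, payload["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, payload["signature"]):
        raise ValueError("signature mismatch: payload rejected")
    return json.loads(payload["body"])


def poll(published: dict, client_etag: Optional[str]) -> Optional[dict]:
    """Return None when the client is already current, similar to HTTP 304."""
    return None if published["etag"] == client_etag else published
```

A client keeps the last ETag it applied, calls `poll` with it, and runs `verify_payload` before swapping in any new snapshot.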

Step 6. Roll out safely. Use progressive delivery: target internal users first, then five percent of traffic, then twenty-five percent, and so on. Gate changes by region, device type, or app version. Track health metrics and error-budget consumption during the ramp. Roll back instantly to the last known good version if any SLO is threatened.
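A common way to implement the ramp in Step 6 is deterministic hash bucketing, sketched below. The stage list beyond twenty-five percent, the `Stage` type, and the policy of returning to "off" on any failed health gate are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    internal_users: bool   # whether employees receive the change at this stage
    percent: int           # share of external traffic, 0..100


# Hypothetical ramp: off, internal only, 5%, 25%, everyone.
RAMP = (Stage(False, 0), Stage(True, 0), Stage(True, 5), Stage(True, 25), Stage(True, 100))


def bucket(user_id: str, change_id: str) -> int:
    """Map a user to a stable bucket in 0..99; salting by change avoids always picking the same users."""
    digest = hashlib.sha256(f"{change_id}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(stage: Stage, user_id: str, change_id: str, is_internal: bool = False) -> bool:
    if is_internal and stage.internal_users:
        return True
    return bucket(user_id, change_id) < stage.percent


def advance(index: int, healthy: bool) -> int:
    """Move one stage forward when healthy; any failed health gate returns to stage 0."""
    if not healthy:
        return 0
    return min(index + 1, len(RAMP) - 1)
```

Because a user's bucket is fixed for a given change, anyone included at five percent stays included at twenty-five percent, which keeps the experience stable as the ramp widens.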

Step 7. Enforce types at read time. Provide generated client libraries that expose typed getters, so callers cannot read a string where an integer is expected. Libraries supply defaults and time-to-live semantics, and log every read to support change attribution later. Services cache locally and refresh atomically so there is no mixed partial state.
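The sketch below is a hand-written stand-in for the generated client that Step 7 describes. The `TypedConfigClient` name and its methods are hypothetical, and read logging and TTL handling are left out to keep it short. It shows two ideas: getters that refuse values of the wrong type, and a refresh that swaps the whole snapshot in one step.

```python
import threading
from types import MappingProxyType


class TypedConfigClient:
    def __init__(self, snapshot: dict) -> None:
        self._write_lock = threading.Lock()
        # A read-only view over a private copy; callers cannot mutate it.
        self._snapshot = MappingProxyType(dict(snapshot))

    def refresh(self, new_snapshot: dict) -> None:
        """Build the new view first, then publish it with a single reference swap."""
        frozen = MappingProxyType(dict(new_snapshot))
        with self._write_lock:
            self._snapshot = frozen

    def view(self):
        """Return one consistent snapshot for callers that read several keys together."""
        return self._snapshot

    def get_int(self, key: str, default: int) -> int:
        value = self._snapshot.get(key, default)
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{key!r} is not an integer: {value!r}")
        return value

    def get_str(self, key: str, default: str) -> str:
        value = self._snapshot.get(key, default)
        if not isinstance(value, str):
            raise TypeError(f"{key!r} is not a string: {value!r}")
        return value
```

A reader that holds the result of `view()` keeps seeing the old snapshot in full even if `refresh` runs mid-request, so it never observes a mix of old and new values.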

Step 8. Observe and audit continuously. Attach the config version and change ID to every request log and metric. When an incident happens, you can correlate a spike with a specific change and actor. Keep a searchable audit trail that answers who changed what and why. This is critical for compliance and for fast root cause analysis.
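For the logging half of Step 8, Python's standard `logging.Filter` can stamp every record with the active config version. The `ConfigContextFilter` and `build_logger` names, the `get_context` callback, and the log format are assumptions made for this sketch.

```python
import logging


class ConfigContextFilter(logging.Filter):
    """Adds config_version and change_id attributes to every log record."""

    def __init__(self, get_context) -> None:
        super().__init__()
        # get_context is a hypothetical callback returning (version, change_id).
        self._get_context = get_context

    def filter(self, record: logging.LogRecord) -> bool:
        version, change_id = self._get_context()
        record.config_version = version
        record.change_id = change_id
        return True  # never drop the record, only enrich it


def build_logger(name: str, stream, get_context) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s config_version=%(config_version)s "
        "change_id=%(change_id)s %(message)s"
    ))
    # Attaching the filter to the handler guarantees the fields exist before formatting.
    handler.addFilter(ConfigContextFilter(get_context))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger
```

With the callback wired to the config client, a search for one `change_id` returns every request served under that change, which is what makes correlation during an incident quick.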

Step 9. Plan for failure and recovery. If the control plane is unreachable, services keep operating with a cached and signed snapshot. Protect against stale caches with expiry windows, and support a freeze mode that blocks risky edits during peak periods. Practice disaster recovery by restoring a snapshot to a clean cluster and verifying that clients converge.
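The fallback behavior in Step 9 could be sketched as follows. The `load_config` function, the cache file layout, and the `SnapshotExpired` error are hypothetical, and signature verification is omitted here to keep the focus on the cache and its expiry window.

```python
import json
import time
from pathlib import Path


class SnapshotExpired(RuntimeError):
    """Raised when the only available snapshot is older than the expiry window."""


def load_config(fetch, cache_path: Path, max_age_seconds: float, now=time.time) -> dict:
    """Prefer the control plane; fall back to the local snapshot if it is fresh enough.

    `fetch` is a hypothetical callable that returns the config dict or raises
    ConnectionError when the control plane is unreachable.
    """
    try:
        config = fetch()
    except ConnectionError:
        cached = json.loads(cache_path.read_text())
        age = now() - cached["fetched_at"]
        if age > max_age_seconds:
            raise SnapshotExpired(f"cached snapshot is {age:.0f}s old")
        return cached["config"]

    # Successful fetch: refresh the local snapshot for the next outage.
    cache_path.write_text(json.dumps({"fetched_at": now(), "config": config}))
    return config
```

Choosing `max_age_seconds` is a trade-off: a long window rides out long control plane outages, while a short one limits how long a service can run on values that have since been rolled back.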

Step 10. Keep compatibility and migrations simple. Prefer additive changes: add new fields with safe defaults and deprecate old ones over time. When a breaking change is unavoidable, ship code that understands both shapes and migrate gradually with a temporary translation layer.
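A temporary translation layer of the kind Step 10 mentions can be very small. In this sketch the breaking change is a hypothetical rename from `timeout_seconds` (version 1) to `timeout_ms` (version 2); the field names and the 5000 ms default are invented for illustration.

```python
DEFAULT_TIMEOUT_MS = 5000  # assumed safe default for documents that omit the field


def upgrade(raw: dict) -> dict:
    """Translate a version 1 document to the version 2 shape; version 2 passes through."""
    if raw.get("schema_version", 1) >= 2:
        return dict(raw)
    upgraded = {k: v for k, v in raw.items() if k != "timeout_seconds"}
    upgraded["schema_version"] = 2
    if "timeout_seconds" in raw:
        upgraded["timeout_ms"] = int(raw["timeout_seconds"] * 1000)
    return upgraded


def read_timeout_ms(raw: dict) -> int:
    """Readers call this during the migration so both shapes behave identically."""
    return upgrade(raw).get("timeout_ms", DEFAULT_TIMEOUT_MS)
```

Once every stored document has been rewritten to version 2, the `upgrade` branch for version 1 can be deleted along with the old field.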

Real World Example

Think about a global streaming platform that ships a new video prefetch algorithm. The team defines a typed config with fields like enable, max parallel fetches, backoff policy, and eligible device list. They add invariants: parallel fetches cannot exceed a device-specific limit, and the device list must match a registry. The change lands behind a cohort rule: employees first, then one region at five percent, then three regions at twenty-five percent. The control plane stamps each rollout with a change ID. If mobile crash rates rise in a region, the system automatically rolls back there while continuing the ramp in the other, healthy regions.
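The per-region rollback in this example could be expressed as a health gate that is evaluated independently for each region. The `RegionRollout` type, the baseline crash rates, and the 1.2 tolerance factor are hypothetical values chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class RegionRollout:
    percent: int                 # current share of traffic on the new config
    baseline_crash_rate: float   # crash rate measured before the change


def apply_health_gate(rollouts: dict, crash_rates: dict, tolerance: float = 1.2) -> list:
    """Roll back only the regions whose crash rate exceeds baseline * tolerance.

    Returns the names of the regions that were rolled back.
    """
    rolled_back = []
    for region, state in rollouts.items():
        observed = crash_rates.get(region)
        if observed is None:
            continue  # no data yet: leave the region where it is
        if observed > state.baseline_crash_rate * tolerance:
            state.percent = 0
            rolled_back.append(region)
    return rolled_back
```

Because each region is judged against its own baseline, a regression that only shows up on one region's device mix does not halt the ramp everywhere else.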

Common Pitfalls or Trade-offs

Untyped key-value bags lead to silent failures. Always start with a schema and generated clients for type-safe reads.
Logic in templates turns data into code and bypasses validation. Keep config declarative. If you need conditionals, model them as expressions with a parser and validator.
Secret sprawl is risky. Use a secrets manager and store only references in config.
Global rollouts without guardrails create wide outages. Use staged ramps with health gates and automatic rollback.
Drift across environments confuses debugging. Use overlays with inheritance rules and a tool that computes the effective config for any entity and environment.
Weak auditing slows incidents. Record the actor, reason, and diff, and link to tickets, in an append-only trail.
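For the pitfall about logic in templates, one possible shape for "expressions with a parser and validator" is an allowlist over Python's own expression grammar, sketched below. Using Python syntax for the expressions and the specific node allowlist are assumptions of this sketch; a production system might define a smaller dedicated grammar instead.

```python
import ast

# Only boolean logic, comparisons, names, and literals are permitted.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.Name, ast.Load, ast.Constant,
)


def validate_expression(source: str, allowed_names: set) -> ast.Expression:
    """Parse a targeting rule and reject anything outside the allowlist."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in allowed_names:
            raise ValueError(f"unknown variable: {node.id}")
    return tree


def evaluate(tree: ast.Expression, context: dict) -> bool:
    """Evaluate an already validated rule against request attributes."""
    code = compile(tree, "<config-expr>", "eval")
    return bool(eval(code, {"__builtins__": {}}, dict(context)))
```

A rule such as `region == 'eu-west' and app_version >= 42` passes validation, while anything containing a function call or attribute access is rejected at review time, before it can reach a running service.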

Interview Tip

A strong answer mentions schema, validation, rollout safety, typed clients, and audit. Bonus points if you call out last known good snapshots, cohort-based rollouts, and the ability to freeze edits during peak traffic. Sketch a quick flow: author, validate, publish, ramp, observe, and roll back.

Key Takeaways

Treat configuration like product data with schema, validation, and auditability.
Use a control plane to publish typed config and to manage safe rollouts.
Keep config declarative and separate from secrets.
Plan for failure with signed snapshots, expiry, and instant rollback.
Capture who changed what and why to accelerate incident response.

Comparison Table

Approach	Best Fit	Type Safety	Validation Strength	Audit Trail	Change Risk
Configuration as Data	Large-scale services with frequent changes and compliance needs	Strong through schema and generated clients	Strong with static and dynamic checks	Full history with reasons and diffs	Low with staged rollout and instant rollback
Constants in Code	Simple apps with rare changes	Strong at compile time	Moderate through tests	Good through version control	High because changes need redeploys
Environment Variables	Boot-time values and secret references	Weak; values are plain strings	Weak; limited checks	Limited across fleets	Moderate; hard to audit at scale
Feature Flag Service	Runtime switches and experiments	Moderate with typed flag wrappers	Moderate; policy plus rule validation	Good per-flag history	Low for on/off toggles, but not a full config system

FAQs

Q1. What is configuration as data in a system design interview?

It is a pattern where configuration is modeled as typed data stored in a reliable source of truth. Changes go through validation and a control plane publishes them with safety and audit.

Q2. How do I validate configuration changes before rollout?

Use schema checks, policy rules, and dry runs against a staging environment. Add smoke tests and health gates so a change pauses or rolls back when SLOs degrade.

Q3. How do I prevent a bad config from causing an outage?

Roll out gradually, keep a last known good snapshot, and wire automatic rollback into your control plane. Cache signed snapshots in services so they keep working if the control plane is down.

Q4. How is configuration as data different from feature flags?

Flags are usually boolean switches with simple rules, great for experiments and kill switches. Configuration as data models richer structures and enforces invariants across many fields.

Q5. Where should secrets live if not in config?

Use a secrets manager and store references only. Clients resolve the reference at runtime with short-lived credentials and strict access control.

Q6. What store should I use for configuration at scale?

Git works for many teams because it provides history and review. A database with append-only audit tables is also common. The key is atomic versioning, durability, and easy rollback.

Further Learning

To dive deeper into configuration systems, reliability, and scalable change management:

Grokking Scalable Systems for Interviews – Learn how to design control planes, rollout mechanisms, and resilient infrastructure for large-scale distributed systems.
Grokking System Design Fundamentals – Build a strong foundation in APIs, caching, data modeling, and consistency principles essential for designing typed and validated configuration systems.
Grokking the System Design Interview – Practice end-to-end interview case studies like building configuration platforms, feature flag systems, and scalable control planes used in FAANG interviews.