Service Discovery Patterns: DNS, Consul, and Kubernetes-Native Approaches


On This Page
Client-Side vs Server-Side Discovery
Client-Side Discovery
Server-Side Discovery
Approach 1: DNS-Based Service Discovery
How It Works
Advantages
Limitations
Approach 2: HashiCorp Consul
How It Works
Health Checking
Multi-Datacenter Federation
When to Use Consul
Approach 3: Kubernetes-Native Service Discovery
How It Works
Headless Services
Health Checking
Limitations
Choosing the Right Approach
Service Registration Patterns
Self-Registration
Third-Party Registration
Sidecar Registration
DNS Record Types for Service Discovery
Service Discovery and Service Mesh
Real-World Patterns
Single Kubernetes Cluster (Most Common)
Kubernetes + Legacy VMs
Multi-Region Active-Active
Hybrid Cloud (AWS + On-Premises)
How This Shows Up in System Design Interviews
Common Mistakes
Conclusion: Key Takeaways
What This Blog Covers
- What service discovery is and why hardcoding addresses fails at scale
- Client-side vs server-side discovery patterns
- Three approaches: DNS-based, Consul, and Kubernetes-native
- Health checking and how each approach handles unhealthy instances
- When to use each approach (and when to combine them)
- How to discuss service discovery in system design interviews
You deploy a payment service with three instances behind a load balancer.
The order service needs to call the payment service.
You hardcode the load balancer's IP address in the order service's configuration. It works. Then the load balancer's IP changes during an infrastructure migration.
Every service that calls the payment service breaks simultaneously. You spend four hours updating configuration files and redeploying.
This is the problem service discovery solves.
In a microservices architecture, services need to find each other dynamically.
Instances scale up and down.
Containers are created and destroyed.
IP addresses change constantly.
Hardcoding addresses is brittle and breaks whenever the infrastructure changes.
Service discovery maintains a registry of available service instances and their network locations. When a service needs to call another service, it queries the registry instead of relying on hardcoded addresses.
The registry knows which instances are healthy and where they are.
For understanding how service discovery fits into the broader microservices architecture, A Guide to Modern Microservices [2026 Edition] covers the full landscape.
Client-Side vs Server-Side Discovery
Before examining specific tools, you need to understand the two fundamental patterns.
For a detailed comparison of these two patterns, System Design Deep Dive: Client-Side vs. Server-Side Service Discovery covers the architectural trade-offs.
Client-Side Discovery
The client queries the service registry directly, receives a list of healthy instances, and chooses one (using round-robin, random, or weighted selection).
The client is responsible for load balancing.
Advantages: No additional infrastructure between client and server. The client can implement sophisticated load balancing (least connections, latency-based routing). No single point of failure in the routing path.
Disadvantages: Every client needs a discovery library. In a polyglot environment (Java, Python, Go, Node.js), you need a library for each language. The discovery logic is coupled to the client application.
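The selection step in client-side discovery can be sketched as a small round-robin balancer. This is a minimal illustration, not a production library: the instance addresses are made up, and a real client would periodically refresh the list from the registry instead of fixing it at construction time.

```python
import itertools

class RoundRobinBalancer:
    """Client-side load balancing: cycle through instances from the registry."""

    def __init__(self, instances):
        # In a real client, this list would be refreshed from the registry
        # as instances register, deregister, or fail health checks.
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.1.5:8080", "10.0.2.8:8080", "10.0.3.12:8080"])
picks = [balancer.next_instance() for _ in range(4)]
# The fourth pick wraps back around to the first instance.
```

Swapping the selection strategy (random, least connections, latency-based) only changes `next_instance`, which is exactly the flexibility client-side discovery buys you.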
Server-Side Discovery
The client sends requests to a router or load balancer.
The router queries the service registry, selects a healthy instance, and forwards the request.
The client does not know about the registry or the individual instances.
Advantages: Client is decoupled from discovery. No language-specific libraries needed. The router handles load balancing centrally.
Disadvantages: The router is an additional infrastructure component that must be highly available. It adds a network hop and latency.
Kubernetes Services and AWS Elastic Load Balancer are server-side discovery.
Consul with client-side libraries is client-side discovery.
Most production architectures use server-side discovery because it is simpler for application developers.
Approach 1: DNS-Based Service Discovery
DNS is the oldest and simplest form of service discovery.
A service registers its instances as DNS records.
Clients resolve the service name to get the IP addresses of available instances.
For understanding how DNS resolution works end-to-end, From URL to Server: The Junior Engineer's Guide to DNS and Traffic Routing covers the full chain.
How It Works
The service payment-service registers three instances as A records: payment-service.internal resolves to 10.0.1.5, 10.0.2.8, and 10.0.3.12. When the order service calls payment-service.internal, DNS returns all three IPs. The client connects to one of them.
SRV records provide richer information than A records: they include the port number, weight (for weighted load balancing), and priority (for failover ordering).
An SRV query for _payment._tcp.internal returns the host, port, weight, and priority for each instance.
Advantages
- Universal compatibility: Every programming language, framework, and operating system can resolve DNS. No special libraries or SDKs needed. DNS-based discovery works for any application without modification.
- Simple to implement: DNS infrastructure already exists in every network. Adding service records to an existing DNS server requires minimal effort.
- No vendor lock-in: DNS is a standard protocol. You can switch between DNS providers without changing application code.
Limitations
- No health checking: DNS itself has no concept of instance health. A DNS server will continue returning the IP of a crashed instance until someone manually removes the record or an external health checker updates it. Building reliable discovery on DNS requires an external system to manage records.
- TTL caching delays: DNS responses are cached based on TTL. If an instance dies and the DNS record is updated, clients with cached entries will continue connecting to the dead instance until their cache expires. Low TTLs (5 to 30 seconds) reduce this window but increase DNS query volume.
- Limited metadata: DNS records carry minimal information (IP, port, weight, priority). You cannot attach metadata like service version, deployment environment, or custom tags that enable advanced routing decisions.
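The TTL staleness window described above can be simulated with a minimal cache and an injectable clock. This is a sketch of resolver caching behavior, not a real DNS client; the service name, IPs, and TTL are illustrative.

```python
class DnsCache:
    """Minimal DNS cache illustrating the TTL staleness window."""

    def __init__(self, clock):
        self._clock = clock          # injectable time source (seconds)
        self._entries = {}           # name -> (ips, expiry_time)

    def put(self, name, ips, ttl):
        self._entries[name] = (list(ips), self._clock() + ttl)

    def resolve(self, name, authoritative):
        ips, expiry = self._entries.get(name, (None, 0))
        if ips is not None and self._clock() < expiry:
            return ips               # cached answer, possibly stale
        fresh = authoritative(name)  # cache expired: ask upstream
        self.put(name, fresh, ttl=30)
        return fresh

now = [0]
cache = DnsCache(clock=lambda: now[0])
cache.put("payment-service.internal", ["10.0.1.5", "10.0.2.8"], ttl=30)

# Instance 10.0.1.5 dies and the authoritative record is updated...
records = {"payment-service.internal": ["10.0.2.8"]}
lookup = lambda name: records[name]

stale = cache.resolve("payment-service.internal", lookup)  # still returns both IPs
now[0] = 31                                                # TTL expires
fresh = cache.resolve("payment-service.internal", lookup)  # dead IP finally gone
```

Until the TTL expires, the client keeps receiving the dead instance's IP; that window is exactly what low TTLs shrink, at the cost of more DNS queries.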
Approach 2: HashiCorp Consul
Consul is a purpose-built service discovery and configuration platform. It provides a centralized service registry, distributed health checking, a DNS interface for backward compatibility, an HTTP API for programmatic access, and a key-value store for configuration.
How It Works
Every node runs a Consul agent.
When a service starts, its local agent registers the service with the Consul servers (name, IP, port, health check definition, metadata tags).
Consul servers replicate this information across the cluster using the Raft consensus protocol.
When a service needs to discover another service, it queries Consul via DNS (payment-service.service.consul) or the HTTP API (/v1/health/service/payment-service).
Consul returns only healthy instances because it continuously runs health checks against every registered instance.
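A client consuming Consul's HTTP API would filter on check status roughly as below. The payload is a trimmed, illustrative version of what /v1/health/service/<name> returns; real responses carry many more fields per entry.

```python
import json

# Trimmed, illustrative payload in the shape returned by Consul's
# /v1/health/service/payment-service endpoint.
raw = json.dumps([
    {"Service": {"Address": "10.0.1.5", "Port": 8080},
     "Checks": [{"Status": "passing"}]},
    {"Service": {"Address": "10.0.2.8", "Port": 8080},
     "Checks": [{"Status": "critical"}]},
])

def healthy_endpoints(payload):
    """Keep only instances whose checks are all passing."""
    endpoints = []
    for entry in json.loads(payload):
        if all(c["Status"] == "passing" for c in entry["Checks"]):
            svc = entry["Service"]
            endpoints.append(f"{svc['Address']}:{svc['Port']}")
    return endpoints

# Appending ?passing to the query string makes Consul apply this filter
# server-side; the loop above just shows what that filter means.
```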
Health Checking
Consul's distributed health checking is its most significant advantage over DNS. Each agent runs health checks locally on the services registered to its node.
Checks can be HTTP (call an endpoint, expect 200), TCP (attempt a connection), script-based (run a command, check exit code), or TTL-based (the service must heartbeat within a deadline).
When a check fails, the agent immediately updates the service's status in the catalog. Subsequent queries return only healthy instances.
No TTL caching delay. No stale entries pointing to dead instances.
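The TTL-based check mentioned above can be sketched as a deadline the service must keep pushing forward with heartbeats. The clock is injected for clarity, and the interval is illustrative.

```python
class TtlCheck:
    """TTL-style health check: the service must heartbeat within the deadline."""

    def __init__(self, ttl_seconds, clock):
        self._ttl = ttl_seconds
        self._clock = clock
        self._deadline = clock() + ttl_seconds

    def heartbeat(self):
        # Called by the service itself; pushes the deadline forward.
        self._deadline = self._clock() + self._ttl

    def status(self):
        return "passing" if self._clock() < self._deadline else "critical"

now = [0]
check = TtlCheck(ttl_seconds=10, clock=lambda: now[0])
now[0] = 5
check.heartbeat()    # service reports in at t=5; deadline moves to t=15
now[0] = 12
s1 = check.status()  # before the deadline: passing
now[0] = 20
s2 = check.status()  # missed the deadline: marked critical
```

The inversion of control is the point: instead of the agent probing the service, the service proves liveness by heartbeating, which suits workloads that cannot expose an HTTP or TCP check.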
Multi-Datacenter Federation
Consul supports service discovery across multiple datacenters natively. Each datacenter runs its own Consul cluster.
Clusters are federated via a WAN gossip pool that links servers across datacenters.
A service in datacenter A can discover and connect to a service in datacenter B.
This is critical for disaster recovery and geographic distribution.
If all instances of a service fail in one datacenter, clients can automatically discover instances in another datacenter.
When to Use Consul
Consul is the right choice when you operate across multiple platforms (Kubernetes + VMs + serverless), need multi-datacenter service discovery, require rich health checking beyond simple DNS, need the key-value store for dynamic configuration, or want a service mesh (Consul Connect provides mTLS and access control between services).
Approach 3: Kubernetes-Native Service Discovery
Kubernetes has built-in service discovery that requires no additional infrastructure. For understanding the Kubernetes architecture that enables this, System Design Interview: The Architecture of Kubernetes (Control Plane vs. Worker Nodes) covers the control plane components.
How It Works
When you create a Kubernetes Service, Kubernetes assigns it a stable virtual IP (ClusterIP) and a DNS name.
The DNS name follows the pattern <service-name>.<namespace>.svc.cluster.local.
CoreDNS (Kubernetes' built-in DNS server) resolves this name to the Service's ClusterIP. kube-proxy on each node programs iptables or IPVS rules to forward traffic from the ClusterIP to one of the healthy pod IPs backing the Service.
The Service acts as a stable abstraction over a set of pods.
Pods can be created and destroyed. Their IPs change.
But the Service's DNS name and ClusterIP remain constant. Clients always connect to the same DNS name, and Kubernetes handles routing to healthy pods.
Headless Services
For client-side discovery in Kubernetes, you can create a headless Service (ClusterIP set to None). Instead of returning a single virtual IP, DNS returns the individual pod IPs.
The client receives all pod IPs and implements its own load balancing.
gRPC clients commonly use headless Services because gRPC maintains long-lived connections and needs to balance across multiple pods, which a single ClusterIP with round-robin does not handle well.
Health Checking
Kubernetes health checking uses liveness probes (is the container alive?), readiness probes (is the container ready to serve traffic?), and startup probes (has the container finished initializing?).
A pod that fails its readiness probe is removed from the Service's endpoint list.
Traffic stops flowing to it. When the pod becomes ready again, it is re-added.
This is integrated directly into the platform. No additional agents or configuration.
Health checking is defined in the pod's deployment manifest alongside the application configuration.
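On the application side, a readiness endpoint might look like the sketch below. The /healthz/ready path is an assumption; whatever path you choose must match the readinessProbe httpGet.path configured in the pod spec.

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = threading.Event()  # flipped once startup and dependency checks complete

class Probe(BaseHTTPRequestHandler):
    def do_GET(self):
        # Path is illustrative; it must match the readinessProbe in the manifest.
        if self.path == "/healthz/ready" and READY.is_set():
            self.send_response(200)
        else:
            self.send_response(503)  # kubelet removes the pod from endpoints
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Probe)  # port 0: OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/healthz/ready"

def probe(url):
    try:
        return urllib.request.urlopen(url).status
    except urllib.error.HTTPError as e:
        return e.code

before = probe(url)   # 503: not ready, pod receives no traffic
READY.set()           # dependencies are up; flip to ready
after = probe(url)    # 200: pod is re-added to the endpoint list
server.shutdown()
```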
Limitations
Single-cluster only. Kubernetes DNS works within one cluster. Cross-cluster service discovery requires additional tooling (Consul, external DNS, or Kubernetes federation).
No VM or serverless support. Kubernetes-native discovery only works for workloads running as pods in the cluster. If some services run on VMs or as serverless functions, they cannot participate in Kubernetes DNS natively.
Limited metadata in DNS. Kubernetes DNS returns IP addresses but cannot attach custom metadata to service entries. Advanced routing based on service version, canary weights, or custom labels requires a service mesh (Istio, Linkerd) on top of Kubernetes.
Choosing the Right Approach
- Kubernetes-only environment, single cluster: Use Kubernetes-native discovery. It is built-in, requires no additional infrastructure, and integrates with health checking natively. This covers the majority of use cases for teams running entirely on Kubernetes.
- Multi-platform (Kubernetes + VMs + bare metal): Use Consul. Its agent-based model works on any platform, and its DNS interface provides backward compatibility with applications that cannot use Consul's API directly.
- Multi-datacenter or multi-cloud: Use Consul with WAN federation. Kubernetes DNS does not span clusters. Consul's multi-datacenter federation enables cross-region service discovery without custom solutions.
- Simple, small-scale deployment: DNS-based discovery with low TTLs and external health checking may be sufficient. It requires no additional infrastructure beyond your existing DNS server.
- Service mesh requirements: Consul Connect or Istio/Linkerd on Kubernetes. Service meshes add mTLS, traffic policies, and observability on top of discovery. Choose based on your platform.
Service Registration Patterns
How services get into the registry is as important as how they are discovered.
Self-Registration
The service instance registers itself with the registry on startup and deregisters on shutdown. This is the pattern Consul uses: each service calls the Consul agent's registration API during initialization.
The service is responsible for its own lifecycle in the registry.
Advantage: No external dependency for registration. The service knows when it is ready.
Disadvantage: Registration logic is coupled to the application. Every service in every language must implement registration.
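Self-registration against Consul's agent API can be sketched as building a payload like the one below on startup. The field names follow Consul's documented registration schema, but the service details and intervals are illustrative, and the HTTP call itself is only indicated in comments.

```python
def registration_payload(name, address, port):
    """Body for PUT http://localhost:8500/v1/agent/service/register."""
    return {
        "Name": name,
        "ID": f"{name}-{address}-{port}",   # must be unique per instance
        "Address": address,
        "Port": port,
        "Check": {
            # The agent calls this endpoint every Interval; non-200 -> unhealthy.
            "HTTP": f"http://{address}:{port}/health",
            "Interval": "10s",
            # Safety net if the instance dies without deregistering.
            "DeregisterCriticalServiceAfter": "1m",
        },
    }

body = registration_payload("payment-service", "10.0.1.5", 8080)
# On startup: PUT the JSON-encoded body to the local agent.
# In the shutdown hook: PUT /v1/agent/service/deregister/<ID>.
```

This is the coupling the section describes: every service, in every language, carries some version of this registration and deregistration logic.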
Third-Party Registration
An external component (the platform) handles registration automatically.
The service does not know about the registry. This is how Kubernetes works: when a pod is created, the Kubernetes control plane automatically adds its IP to the corresponding Service's endpoint list. When the pod is deleted, Kubernetes removes it.
Advantage: Zero registration code in the application. Services are unaware of the discovery mechanism.
Disadvantage: Depends on the platform. Services running outside Kubernetes (VMs, serverless) cannot use this pattern without additional tooling.
Sidecar Registration
A sidecar proxy (Envoy, Consul Connect proxy) handles registration, health checking, and discovery on behalf of the application. The application communicates only with localhost. The sidecar handles all service mesh concerns. This is the pattern used by Istio, Linkerd, and Consul Connect.
Advantage: Application is completely decoupled from discovery. Works in polyglot environments.
Disadvantage: Adds a sidecar container to every pod, consuming additional CPU and memory.
DNS Record Types for Service Discovery
Understanding DNS record types is important for implementing DNS-based discovery correctly.
- A records map a hostname to an IP address. The simplest form of service discovery: payment-service.internal → 10.0.1.5. Multiple A records for the same name enable basic round-robin load balancing. The DNS client receives all IPs and typically connects to the first one (or a random one, depending on the resolver).
- SRV records provide richer information: priority, weight, port, and target hostname. SRV records enable weighted load balancing (send 70% of traffic to one instance, 30% to another) and priority-based failover (prefer primary instances, fall back to secondary). Consul and Kubernetes CoreDNS both support SRV queries.
- CNAME records alias one domain name to another. Useful for pointing a friendly service name to a load balancer's hostname: payment.api.internal → payment-lb.us-east.elb.amazonaws.com. The CNAME is resolved to the load balancer's A records, which resolve to the actual IPs.
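Consuming SRV records can be sketched as: filter to the lowest priority, then pick within that group proportionally to weight. This is a simplification of the full RFC 2782 selection algorithm, and the record values are illustrative.

```python
import random
from collections import namedtuple

# Shape of an SRV answer for _payment._tcp.internal.
SRV = namedtuple("SRV", "priority weight host port")

records = [
    SRV(10, 70, "pay-1.internal", 8080),    # primary, ~70% of traffic
    SRV(10, 30, "pay-2.internal", 8080),    # primary, ~30% of traffic
    SRV(20, 100, "pay-dr.internal", 8080),  # failover: used only if primaries vanish
]

def pick(records):
    """Lowest-priority group wins; choose within it by weight."""
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    return random.choices(group, weights=[r.weight for r in group])[0]

chosen = pick(records)  # always one of the priority-10 records
# If the primaries disappear from the answer, selection falls through
# to the priority-20 disaster-recovery record.
fallback = pick([r for r in records if r.priority == 20])
```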
Service Discovery and Service Mesh
Service meshes build on service discovery by adding security (mTLS between services), traffic management (canary deployments, circuit breaking, retries), and observability (distributed tracing, metrics) to the communication layer.
Istio on Kubernetes uses the Kubernetes API server as its service registry. Envoy sidecar proxies intercept all traffic and apply routing rules defined by Istio's control plane. Discovery is Kubernetes-native; the mesh adds the security and traffic management layer.
Consul Connect uses Consul's service catalog as the registry and deploys sidecar proxies for mTLS and access control. Because Consul works across platforms (Kubernetes, VMs, serverless), Consul Connect provides service mesh capabilities in heterogeneous environments where Istio (Kubernetes-only) cannot reach.
Linkerd is a lightweight Kubernetes-native service mesh that uses Kubernetes Services for discovery and adds mTLS, retries, and observability with minimal configuration overhead.
For teams already running Kubernetes-native discovery, adding a service mesh is an incremental step that preserves the existing discovery model while adding security and traffic management.
Real-World Patterns
Single Kubernetes Cluster (Most Common)
Use Kubernetes Services with CoreDNS.
Define readiness probes for health checking.
Use headless Services for gRPC.
No additional infrastructure needed. This setup covers the large majority of microservices architectures.
Kubernetes + Legacy VMs
Deploy Consul alongside Kubernetes.
Kubernetes services register with Consul via the Consul-Kubernetes sync feature.
VM services register directly with Consul agents.
Both Kubernetes and VM services discover each other through Consul's DNS or HTTP API.
Multi-Region Active-Active
Deploy Consul clusters in each region with WAN federation. Services register locally. Cross-region discovery enables failover: if all instances in Region A fail, clients discover instances in Region B through Consul's federated catalog.
Health checks run locally in each region, and only healthy instances are returned.
Hybrid Cloud (AWS + On-Premises)
Consul agents run in both environments.
On-premises services register with the on-premises Consul cluster.
AWS services register with the AWS Consul cluster.
WAN federation connects both clusters.
Services discover each other regardless of where they run.
This pattern is common in enterprises migrating from on-premises to cloud, where some services have moved to Kubernetes on AWS while others remain on legacy VMs in the data center.
Consul provides a unified discovery layer that bridges both worlds during the transition.
How This Shows Up in System Design Interviews
Service discovery comes up when interviewers ask about microservices communication, scaling, or deployment architecture.
Here is how to present it:
"For service-to-service communication, I would use Kubernetes Services for discovery within the cluster. Each microservice is exposed via a ClusterIP Service with a stable DNS name. Readiness probes ensure only healthy pods receive traffic. For the gRPC service that needs client-side load balancing across multiple pods, I would use a headless Service so the client receives all pod IPs. If the architecture spans multiple clusters or includes non-Kubernetes workloads, I would add Consul for cross-platform discovery with health checking and WAN federation for multi-region failover."
For structured practice on system design problems involving service communication patterns, the Grokking the System Design Interview course covers microservices architecture and infrastructure patterns.
Common Mistakes
- Hardcoding IP addresses or hostnames in application code. Even in Kubernetes, hardcoding pod IPs breaks when pods restart. Always use service names and let the discovery mechanism resolve them.
- Using DNS with high TTLs for dynamic environments. A TTL of 300 seconds means a dead instance receives traffic for up to 5 minutes. In environments where instances scale frequently, use TTLs of 5 to 30 seconds or switch to a registry-based approach like Consul or Kubernetes Services.
- Ignoring the gRPC load balancing problem. gRPC uses long-lived HTTP/2 connections. A Kubernetes ClusterIP Service load-balances at connection time, not per-request. Once a gRPC client establishes a connection to one pod, all requests go to that pod. Use headless Services with client-side load balancing or a service mesh for gRPC.
- Deploying Consul when Kubernetes-native discovery suffices. If all your services run in a single Kubernetes cluster, Consul adds operational complexity without significant benefit. Kubernetes Services handle discovery, health checking, and load balancing natively.
- Not implementing graceful shutdown. When an instance shuts down, it should deregister from the service registry before stopping. Without graceful shutdown, the registry continues routing traffic to the dying instance until health checks detect the failure. In Kubernetes, the pre-stop hook and readiness probe handle this automatically if configured correctly.
- No fallback when the registry is unavailable. If Consul's servers are temporarily unreachable, client applications should cache the last known good set of endpoints and continue operating. A service discovery outage should not cascade into a full system outage.
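The registry-outage fallback can be sketched as a thin wrapper that caches the most recent successful lookup. This is a sketch under one assumption: the underlying lookup function raises an exception when the registry is unreachable.

```python
class CachedDiscovery:
    """Serve the last known good endpoints when the registry is down."""

    def __init__(self, lookup):
        self._lookup = lookup      # e.g. a Consul or DNS query; raises on failure
        self._last_good = None

    def endpoints(self, service):
        try:
            self._last_good = self._lookup(service)
        except Exception:
            if self._last_good is None:
                raise              # never succeeded: nothing to fall back on
            # Registry outage: degrade gracefully with stale-but-usable data.
        return list(self._last_good)

registry_up = {"up": True}
def lookup(service):
    if not registry_up["up"]:
        raise ConnectionError("registry unreachable")
    return ["10.0.1.5:8080", "10.0.2.8:8080"]

disco = CachedDiscovery(lookup)
first = disco.endpoints("payment-service")   # fresh from the registry
registry_up["up"] = False                    # registry outage begins
cached = disco.endpoints("payment-service")  # stale, but the app keeps running
```

The stale data may route a few requests to dead instances, but that is strictly better than turning a registry outage into a full system outage.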
Conclusion: Key Takeaways
- Service discovery replaces hardcoded addresses with dynamic lookups. Services register their instances, and clients query the registry to find healthy endpoints. This is essential for any microservices architecture.
- Two patterns: client-side and server-side. Client-side (client queries registry directly) gives more control. Server-side (router handles discovery) is simpler for applications. Kubernetes Services are server-side. Consul supports both.
- DNS is universal but limited. No health checking, TTL caching delays, limited metadata. Works for simple, small-scale deployments.
- Consul is the multi-platform, multi-datacenter choice. Agent-based, distributed health checking, DNS and HTTP interfaces, WAN federation, key-value store. Use when spanning Kubernetes, VMs, and multiple regions.
- Kubernetes-native discovery is the default for single-cluster Kubernetes. Built-in, zero additional infrastructure, integrated health checking. Use headless Services for gRPC client-side load balancing.
- In interviews, match the approach to the infrastructure. Kubernetes-only? Native Services. Multi-platform? Consul. gRPC? Headless Services. Multi-region? Consul federation. Name the specific pattern and tool.