How do you set up synthetic monitoring and availability probes?

Synthetic monitoring simulates user actions from outside your system to validate availability and performance. Availability probes operate inside your infrastructure to determine if an instance or service should receive traffic. Used together, they provide both external and internal visibility into system health, improving reliability and reducing downtime.

Why It Matters

In a system design interview, understanding synthetic monitoring and probes shows you can think beyond building features. You demonstrate readiness for real-world reliability challenges. Synthetic checks reveal user-facing issues like certificate expiry or CDN failure. Probes ensure that unhealthy instances are removed from load balancers automatically. Together, they form the foundation of an observable and resilient architecture.

How It Works (Step-by-Step)

1. Define Critical Journeys

Identify core user flows such as login, checkout, or video playback. These are the “golden paths” that synthetic checks will simulate.

2. Choose Check Types

HTTP Checks: Validate URL reachability and expected responses.
API Checks: Test request-response correctness with sample payloads.
Browser Journeys: Use headless browsers to perform real clicks and flows.
Network Checks: Measure DNS, TCP, and TLS handshake success.
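As an illustration of the HTTP check type, here is a minimal Python sketch that uses only the standard library. The function names (`evaluate`, `http_check`), the 2000 ms latency budget, and the expected-text rule are assumptions made for this example, not part of any monitoring product.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    reason: str


def evaluate(status: int, body: str, elapsed_ms: float,
             expect_text: str, max_ms: float = 2000.0) -> CheckResult:
    """Pure decision logic: status code first, then content, then latency."""
    if status != 200:
        return CheckResult(False, f"unexpected status {status}")
    if expect_text not in body:
        return CheckResult(False, "expected text missing")
    if elapsed_ms > max_ms:  # assumed latency budget for this sketch
        return CheckResult(False, f"slow response: {elapsed_ms:.0f} ms")
    return CheckResult(True, "ok")


def http_check(url: str, expect_text: str, timeout_s: float = 5.0) -> CheckResult:
    """Fetch the URL once and evaluate the response."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:  # server answered with 4xx/5xx
        status, body = exc.code, ""
    except OSError as exc:  # DNS, TCP, TLS, and timeout failures
        return CheckResult(False, f"request failed: {exc}")
    elapsed_ms = (time.monotonic() - start) * 1000
    return evaluate(status, body, elapsed_ms, expect_text)
```

Keeping the decision logic in `evaluate` separate from the network call makes the pass/fail rules easy to unit test without touching a live endpoint.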

3. Select Vantage Points

Run synthetic tests from multiple regions and networks. This helps detect geo-specific issues and routing errors.

4. Configure Frequency and Timeouts

Run critical flows every minute and less critical checks every five minutes. Set each timeout above the flow's normal worst-case latency so that a slow but successful response is not reported as an outage.

5. Define Alerting Rules

Alert only after multiple consecutive failures across different regions. Use aggregation and suppression to minimize noise.
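The "consecutive failures across different regions" rule can be sketched as a small pure function. The thresholds (three consecutive failures, at least two regions) and the region names in the usage example are illustrative assumptions.

```python
from typing import Dict, List


def failing_regions(history: Dict[str, List[bool]], consecutive: int = 3) -> List[str]:
    """Return regions whose most recent `consecutive` checks all failed.

    `history` maps a region name to check outcomes, oldest first (True = pass).
    """
    failing = []
    for region, outcomes in history.items():
        recent = outcomes[-consecutive:]
        if len(recent) == consecutive and not any(recent):
            failing.append(region)
    return sorted(failing)


def should_alert(history: Dict[str, List[bool]],
                 consecutive: int = 3, min_regions: int = 2) -> bool:
    """Alert only when enough regions are failing at the same time."""
    return len(failing_regions(history, consecutive)) >= min_regions


history = {
    "us-east": [True, False, False, False],
    "eu-west": [False, False, False],
    "ap-south": [True, True, False],
}
print(failing_regions(history), should_alert(history))
```

A single failing region does not page anyone under this rule, which filters out problems local to one probe location at the cost of slower detection for truly regional outages.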

6. Build Health Endpoints for Probes

Liveness Probe: Checks that the app process is running and still responsive, not hung or deadlocked.
Readiness Probe: Confirms if the service is ready to accept traffic.
Startup Probe: Ensures slow-booting services are not restarted prematurely.
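A minimal sketch of the three endpoints, using Python's standard `http.server`. The paths (`/livez`, `/readyz`, `/startupz`), the `STATE` flags, and the port are hypothetical choices for this example; a real service would set the flags from its own startup and cache-warming code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical process-wide state, updated by the application as it boots.
STATE = {"started": False, "cache_warm": False}


def route(path: str, state: dict) -> tuple:
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/livez":
        # Liveness: being able to answer at all shows the process is not hung.
        return 200, {"status": "alive"}
    if path == "/startupz":
        return (200 if state["started"] else 503), {"started": state["started"]}
    if path == "/readyz":
        # Readiness: only report ready once startup and cache warm-up are done.
        ready = state["started"] and state["cache_warm"]
        return (200 if ready else 503), {"ready": ready}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path, STATE)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the access log


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Blocking server loop; host and port are assumptions for this sketch."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

The responses carry only a boolean status, so the endpoints reveal nothing sensitive about the service's internals.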

7. Integrate with Load Balancers and Orchestrators

Orchestrators like Kubernetes use readiness probes to decide whether an instance receives traffic and liveness probes to decide when to restart it. Configure intervals and thresholds to prevent flapping.
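To see why thresholds prevent flapping, the sketch below models how consecutive-result thresholds smooth a noisy probe signal. It is a simplified model written for this article, not Kubernetes' implementation; the attribute names loosely mirror the Kubernetes probe fields `failureThreshold` and `successThreshold`, and the default values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # consecutive failures before marking unhealthy
    success_threshold: int = 2   # consecutive successes before marking healthy
    healthy: bool = True
    _streak: int = 0             # current run of results that contradict `healthy`

    def observe(self, passed: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if passed == self.healthy:
            self._streak = 0     # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if passed else self.failure_threshold
        if self._streak >= needed:
            self.healthy = passed
            self._streak = 0
        return self.healthy


tracker = ProbeTracker()
states = [tracker.observe(r) for r in [False, True, False, False, False, True, True]]
print(states)
```

In this run the isolated first failure does not change the state; only the three consecutive failures mark the instance unhealthy, and two consecutive successes bring it back.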

8. Automate Pre- and Post-Deployment Checks

Run synthetic tests before deployment to ensure baseline health, and after rollout to catch regressions early.

9. Trend and Analyze Data

Track metrics such as availability percentage, latency distribution, and regional failure rate. Use this data to adjust Service Level Objectives (SLOs).
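As a rough sketch of the trending step, the functions below compute an availability percentage and latency percentiles from raw check results. The nearest-rank percentile method and the `(passed, latency_ms)` result shape are assumptions chosen for simplicity; monitoring backends often use interpolated or histogram-based percentiles instead.

```python
import math
from typing import List, Optional, Tuple


def percentile(values: List[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: List[Tuple[bool, float]]) -> dict:
    """Summarize (passed, latency_ms) pairs collected from one check."""
    if not results:
        raise ValueError("no results to summarize")
    # Latency is only meaningful for checks that completed successfully.
    latencies = [ms for passed, ms in results if passed]
    availability = 100.0 * len(latencies) / len(results)
    return {
        "availability_pct": round(availability, 2),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }


# 19 passing checks with latencies 1..19 ms, plus one failure.
sample = [(True, float(i)) for i in range(1, 20)] + [(False, 0.0)]
print(summarize(sample))
```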

Real-World Example

Consider a large e-commerce platform such as Amazon’s as an illustration. Synthetic checks simulate “add to cart” and “checkout” every minute from different continents. Availability probes run against each service instance, marking it “ready” only after caches are loaded and essential dependencies are healthy. When a data center experiences latency spikes, synthetic checks flag the issue from the outside, while failing readiness probes let the orchestrator (Kubernetes, for example) stop routing traffic to unhealthy pods automatically.

Common Pitfalls or Trade-offs

Overly Strict Probes: Services may restart unnecessarily during transient dependency failures.
Unrealistic Synthetic Flows: Mocked logins or bypassed security can hide real user issues.
Single Region Monitoring: Misses regional failures or CDN misconfigurations.
Alert Fatigue: Poorly tuned alert thresholds create noise and desensitize teams.
Blind Spots: Synthetic checks don’t reflect load conditions; combine with telemetry and tracing.

Interview Tip

Interviewers often test your operational mindset by asking, “How would you detect an expired TLS certificate before users are affected?” A great answer is to describe an automated daily TLS synthetic check that triggers alerts 30 days before expiry and blocks deployment if under 7 days.
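A hedged sketch of such a TLS check in Python follows. It uses the standard `ssl` and `socket` modules; the function names and the `"ok"` / `"alert"` / `"block_deploy"` labels are hypothetical, while the 30-day and 7-day thresholds come from the answer above.

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jun  1 12:00:00 2030 GMT'.

    An already expired certificate fails verification during the handshake,
    which raises an ssl.SSLError; that failure is itself a detection.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between `now_epoch` (seconds since the epoch) and expiry."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400


def classify(days_left: float, warn_days: int = 30, block_days: int = 7) -> str:
    """Map remaining validity to an action label (labels are hypothetical)."""
    if days_left < block_days:
        return "block_deploy"
    if days_left <= warn_days:
        return "alert"
    return "ok"
```

A daily job could call `fetch_not_after` for each public hostname, pass the result with `time.time()` to `days_until_expiry`, and feed `classify` into both the alerting system and the deployment pipeline.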

Key Takeaways

Combine synthetic monitoring (external) and availability probes (internal) for complete coverage.
Run checks from multiple regions to detect user-facing issues early.
Tune probes to prevent flapping or unnecessary restarts.
Store, trend, and visualize metrics to improve reliability over time.
Automate pre-release and post-release synthetic validations.

Comparison Table

Approach	Primary Goal	Signal Source	Strengths	Limitations	Best Use Case
Synthetic Monitoring	Validate real user journeys	External agents	Detects DNS, TLS, CDN issues	May miss load-induced problems	Top user flows and post-deploy checks
Availability Probes	Control routing and restarts	Internal service	Fast failure detection	Limited to internal view	Managing service health in clusters
Real User Monitoring	Measure actual experience	User devices	True user perspective	Slow detection	Long-term UX trends
Tracing & Metrics	Explain internal causes	Application telemetry	Deep insights	Needs instrumentation	Debugging and performance tuning

FAQs

Q1. What is the main difference between liveness and readiness probes?

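Liveness checks whether the process is still running and responsive, while readiness confirms whether the instance can serve traffic. A failed liveness probe triggers a restart; a failed readiness probe only removes the instance from routing. Use liveness for recovery, readiness for routing.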

Q2. How often should I run synthetic checks?

Run top business flows every minute from at least two regions, and secondary flows every five minutes for balanced coverage.

Q3. Can probes check external dependencies?

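Ideally, no. A liveness probe that depends on an external service can restart healthy instances whenever that service has a transient failure. Readiness should verify only what’s essential for safe traffic. Broader external dependency checks belong in synthetic monitoring.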

Q4. How can I prevent false positives in synthetic checks?

Run tests from multiple regions, randomize timing, require multiple consecutive failures, and verify both status code and response content.

Q5. What metrics should I collect from probes and synthetic checks?

Collect uptime percentage, latency percentiles (p95/p99), regional distribution, and failure cause categories such as DNS, TLS, or application errors.

Q6. How do I secure health endpoints?

Expose only minimal information, restrict access to internal networks, and avoid sensitive data in probe responses.

Further Learning

Strengthen your reliability and observability foundation in Grokking System Design Fundamentals.
Learn to reason about trade-offs, monitoring, and alerting at scale in Grokking the System Design Interview.
Dive deeper into distributed observability and resilience strategies in Grokking Scalable Systems for Interviews.