How would you build graph traversal APIs for recommendations?
Building great recommendations from a graph is less about flashy algorithms and more about predictable latency, good edge data, and a clean API contract. Below is a practical guide that teaches you how to design traversal-based recommendation APIs that scale in production and shine in a system design interview.
Introduction
A graph traversal API returns entities that are near a source node through paths that satisfy constraints: for example, users near a user via follow edges, or products near a product via co-viewed edges. The service accepts a query, such as source nodes, edge types, hop count, fanout caps, filters, and ranking signals, and produces a ranked list ready to serve. Think of it as a small path search engine that runs inside your recommendation stack.
Why It Matters
Recommendation quality depends on how well you can explore the local neighborhood of a user or an item while controlling cost. Traversals give you controllable exploration. You can pick hop limits for freshness, filter by edge labels for intent, and cap fanout for stable latency. Interviewers like this topic because it forces you to balance algorithmic thinking with serving concerns such as quotas, caching, and fallbacks. If you want a structured warm-up before attempting full-scale designs, the graph patterns section in Grokking System Design Fundamentals is a great place to start.
How It Works Step by Step
Step 1: Define the graph schema
Identify node types such as user, item, and topic, and edge types such as follows, likes, purchases, and co-watched. Store direction, weight, and timestamps. Add attributes needed for filtering, such as language, region, and maturity rating.
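As a sketch, the schema could be captured with a pair of dataclasses; the field names and edge types below are illustrative assumptions, not a fixed standard:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                              # e.g. "user", "item", "topic"
    attrs: dict = field(default_factory=dict)   # language, region, maturity rating

@dataclass
class Edge:
    src: str
    dst: str                 # direction: src -> dst
    edge_type: str           # e.g. "follows", "likes", "purchases", "co_watched"
    weight: float            # interaction strength
    created_at: float        # unix timestamp, used for time decay and recency

# Example instances
user = Node("u1", "user", {"language": "en", "region": "PK"})
follow = Edge("u1", "u2", "follows", weight=1.0, created_at=1700000000.0)
```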
Step 2: Choose traversal patterns
Pick common patterns and map them to use cases.
- One hop: friends of a user, for social suggestions
- Two hop: user to item to user, to surface similar users
- Item to item via co-occurrence, for related products
- Meta path: user to item to category to item, to mix relevance with diversity
Step 3: Bound the search
Set hard limits to keep tail latency predictable.
- Max hops: often one or two for online serving
- Per-hop fanout caps, such as top N neighbors by weight or recency
- Path stop conditions, such as a score below threshold or an already-seen node
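These bounds can be sketched as a capped breadth-first traversal. Everything here, including the adjacency-dict shape and the multiplicative path score, is a simplifying assumption for illustration:

```python
def bounded_traverse(
    graph: dict[str, list[tuple[str, float]]],  # node -> neighbors sorted by weight desc
    sources: list[str],
    max_hops: int,
    fanout_per_hop: list[int],
    min_score: float = 0.0,
) -> dict[str, float]:
    """Breadth-first expansion with per-hop fanout caps and early exit."""
    seen = set(sources)
    frontier = [(s, 1.0) for s in sources]
    scores: dict[str, float] = {}
    for hop in range(max_hops):
        cap = fanout_per_hop[hop]
        next_frontier = []
        for node, score in frontier:
            for neighbor, weight in graph.get(node, [])[:cap]:  # top-N neighbors only
                path_score = score * weight
                if neighbor in seen or path_score < min_score:  # stop conditions
                    continue
                seen.add(neighbor)
                scores[neighbor] = max(scores.get(neighbor, 0.0), path_score)
                next_frontier.append((neighbor, path_score))
        frontier = next_frontier
    return scores

graph = {"u1": [("u2", 0.9), ("u3", 0.5), ("u4", 0.1)], "u2": [("u5", 0.8)]}
scores = bounded_traverse(graph, ["u1"], max_hops=2, fanout_per_hop=[2, 1])
```

With a fanout cap of two on the first hop, the weakest neighbor (`u4`) is never visited, which is exactly the trade-off the caps buy you: predictable cost in exchange for possibly missing low-weight edges.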
Step 4: Score and rank
Combine signals along the path.
- Edge weight and age with time decay
- Path length penalty, so shorter paths get a bonus
- Global item quality, such as popularity or safety
- Diversity boost to avoid near duplicates

Use a simple additive or multiplicative formula that you can explain and tune.
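One such explainable formula is sketched below; the half-life and length-penalty constants are arbitrary starting points you would tune offline, not recommended values:

```python
def path_score(
    edge_weight: float,
    edge_age_days: float,
    path_length: int,
    item_quality: float,
    diversity_bonus: float = 0.0,
    half_life_days: float = 30.0,   # tunable: edge strength halves every 30 days
    length_penalty: float = 0.7,    # tunable: discount per extra hop
) -> float:
    """Blend of decayed edge strength, a path-length discount,
    global item quality, and a diversity boost."""
    decay = 0.5 ** (edge_age_days / half_life_days)            # time decay
    strength = edge_weight * decay * length_penalty ** (path_length - 1)
    return strength + item_quality + diversity_bonus
```

Because every term is separable, you can explain to an interviewer (or a debugging teammate) exactly why one candidate outranked another.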
Step 5: Decide online versus precompute
Online traversal gives fresh results with higher tail risk. Precomputed neighbor lists kept in a feature store or cache give fast reads. Many teams mix the two: precompute a candidate pool per entity each hour, then apply a lightweight traversal or filter at request time.
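The hybrid approach might merge the two candidate sources like this; the freshness boost and the max-score tie-breaking rule are illustrative choices, not a standard recipe:

```python
def blend_candidates(
    precomputed: dict[str, float],   # hourly batch pool: id -> score
    online: dict[str, float],        # fresh traversal results: id -> score
    freshness_boost: float = 0.2,
    limit: int = 10,
) -> list[tuple[str, float]]:
    """Merge a batch candidate pool with online traversal results,
    boosting anything the fresh traversal also found."""
    merged = dict(precomputed)
    for node, score in online.items():
        merged[node] = max(merged.get(node, 0.0), score) + freshness_boost
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:limit]

top = blend_candidates({"a": 0.5, "b": 0.4}, {"b": 0.3, "c": 0.6}, limit=3)
```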
Step 6: Design the API contract
Keep it explicit and guess-free.

Request:
- sources: list of node ids
- edge_types: list, such as follows or co_viewed
- hops: integer
- fanout_per_hop: list of integers
- filters: key-value pairs, such as region equals PK or language equals en
- limit: integer

Response:
- items: list of node ids with score and path samples for debugging
- metadata: which edge types and counters were used
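In Python, this contract could be pinned down with typed dictionaries; the field names follow the lists above, but the exact shapes are a design choice rather than a standard:

```python
from typing import TypedDict

class TraversalRequest(TypedDict):
    sources: list[str]
    edge_types: list[str]
    hops: int
    fanout_per_hop: list[int]
    filters: dict[str, str]
    limit: int

class ScoredItem(TypedDict):
    id: str
    score: float
    path_sample: list[str]    # one example path, kept for debugging

class TraversalResponse(TypedDict):
    items: list[ScoredItem]
    metadata: dict[str, int]  # edge types used, counters such as edges_examined

request: TraversalRequest = {
    "sources": ["u1"],
    "edge_types": ["follows", "co_viewed"],
    "hops": 2,
    "fanout_per_hop": [10, 5],
    "filters": {"region": "PK", "language": "en"},
    "limit": 20,
}
```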
Step 7: Storage and access paths
Use sharded adjacency lists keyed by node id and edge type. Keep top neighbors sorted by weight and time. For mixed filters, maintain small inverted indexes, such as region-to-neighbor-id bitsets. For higher flexibility, mirror a subset in a search store to support attribute filters and pagination.
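A minimal sketch of the keying and sharding scheme, assuming a generic key-value store; the key format and shard count here are arbitrary examples:

```python
import hashlib

def adjacency_key(node_id: str, edge_type: str) -> str:
    """Key for one sorted adjacency list in a sharded KV store."""
    return f"adj:{edge_type}:{node_id}"

def shard_for(node_id: str, num_shards: int = 64) -> int:
    """Stable hash on node id, so every edge type of a node
    lands on the same shard and one-node reads stay local."""
    digest = hashlib.md5(node_id.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Hashing only the node id (not the edge type) is what co-locates all of a node's adjacency lists, which matters later when you try to contain cross-shard fanout.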
Step 8: Caching and freshness
Cache hot traversals by request signature. Use a short TTL for user-based queries and a slightly longer TTL for item-based queries. Invalidate on important edge changes, such as a new follow or a delete. Prewarm caches for trending items.
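The request signature can be computed canonically so that logically identical requests hit the same cache entry; the hash truncation and TTL numbers below are placeholder assumptions:

```python
import hashlib
import json

def request_signature(request: dict) -> str:
    """Canonical cache key: sort keys so logically equal
    requests serialize identically and collide in the cache."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def ttl_seconds(query_kind: str) -> int:
    """Shorter TTL for user-based queries, longer for item-based ones."""
    return 60 if query_kind == "user" else 300
```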
Step 9: Latency control
Set per-hop timeouts and hard caps. Fire hedged reads to alternate replicas if a shard is slow. Return partial results when a hop times out, and mark them as partial so ranking can compensate.
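A toy version of a hedged read using a thread pool; a production client would also cancel the losing request, which this sketch skips (the executor shutdown still waits for stragglers):

```python
import concurrent.futures as cf

def hedged_read(primary, backup, hedge_after: float = 0.02, timeout: float = 0.5):
    """Issue the primary read; if it has not answered within hedge_after
    seconds, fire the same read at a backup replica and take whichever
    finishes first. Returns None if nothing finishes within timeout."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(primary)]
        done, _ = cf.wait(futures, timeout=hedge_after)
        if not done:
            futures.append(pool.submit(backup))   # hedge to the replica
        done, _ = cf.wait(futures, timeout=timeout,
                          return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result() if done else None
```

A fast primary never triggers the hedge, so the extra replica load only appears on the slow tail, which is the whole point of hedging.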
Step 10: Safety and policy
Apply allow and deny filters before ranking. Respect age, region, and sensitive topic constraints. Maintain an audit trail of which filters removed which candidates.
Step 11: Observability and evaluation
Log per-query counts such as nodes touched, edges examined, cache hit rate, and time per hop. Track recall against a held-out evaluation set and business metrics in A/B tests. Keep a replay tool that can re-run a production query against a past snapshot.
Real World Example
Consider Instagram-like people recommendations. The service wants to suggest accounts to follow. The schema has user nodes and edges for follows, mutual interactions, and profile co-views. The online request asks for a two-hop traversal from a user with edge types follows and interactions, fanout ten on the first hop and five on the second. The service fetches top neighbors by recency-weighted follow edges, mixes in interaction edges for stronger ties, then walks to their top followed accounts. It filters out accounts already followed and applies safety rules, then ranks by a blend of path strength, creator quality, and diversity across categories. To keep latency steady, it precomputes top creator neighbors per user daily and uses online traversal only to add a freshness boost from the past twenty-four hours.
Common Pitfalls and Trade-offs
- Fanout explosion leads to timeouts and noisy results. Fix with per-hop caps, early exit when the cumulative score drops, and precomputed neighbor lists.
- Over-personalization creates echo chambers. Introduce diversity boosts and meta paths that route through topics or geos.
- Stale edges degrade trust. Use time decay and require a minimum number of recent interactions.
- Cold start for new users or items. Fall back to popularity within a cohort, content-based similarity, or vector search over embeddings.
- Complex query languages can creep into the API. Resist exposing a full graph query language to clients; provide a small set of safe patterns with typed parameters.
- Cross-shard traversals can magnify tail latency. Co-locate adjacency lists for connected communities when possible and consider partition-aware routing.
Interview Tip
When asked to design a recommendation API, volunteer a latency budget and show how your traversal fits inside it. For example: "We target p95 under 150 milliseconds. We cap fanout at ten then five across two hops, use cached adjacency lists, and hedge shard reads after twenty milliseconds. We always return at least a popularity-based fallback if traversal times out." This language shows control over uncertainty and is a strong signal in a system design interview.
Key Takeaways
- A traversal API is a controlled exploration of a local graph neighborhood with strict caps on hops and fanout
- Mixing online traversal with precomputed neighbor lists gives a good balance of freshness and stability
- Ranking is a blend of path strength, item quality, and diversity with clear safety filters
- Latency control requires timeouts, hedged requests, and partial results with graceful fallback
- Keep the API small with typed parameters rather than a general query language
Comparison Table
| Strategy | Strengths | Weaknesses | Latency profile | Freshness | When to use |
|---|---|---|---|---|---|
| Bounded BFS with caps | Simple, predictable, good for local signals | Misses far but useful nodes | Stable under caps | High | Social and follow based suggestions |
| Personalized PageRank style walk | Captures many paths with elegant math | Harder to reason about, parameter tuning | Moderate if precomputed, higher online | Medium to high | Mix of relevance and exploration |
| Random walk with restart | Natural diversity, robust to noise | Variable results, needs many samples | Variable | Medium | Discovery feeds and long tail surfacing |
| Precomputed item to item neighbors | Very fast serving, cheap online | Needs batch jobs and freshness strategy | Very low | Medium | Product detail similar items and watch next |
| Embedding search then graph rerank | Strong recall with context, flexible | Requires embedding pipeline and index | Low to moderate | High | Cold start or sparse graphs |
| Full graph database query | Expressive and flexible | Expensive to harden for low latency | High | High | Back office tools and experimentation |
FAQs
Q1. What is a graph traversal API for recommendations?
It is a service that explores a local neighborhood in a user or item graph under hop and fanout limits, applies filters and ranking, and returns a list of recommended entities.
Q2. How many hops should I allow for online serving?
One or two hops cover most useful cases. More than two often increases latency and noise without a clear gain.
Q3. How do I handle cold start users or items?
Use popularity within cohort, content based similarity, or embedding search to seed candidates, then apply the same ranking and policy filters.
Q4. Should I use a graph database or a key value store with adjacency lists?
For low-latency serving, a sharded key-value store with sorted adjacency lists is a common choice. Graph databases are great for analysis and tooling.
Q5. How do I keep results fresh without blowing the budget?
Precompute neighbor lists on a batch cadence and add a small online freshness layer that boosts very recent edges with strict caps.
Q6. What metrics should I track to know the system is healthy?
Track p50 and p95 latency, cache hit rate, edges touched per query, truncation rate due to caps, and business metrics in A B tests.
Further Learning
Get a structured foundation of graph and caching patterns in Grokking System Design Fundamentals. Practice end to end recommendation architecture and latency control in Grokking Scalable Systems for Interviews. For interview focused drills that mirror industry formats, see Grokking the System Design Interview.