1. System Definition
A notification system is a backend service that delivers timely information to users across various channels (email, SMS, push, in-app). Its purpose is to inform or alert users about events relevant to them, such as social updates, order statuses, reminders, etc. For example, after an e-commerce purchase, the system might send an order confirmation email, an SMS for payment confirmation, and a push notification when the package ships. In a social network, it would notify you (via in-app alert or push) when someone likes your post. Many real-world platforms (social media apps, online retailers, productivity tools) use such systems to increase user engagement and keep users informed in real time.
In our case, we are designing a multi-tenant notification service that can be shared by multiple applications (tenants). This means a single infrastructure will serve notifications for different products or client applications, each with its own rules and configurations. Real-world analogs include services like Twilio Notify, Amazon SNS, or custom internal services at companies like Facebook or Amazon that handle billions of notifications daily for various features. The system must handle billions of notifications per day across channels (email, SMS, push, in-app), supporting both immediate real-time delivery and batched (periodic or scheduled) delivery modes.
2. Requirements Clarification
Before diving into design, it's important to clarify the functional and non-functional requirements of the system.
Functional Requirements
- Multi-Channel Delivery: Support delivering notifications through various channels – email, SMS, mobile push notifications, and in-app messages. The same event may need to fan out to multiple channels (e.g. send a promotional coupon via both email and push). The system will generate a separate channel-specific message for each medium, ensuring all channels are handled in parallel with no inherent ranking.
- Real-Time & Batched Notifications: Handle both immediate event-driven notifications and bulk or scheduled notifications. Real-time notifications (e.g. a chat message or alert) should be delivered within ~1–2 seconds to the user. The system also supports batching use cases – for instance, an app might schedule a daily digest or a marketing campaign to millions of users at once. This requires efficient bulk dispatch (possibly by scheduling or trigger-based batch jobs) in addition to one-off, on-demand sends.
- Message Scheduling: Allow notifications to be scheduled for future delivery (e.g. a reminder set to go out at a specific time). Instead of immediate dispatch, such requests are stored and sent at the correct time via a scheduler component. For example, a meeting reminder can be queued to send 10 minutes before the event.
- Retry on Failure: Implement retry mechanisms for failed deliveries. If a notification fails to send on a channel (due to transient errors like provider downtime or network issues), the system should retry sending after a short delay, possibly with exponential backoff. This ensures at-least-once delivery – the system will make multiple attempts so that temporary issues don’t cause message loss.
- User Preferences & Opt-Out: Respect user notification preferences. The system should check if the user has unsubscribed or opted out of certain notification types or channels. For example, if a user disabled SMS notifications or opted out of promotional emails, the system must honor that and skip those channels. It should also support user-specific settings like preferred channels for certain alert types (e.g. critical alerts via SMS, newsletters via email).
- No Channel Prioritization: All supported channels are treated equally by the system. If an event triggers notifications on multiple channels, each is delivered independently without prioritizing one over the other.
- In-App Notification Management: Support in-app messaging – notifications that appear within the application’s UI (e.g. the notification bell or inbox in a social app). This implies the system will store notifications in a database so that a user can fetch their notification list. It also may push in-app alerts in real-time to active user sessions (via WebSocket or similar) for immediacy.
- API for Notification Producers: Provide clean APIs or endpoints that various applications (social media, e-commerce, productivity tools, etc.) can call to create notification requests. Each request includes details like the user(s) to notify, the content (possibly a template ID and data), and which channel(s) to use. The API should handle single-recipient notifications as well as bulk sends (e.g. specifying a list of users or a user segment for broadcast).
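The retry requirement above can be sketched as a small helper with exponential backoff. This is a minimal sketch, not the system's actual implementation; `send_fn` is a hypothetical stand-in for any channel-specific sender that returns True on success:

```python
import time

def send_with_retry(send_fn, payload, max_attempts=4, base_delay=0.5):
    """Attempt delivery with exponential backoff between attempts.

    `send_fn` (hypothetical) should return True on success; exceptions
    and False are both treated as transient failures.
    """
    for attempt in range(max_attempts):
        try:
            if send_fn(payload):
                return True
        except Exception:
            pass  # treat provider errors as transient
        if attempt < max_attempts - 1:
            # back off 0.5s, 1s, 2s, ... before the next attempt
            time.sleep(base_delay * (2 ** attempt))
    return False  # retries exhausted; caller logs a permanent failure
```

In practice a production system would also add random jitter to the delay and cap the total retry window, so that many failing sends don't retry in lockstep against a recovering provider.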
Non-Functional Requirements
- Scalability: The design should handle extremely high volume – on the order of billions of notifications per day – without performance degradation.
- Low Latency: The system's internal processing should add only on the order of ~100 ms, so real-time notifications reach users within the ~1–2 second end-to-end target. Batch notifications can tolerate higher latency.
- High Availability: Ensure the notification service is always available to send messages, even in the face of failures. This implies removing single points of failure, using redundancy, and possibly deploying in multiple regions or data centers.
- Reliability and Delivery Guarantees: Strive for at-least-once delivery of each notification – every notification request should eventually result in a delivered message or an explicit failure logged. It’s better to occasionally deliver a duplicate message than to drop one entirely. For certain critical notifications, we might aim for exactly-once delivery (no duplicates) by deduplicating or using idempotent tokens, but this can be hard to guarantee at scale. The system should ensure no notifications are lost in transit (persisting them to durable storage/queues) and use confirmation receipts from external providers to verify delivery.
- Extensibility & Maintainability: The architecture should be modular, making it easy to:
- Add new notification channels in the future (e.g., adding WhatsApp or voice call notifications).
- Onboard new tenant applications with minimal effort (they might just provide configuration and start using the APIs).
- Update business rules or templates without major code changes (possibly using configuration or a rules engine).
- Scale specific parts of the system independently (for instance, if push notification traffic grows, scale out the push delivery workers without necessarily scaling the email part).
- Compliance: Honor legal and regulatory requirements, such as user opt-outs (CAN-SPAM, etc. for emails).
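The at-least-once guarantee above means duplicates can occur, so consumers should deduplicate where it matters. A minimal in-memory sketch, assuming a notification ID travels with each message; a real system would keep seen IDs in a shared store (e.g. Redis with a TTL) rather than a local set:

```python
class IdempotentDispatcher:
    """Deduplicate sends by notification ID so redelivered queue
    messages never result in a double send to the user."""

    def __init__(self, send_fn):
        self._send_fn = send_fn  # hypothetical channel sender
        self._seen = set()       # stand-in for a shared dedup store

    def dispatch(self, notification_id, payload):
        if notification_id in self._seen:
            return "duplicate-skipped"
        self._send_fn(payload)
        self._seen.add(notification_id)
        return "sent"
```

Marking the ID as seen only after a successful send keeps the at-least-once property: a crash between send and record can still produce a duplicate, which is the accepted trade-off stated above.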
3. Capacity Estimation
We need to estimate the scale at which the system will operate to inform our design choices (e.g., how much infrastructure and what technologies are appropriate).
- Traffic Volume: Roughly 2 billion notifications/day (≈23k/sec on average), with about 1 KB each, resulting in ~2 TB of data daily.
- User Base: Suppose the combined user base of all applications is around 100 million users (some tenants might have tens of millions each). If each user on average receives 20 notifications per day (some get more, some less), that yields 2 billion notifications/day. Peak concurrency: millions of users could be online and receiving notifications simultaneously (for in-app and push channels).
- Infrastructure: Requires distributed message queues (e.g., Kafka) for high ingest rates, sharded databases/NoSQL for massive record volumes, and ample network bandwidth for peak loads (100k+ messages/sec).
- Read vs Write Workload: The system’s database workload will be write-heavy. Every notification results in a write (inserting a log record, updating an in-app notifications table, etc.), leading to billions of writes per day. Reads (such as retrieving a user’s notification list in-app, or querying logs for analytics) will be far fewer.
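The traffic figures above follow from simple arithmetic, reproduced here as a sanity check:

```python
# Back-of-the-envelope numbers from the estimates above.
notifications_per_day = 2_000_000_000
avg_size_bytes = 1_024  # ~1 KB per notification

avg_per_second = notifications_per_day / 86_400   # 86,400 seconds per day
daily_bytes = notifications_per_day * avg_size_bytes

print(f"average rate: ~{avg_per_second:,.0f}/sec")        # ~23,148/sec
print(f"daily volume: ~{daily_bytes / 1024**4:.1f} TiB")  # ~1.9 TiB
```

Peak traffic is typically several times the average (hence the 100k+ messages/sec figure above), which is what the queueing layer must be provisioned for.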
4. High-Level Design
Architecture Overview
At a high level, the notification system will be designed as a distributed, event-driven pipeline with multiple components cooperating asynchronously. The core idea is to decouple the act of generating a notification from the act of delivering it using a reliable queue, which allows the system to scale and to buffer bursts. Different components will handle: intake of requests, scheduling, queueing, processing per channel, and persistence. Each component can scale horizontally and be managed independently. Here's an outline of the architecture:
- Producer (Client Application): This is not part of our system per se, but it's any external app or service (the tenants) that needs to send notifications. They will interface with our system through an API or message interface.
- API Gateway / Load Balancer: Fronts the service, receiving API calls from various applications that want to send notifications. It handles authentication, rate limiting, and load balancing across multiple instances of the notification service.
- Notification Service (Dispatcher): The core service that accepts notification requests, processes them (validates data, checks preferences, formats messages), and enqueues the notification for delivery. This acts as the brain of the system, coordinating between upstream requests and downstream delivery channels.
- User Preference Service/DB: A service or database that stores user notification settings – which channels they prefer for which types of notifications, do-not-disturb windows, subscription statuses, etc. The Notification Service queries this to filter or modify outgoing notifications according to user wishes.
- Notification Queue System: A durable, high-throughput messaging/queue system (e.g. Apache Kafka, RabbitMQ, AWS SQS) that buffers notifications and decouples the ingestion of notification requests from the actual delivery processing. The Notification Service produces messages to the queue; downstream Channel Workers consume from it. This decoupling allows asynchronous processing and smooths out spikes in load, providing backpressure if needed.
- Scheduler (for Delayed/Batched Jobs): A component responsible for scheduling notifications that aren’t meant to be sent immediately. This could be a separate Scheduler Service that stores future notifications (e.g. in a database or a priority queue) with their target send time. When the time arrives, it pushes those notifications into the main Queue for delivery. This enables features like "send an alert at 9 AM" or batch sends for campaigns.
- Channel Workers: These are worker services specialized for each delivery channel: e.g., Email Sender, SMS Sender, Push Notification Sender, In-App Notifier. They pull messages from the Notification Queue and actually send the notification via the appropriate third-party service or protocol. Each channel worker knows how to format and communicate with the channel’s delivery mechanisms (for instance, the Email worker integrates with an SMTP server or email API service like SendGrid; the Push worker calls Apple or Google push services, etc.).
- Notification Database(s): Storage for persistent data: user preferences, templates, and notification logs or in-app messages. This might be split into multiple specialized stores: e.g., a relational database for logs and delivered notification records, a NoSQL store for user preferences (to handle high lookup volume), and maybe a blob storage for any large content/attachments that are part of notifications. In-app notifications likely reside in a database table keyed by user so they can be retrieved when the user opens the app.
- Template Service/Repository: (Optional) A component where notification templates are defined and stored. Rather than each request providing full text for every channel, clients could reference a template ID and data variables. The Notification Service would fetch the template and generate the actual message content per channel. This ensures consistency and easier updates to notification formats. Templates might be stored in a DB or even managed by the Notification Service itself if simple.
- Monitoring & Analytics Services: (Supplementary) Components that gather metrics, logs, and health information from the system for monitoring, alerting, and analysis. Not part of the main data flow but critical for operations.
Data Flow (Notification Request to Delivery)
Let's walk through how a notification travels through the system in real-time mode, and in batch mode:
- Request Ingestion: An external application (producer) makes an API call to send a notification. For example, an e-commerce platform calls POST /sendNotification with the user’s ID, notification type (say, "OrderShipped"), and perhaps some content or template data. This request hits the API Gateway, which authenticates it (ensuring the caller is allowed to send notifications) and forwards it to one of the Notification Service instances.
- Request Handling in Notification Service: The Notification Service receives the request. It validates the payload – confirming required information is present (such as a recipient identifier, message content or template, and at least one channel) and that it’s well-formed. It might also perform authentication/authorization checks if not done earlier. After validation, it quickly acknowledges receipt (especially if this is a synchronous API call) so the upstream caller can proceed, while the notification is handled asynchronously thereafter.
- Preference & Routing Logic: The Notification Service then determines how to route this notification. It queries the User Preference Service (or reads from a cached preferences store) to get the user’s current notification settings. This yields information such as: which channels the user has enabled, any channel overrides for this notification type, and whether the user has hit any rate limits or snooze settings. For example, if the user opted out of email for promos, and this is a promotional notification, the service would exclude email from the channels. It also checks system-wide rules (e.g., not sending more than N marketing notifications to a user per day) to decide if this notification should be throttled or dropped.
- Message Preparation: Now, the service knows which channels to send the notification through (based on the request and preferences). It prepares the content for each channel. If templates are used, the service fetches the template for, say, email and fills in the dynamic fields (like user name, order details). Each channel may require a slightly different payload – e.g., an email needs a subject and HTML body, whereas an SMS needs a short text, and a push notification might have a title, body, and custom data for the app. The Notification Service can transform the request into channel-specific messages.
- Enqueueing Messages: The service then enqueues the notification for delivery. Since there’s no priority among channels, it will enqueue a separate message for each channel the notification should go out on. This can be done in two ways: (a) using a unified queue/topic with the channel as part of the message data, or (b) using separate queues or topics for each channel (e.g., an "Email" queue, "SMS" queue, etc.). A common design is to use topic partitioning – for instance, a Kafka topic per channel – so that channel workers can consume their respective messages independently. For example, if a notification needs to go via Email, SMS, and Push, three messages are produced: one to the Email topic, one to the SMS topic, one to the Push topic. Each message includes the content and metadata needed for that channel (like the email subject or the SMS text, destination addresses, etc., along with a notification ID and perhaps a retry count). By pushing messages to a queue, we decouple the immediate request handling from the slower delivery process – the API call can complete quickly, and actual sending happens asynchronously.
- (Scheduled Notifications): If the request included a future send time or it’s part of a batch campaign, the Notification Service would invoke the Scheduler instead of immediately enqueueing for delivery. For a scheduled notification, the service stores the request (for example, in a Scheduler Service database or a delayed-queue) with the desired send timestamp. The Scheduler continuously scans or waits until the send time is reached, then moves those pending notifications into the regular delivery queue (feeding into step 5 at the appropriate time). This way, scheduled notifications join the same pipeline as real-time ones when their time comes. Batched campaigns might be handled by inserting many notifications into the scheduler or queue in bulk (possibly with their own tools to generate a large list of user-specific messages, as was done in the Duolingo Super Bowl campaign using precomputed user lists in S3).
- Queue Processing and Channel Delivery: The various Channel Processor services are subscribed to the Notification Queue (or specific topics). Each channel processor continuously pulls messages intended for its channel. For example, an Email Worker instance will read from the Email topic/queue. Once a message is pulled, the worker processes it: it establishes a connection to the relevant delivery provider or service and attempts to send the notification. This involves using channel-specific protocols or third-party APIs:
- The Email Processor might call an email sending service or SMTP server (such as SendGrid, Mailgun, or Amazon SES) to send out the email. It will format the email (HTML or text) if not already formatted, handle attachments or images, and send the message to the recipient’s email address.
- The SMS Processor will send the SMS text via an SMS gateway API (like Twilio or Nexmo). It may need to ensure the text fits within SMS length limits or split long messages, handle country codes, etc.
- The Push Notification Processor uses push services such as Firebase Cloud Messaging (FCM) for Android and Apple Push Notification service (APNs) for iOS. It prepares the payload (title, body, icon, any custom data) and sends it to the respective service, which will then deliver it to the app on the user’s device.
- The In-App Notification Processor will write the notification to the in-app notifications store (database) and if the user is currently online/connected, it can also deliver it in real-time via a persistent connection (for example, via a WebSocket message to the user’s session). This ensures the notification appears inside the app immediately if the user is active, and is stored for later retrieval if the user is offline.
Each of these processors operates independently, so an email notification could be sending at the same time as the SMS and push for the same event – there’s no requirement to do them in sequence. This parallelism improves overall latency for multi-channel notifications.
- Delivery Confirmation & Logging: After attempting to send, each channel processor gets a result – success or failure. For channels with synchronous APIs, the result is immediate; for some (like email), there might be callbacks or events for delivery/bounce later, but at least initial send status is known. The processor then logs the outcome of the notification into a Notification Logs store. A log entry typically includes: notification ID, user ID, channel, timestamp, status (sent, delivered, failed, etc.), and perhaps an error code or provider response if failed. These logs enable tracking and auditing – e.g. for customer support queries “I didn’t get this email” or for system health metrics. If the send was successful, the log is marked delivered; if it failed, the system may mark it for retry (and log an initial failure with maybe a pending status). All log entries are stored in a database built to handle high write volume. This could be a relational DB table or a time-series store or even a log indexing system, depending on query needs.
- User Notification Retrieval: For in-app notifications, the final step is allowing the user to read their notifications. The app’s client (e.g., mobile app or web frontend) can call an API (possibly the Notification Service or a dedicated Notification Query Service) to fetch recent notifications for that user. The service will query the Notifications DB for that user’s records (potentially using a cache for speed) and return them so the client can display, for example, a list of notifications (read/unread status, etc.). This read path is separate from the write path of sending notifications to maintain high throughput – often even separated into a different service or read replica to not impact send performance. The query service can also support features like marking notifications as read, or searching within notifications, but those are extensions beyond core delivery.
Throughout this flow, no single channel is prioritized or blocks another – each channel’s delivery is handled independently via the queue. This design ensures the fastest possible dispatch on all channels and isolates any slowness (for instance, an email service lag) to that channel’s worker without holding up others.
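The "Enqueueing Messages" step above (one message per channel topic) can be sketched as follows. The in-memory `topics` dict is a stand-in for a real Kafka producer, and the topic naming scheme is an assumption for illustration:

```python
import json
import uuid
from collections import defaultdict

# Stand-in for a Kafka producer: one in-memory list per channel topic.
topics = defaultdict(list)

def enqueue_notification(user_id, channels, content_by_channel):
    """Fan a single notification out into one message per channel topic,
    each carrying the notification ID and a retry counter."""
    notification_id = str(uuid.uuid4())
    for channel in channels:
        message = {
            "notification_id": notification_id,
            "user_id": user_id,
            "channel": channel,
            "content": content_by_channel[channel],
            "retry_count": 0,
        }
        topics[f"notifications.{channel}"].append(json.dumps(message))
    return notification_id
```

Because each channel gets its own message on its own topic, a slow email provider only backs up the email topic; the SMS and push workers keep draining theirs.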
5. Data Storage and Schema
In this hybrid notification system, SQL tables store structured, relational data (like user info, preferences, and templates) for consistency and ease of management, while NoSQL tables handle high-volume, unstructured event data (notification history and statuses) for scalability. This approach leverages the strengths of both: relational databases ensure ACID compliance for critical metadata, and NoSQL databases provide flexible schemas and horizontal scaling to handle millions of notification records efficiently. Below is the schema design with each table’s fields, data types, and descriptions in tabular format, along with explanations of design considerations before and after the tables for clarity.
SQL Tables
The SQL tables manage static or slowly changing data such as tenant configurations, user profiles, notification preferences, templates, scheduled jobs, and audit logs. These tables are normalized and use clear relationships (foreign keys) to maintain data integrity. Each table is outlined with its fields:
tenants
Stores tenant (application) details. Each tenant represents a client application or organization using the notification system.
Field Name | Data Type | Description |
---|---|---|
tenant_id | INT (PK) | Unique identifier for the tenant (primary key). |
name | VARCHAR(255) | Name of the tenant application or organization. |
contact_email | VARCHAR(255) | Contact email for the tenant’s administrator or support. |
created_at | DATETIME | Timestamp when the tenant was registered. |
status | VARCHAR(50) | Status of the tenant (e.g., “active”, “inactive”), used to enable/disable notifications for this tenant. |
Explanation: The tenants table ensures multi-tenancy support by segregating data per application. It includes basic identification and contact info for each tenant, along with a status flag to control activity. This structured information is well-suited for a relational table since it’s relatively static and requires consistency.
users
Stores users who receive notifications. Users are associated with tenants.
Field Name | Data Type | Description |
---|---|---|
user_id | INT (PK) | Unique identifier for the user (primary key). |
tenant_id | INT (FK) | Identifier of the tenant this user belongs to (foreign key to tenants ). |
name | VARCHAR(100) | Full name of the user. |
email | VARCHAR(255) | Email address of the user (used for email notifications). |
phone | VARCHAR(20) | Phone number of the user (used for SMS or phone notifications). |
created_at | DATETIME | Timestamp when the user account was created. |
status | VARCHAR(50) | Account status of the user (e.g., “active”, “disabled”). |
Explanation: The users table holds profile and contact information for notification recipients. It relates to the tenants table via tenant_id to ensure each user is tied to the correct tenant (application). Storing users in SQL guarantees referential integrity (e.g., notifications link to valid users) and makes it easy to join with preferences or templates if needed.
user_preferences
Stores per-user notification preferences, such as which types of notifications or channels a user has enabled or disabled.
Field Name | Data Type | Description |
---|---|---|
user_id | INT (PK, FK) | Identifier of the user (primary key, foreign key to users ). Each user has one preferences record. |
email_notifications | BOOLEAN | Whether the user wants to receive email notifications (true = opt-in). |
sms_notifications | BOOLEAN | Whether the user wants to receive SMS/text notifications. |
push_notifications | BOOLEAN | Whether the user wants to receive push/in-app notifications. |
language_preference | VARCHAR(10) | (Optional) Preferred language or locale for notifications (e.g., "en", "es"). |
updated_at | DATETIME | Timestamp of the last update to this user’s preferences. |
Explanation: The user_preferences table captures notification settings for each user. By separating preferences from the main user profile, the system can easily extend or modify preference fields without altering core user data. Each user has one row containing various preference flags (for different channels or categories of notifications). In a relational model, this structured approach simplifies queries for preference-controlled sends (for example, checking if a user has opted into email before sending).
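The preference check described above reduces to filtering the requested channels against this row's boolean flags. A minimal sketch, with the row represented as a dict mirroring the table's columns:

```python
def allowed_channels(requested, prefs):
    """Return only the channels the user has opted into, given a
    user_preferences row (as a dict of the table's boolean columns)."""
    flag_for = {
        "email": "email_notifications",
        "sms": "sms_notifications",
        "push": "push_notifications",
    }
    # Unknown channels default to False, i.e. skipped.
    return [c for c in requested if prefs.get(flag_for.get(c, ""), False)]
```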
notification_templates
Stores reusable message templates for notifications. Templates allow consistent formatting of messages and reuse across multiple notifications.
Field Name | Data Type | Description |
---|---|---|
template_id | INT (PK) | Unique identifier for the notification template. |
tenant_id | INT (FK) | Tenant that owns this template (foreign key to tenants ), allowing tenants to have custom templates. |
name | VARCHAR(100) | Template name or key (e.g., “WelcomeEmail”, “PasswordReset”). |
subject | VARCHAR(255) | Subject line for the notification (if applicable, e.g., for email notifications). |
body | TEXT | Body content of the template, with placeholders for dynamic data (e.g., “Hello {username}, ...”). |
channel | VARCHAR(50) | Channel/type of notification this template is for (e.g., “email”, “sms”, “push”). |
created_at | DATETIME | Timestamp when the template was created. |
updated_at | DATETIME | Timestamp when the template was last modified. |
Explanation: The notification_templates table enables defining message formats once and reusing them. Each template can be specific to a channel and possibly personalized with placeholders. Storing templates in SQL ensures they can be transactionally updated (e.g., updating the content or disabling a template) with versioning if needed. Since templates are relatively static and shared data, an SQL table is appropriate for strong consistency (so that notifications always use the latest approved content for compliance).
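Rendering a template body against its dynamic data is a one-liner; this sketch uses Python's str.format for the {placeholder} style shown in the table (the "welcome" body below is a hypothetical example):

```python
def render_template(body, params):
    # str.format fills {placeholder} fields; a missing field raises
    # KeyError, so a half-rendered message fails before sending
    # rather than reaching the user.
    return body.format(**params)

# Hypothetical template body in the style of the table's example:
welcome_body = "Hello {username}, your order {order_id} has shipped."
```

Failing loudly on a missing field is the safer default here: it turns a template/data mismatch into a logged send failure instead of a delivered message with blanks in it.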
scheduled_notifications
Stores notifications that are scheduled for future delivery (e.g., reminders or announcements that should be sent at a later time or on a specific schedule).
Field Name | Data Type | Description |
---|---|---|
schedule_id | INT (PK) | Unique identifier for the scheduled notification entry. |
tenant_id | INT (FK) | Tenant context for the notification (foreign key to tenants ), helpful for multi-tenant filtering. |
user_id | INT (FK) | The intended recipient user’s ID (foreign key to users ). |
template_id | INT (FK) | Reference to the notification template to use (foreign key to notification_templates ). |
scheduled_time | DATETIME | Date and time when the notification is scheduled to be sent. |
status | VARCHAR(50) | Current status of the scheduled notification (e.g., “scheduled”, “sent”, “canceled”). |
created_at | DATETIME | Timestamp when this schedule was created. |
created_by | VARCHAR(50) | Identifier of who scheduled the notification (could be a user ID or system/admin user). |
Explanation: The scheduled_notifications table holds messages that need to be sent in the future. Each entry ties a user with a template to send at a specific time. Using SQL here allows for reliable scheduling (with transactions to avoid duplicate scheduling or race conditions) and easy querying (e.g., find all notifications to send in the next hour). The status field tracks whether the job is still pending, already sent, or cancelled, and can be updated as the job is processed. When the scheduled time arrives, the system will retrieve these entries and then create actual notification events (which get recorded in the NoSQL notifications history).
audit_logs
Stores logs for compliance and tracking important system events. This includes records of notifications sent, preference changes, template edits, or any administrative actions, to provide an audit trail.
Field Name | Data Type | Description |
---|---|---|
log_id | BIGINT (PK) | Unique identifier for the audit log entry. |
tenant_id | INT (FK) | Tenant associated with the event (foreign key to tenants ), if applicable. |
user_id | INT (FK) | User associated with the event (if applicable, e.g., the user who was notified or whose preferences changed). |
action | VARCHAR(100) | Description of the action or event (e.g., “NOTIFICATION_SENT”, “PREFERENCE_UPDATED”, “TEMPLATE_MODIFIED”). |
details | TEXT | Additional details about the event (e.g., notification content snippet, old vs new values for changes, error messages if any). |
performed_by | VARCHAR(100) | Who performed the action (could be a user id, admin id, or system process name). |
timestamp | DATETIME | Date and time when the event was logged. |
Explanation: The audit_logs table provides a historical record of significant events for compliance auditing and debugging. By storing these in SQL, we ensure they are immutable and safely stored (with ACID guarantees) — important for compliance requirements. For example, whenever a notification is sent, an entry could be logged here (in addition to the NoSQL history) to satisfy regulatory needs that require easily queryable records of communication. It can also log user preference changes or template changes to track who did what and when. This structured log data can be filtered by tenant or user for audit reports.
NoSQL Tables
The NoSQL collections handle the high-scale data: the actual notifications sent to users and the tracking of their delivery/read status. Using a NoSQL database (like MongoDB or Cassandra) for this part of the system allows it to scale horizontally to millions of records and high write throughput with flexible schema. Each notification event is stored as a document, and any updates to its delivery status are recorded separately.
notifications
(NoSQL collection) Stores the notification history for users – each record represents a notification event (a message sent to a user).
Field Name | Data Type | Description |
---|---|---|
notification_id | UUID/String (PK) | Unique identifier for the notification event (could be a UUID or auto-generated ObjectID in a document store). |
tenant_id | INT or String | Tenant identifier for context (duplicates the tenant for quick filtering in NoSQL, since joins aren’t used). |
user_id | INT or String | ID of the user who received the notification. |
template_id | INT or String | Reference to the template used (if any) for this notification. |
message | TEXT/JSON | The content of the notification sent (could be text, JSON for structured content, etc.). |
channel | STRING | Delivery channel used (e.g., “email”, “SMS”, “push”). |
sent_at | DATETIME | Timestamp when the notification was sent (or created in the system). |
metadata | JSON (optional) | Additional metadata for the notification (e.g., context info like subject line, recipient info, or template parameters used). |
Explanation: The notifications collection holds each notification that has been generated and sent to users. Storing these in a NoSQL database enables handling a massive volume of events — for example, every single notification for every user can be logged without impacting the performance of the transactional system. Each document includes the content and context of the notification, which is useful for building a notification history or inbox feature for users. We include references like template_id to know which template was used (if needed for reconstructing or analyzing messages), and store the actual message content so that the history is preserved even if templates change over time. The channel field helps identify how it was delivered (for example, to filter user history by channel). Storing tenant_id and user_id with each record (even though that duplicates relational data) is intentional in NoSQL design to optimize queries by those fields (common access patterns are fetching all notifications for a given user or tenant). This denormalization is acceptable given NoSQL's focus on read performance and scalability. In a scale-out design, each notification event should capture all the data it needs without requiring joins, which is achieved by including message content and context directly in the document.
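As an illustrative sketch (the helper function and example values are hypothetical; field names follow the table above), a document for this collection could be assembled like so before being written to the store:

```python
import uuid
from datetime import datetime, timezone

def build_notification_doc(tenant_id, user_id, channel, message,
                           template_id=None, metadata=None):
    """Assemble a notification document matching the schema above.

    tenant_id and user_id are denormalized into every document so the
    common queries ("all notifications for user X / tenant Y") need no joins.
    """
    return {
        "notification_id": str(uuid.uuid4()),
        "tenant_id": tenant_id,
        "user_id": user_id,
        "template_id": template_id,
        "message": message,  # full content kept even if the template later changes
        "channel": channel,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata or {},
    }

doc = build_notification_doc("shop-app", "user-42", "email",
                             "Your order #1001 has shipped.",
                             template_id="order_shipped_v2")
```

In a real deployment this dictionary would simply be inserted into the document store (e.g., a MongoDB `insert_one` call); nothing about the shape depends on the specific database.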
notification_status (NoSQL collection): Tracks the delivery status of notifications, separate from the notification content. This is used to monitor whether notifications have been delivered, read, failed, etc., and to update their status over time.
Field Name | Data Type | Description |
---|---|---|
status_id | UUID/String (PK) | Unique identifier for this status record (could be a unique ID or use composite key of notification+timestamp). |
notification_id | UUID/String | Identifier of the notification this status update corresponds to (foreign key reference to a record in notifications collection). |
user_id | INT or String | ID of the user who is the recipient of the notification (for convenient querying by user, duplicates from notifications ). |
status | STRING | Delivery status value (e.g., “sent”, “delivered”, “opened”, “failed”, “read”). |
updated_at | DATETIME | Timestamp when this status was recorded (when the status changed or was reported). |
details | JSON (optional) | Additional details about the status, if any (e.g., error message if failed, device info for delivery, or read timestamp). |
Explanation: The notification_status collection is designed to keep track of the evolving status of each notification. By separating status tracking into its own collection, the system can record multiple status changes over time without rewriting the main notification record. For example, a notification might be sent at time A, delivered at time B, and read by the user at time C – each of these can be a separate entry in notification_status. This separation improves write performance (as status updates are high-frequency and can be appended as new documents) and avoids contention on the main notification record. It also allows querying the status history independently (for analytics or troubleshooting). Each status entry references the notification_id and the user_id for ease of lookup. This design is similar to having a NotificationReceivers/Statuses collection in other systems, which links a notification to its recipient and status. By logging statuses (like read/unread or delivered/failed) separately, we can efficiently track who has seen or received each notification and scale to large numbers of events and users without heavy joins.
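The append-only pattern can be sketched as follows (the helper is hypothetical; field names follow the table above). Each status change becomes a fresh document, so the lifecycle of one notification is a small, ordered series of records:

```python
import uuid
from datetime import datetime, timezone

def status_record(notification_id, user_id, status, details=None):
    """Build an append-only status entry. Each change is a new document,
    so the main notification record is never rewritten and concurrent
    status updates never contend on the same row."""
    return {
        "status_id": str(uuid.uuid4()),
        "notification_id": notification_id,
        "user_id": user_id,  # duplicated from notifications for direct per-user queries
        "status": status,
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "details": details or {},
    }

# Lifecycle of one notification: sent -> delivered -> read
history = [status_record("n-1", "user-42", s)
           for s in ("sent", "delivered", "read")]
```

Querying by `notification_id` then sorting on `updated_at` reconstructs the full delivery timeline for analytics or troubleshooting.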
- Dead Letter Queue (DLQ) storage: This is for notifications that failed to send after all retry attempts. We can have a separate persistent store or queue for these. A simple approach: a table FailedNotifications(notification_id, user_id, channel, error_code, last_attempt_time) that the ops team can review. Or integrate with a message queue’s DLQ feature (e.g., a Kafka topic or RabbitMQ DLQ) where messages go after max retries. These records might be periodically reprocessed (maybe after the underlying issue is resolved) or at least analyzed. They are also useful for debugging issues with providers.
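A minimal sketch of the routing decision (the queue objects are illustrative stand-ins for Kafka topics or RabbitMQ queues, and the attempt limit is an assumed configuration value): after each failure the message either goes back for another try or is parked in the DLQ.

```python
MAX_ATTEMPTS = 3  # assumed config value; tune per channel

def route_failed(message, error_code, dlq, retry_queue):
    """After a send failure, either requeue the message for another
    attempt or park it in the DLQ once retries are exhausted."""
    message["attempts"] = message.get("attempts", 0) + 1
    message["last_error"] = error_code
    if message["attempts"] >= MAX_ATTEMPTS:
        dlq.append(message)          # ops team reviews / reprocesses later
    else:
        retry_queue.append(message)  # another delivery attempt

dlq, retries = [], []
msg = {"notification_id": "n-1", "channel": "sms"}
for _ in range(3):
    route_failed(msg, "carrier_timeout", dlq, retries)
# after three failures the message lands in the DLQ
```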
Notification Scheduler & Batch Processing
To support scheduled and batch notifications, a Scheduler component is included. This could be as simple as a database of future notifications with a background job, or a distributed scheduling system. One implementation: use a table ScheduledNotifications(id, user, content, channels, send_time) where notifications are stored if send_time is in the future. A Scheduler Service periodically scans for due notifications (or a lightweight cron job queries for items where send_time <= now()), then enqueues them to the Notification Queue for delivery. For precision (when a large batch needs to go out exactly at a specific time), the system might pre-fetch those notifications slightly before due time and stage them. In high-scale cases, a more sophisticated approach is needed: e.g., use a delayed queue or priority queue data structure. Some message queues (like RabbitMQ or AWS SQS with delay) support delayed messages natively. Another approach is to partition scheduled jobs by time (e.g. per minute) and have distributed workers pick up the correct bucket at the right time. The key is that scheduling logic runs in a fault-tolerant way: possibly multiple scheduler nodes for redundancy, using locks or leader election to avoid duplicate sends. If a scheduled notification is missed (e.g., scheduler down at that minute), the system should detect and recover or have a manual fallback.
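The simple scan-and-enqueue loop can be sketched like this (SQLite stands in for the real database, and the table/column names follow the ScheduledNotifications example above; a production version would wrap the select-and-delete in a transaction or row lock so multiple scheduler nodes don't double-send):

```python
import json
import sqlite3

def poll_due(conn, enqueue, now):
    """Scan ScheduledNotifications for rows with send_time <= now, hand
    each to the queue, then delete them so the next scheduler pass does
    not re-send. Returns the number of notifications enqueued."""
    cur = conn.execute(
        "SELECT id, user, content, channels FROM ScheduledNotifications "
        "WHERE send_time <= ?", (now,))
    due = cur.fetchall()
    for row in due:
        enqueue({"id": row[0], "user": row[1], "content": row[2],
                 "channels": json.loads(row[3])})
    conn.executemany("DELETE FROM ScheduledNotifications WHERE id = ?",
                     [(r[0],) for r in due])
    return len(due)

# Demo: one due notification, one still in the future
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ScheduledNotifications "
             "(id INTEGER PRIMARY KEY, user TEXT, content TEXT, "
             "channels TEXT, send_time REAL)")
conn.execute("INSERT INTO ScheduledNotifications VALUES "
             "(1, 'u1', 'daily digest', '[\"email\"]', 100.0)")
conn.execute("INSERT INTO ScheduledNotifications VALUES "
             "(2, 'u2', 'promo', '[\"push\"]', 999.0)")
queue = []
sent = poll_due(conn, queue.append, now=200.0)
```

Running the loop on a short interval (or from cron) gives minute-level precision; the delayed-queue and time-bucket approaches described above replace this polling when scale demands it.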
For batch campaigns (e.g., sending a marketing email to 10 million users), it’s inefficient for a client to call the API 10 million times. Instead, we might provide a bulk interface or offline loading mechanism. For instance, an admin could upload a list of target user IDs for the campaign. The system (as described by the Duolingo case) might pre-fetch those users’ data, store the list in a file or DB, and then when triggered, rapidly push all those notifications through the pipeline. The workers and queue must be scaled up to handle this burst. To avoid overwhelming downstream providers (like an email service), we might intentionally throttle the consumption rate or use multiple provider accounts spread over the load.
Worker Architecture and Queuing Mechanisms
The Channel Processors (workers) are designed to be stateless and horizontally scalable. Each worker instance is a consumer from the queue. For a high-throughput system, a distributed log/queue like Kafka is a good choice because it can handle very high message rates and partition the stream across many consumers. We would create (for example) separate Kafka topics for Email, SMS, Push, etc., each with multiple partitions. The number of partitions defines the maximum parallelism – we can have one consumer thread per partition. If we anticipate 100k notifications/sec on peak, and a single consumer can process say 1k messages/sec, we’d need on the order of 100 consumers working in parallel across partitions. We might allocate, say, 50 partitions for email, 20 for SMS (if volume is lower), 30 for push, etc., based on expected load per channel. Each partition could be processed by one consumer instance, and we can always add more consumers up to the partition count. Workers can be auto-scaled based on queue backlog or system load. Each message includes a unique notification ID and perhaps a deduplication key (especially if using at-least-once delivery semantics, so if the same message reappears due to a retry, the worker can detect it has seen that ID before and skip or update instead of duplicate sending). Workers also maintain a retry count or delivery status in memory or via the log metadata.
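The sizing arithmetic above can be captured in a back-of-the-envelope helper (the headroom factor is an assumption, not a Kafka parameter): since one partition is consumed by at most one consumer in a group, the partition count must be at least the consumer count you compute.

```python
import math

def consumers_needed(peak_msgs_per_sec, per_consumer_rate, headroom=1.25):
    """Back-of-the-envelope parallelism sizing. headroom covers bursts
    and consumer-group rebalances; partitions must be >= this number,
    since a Kafka partition feeds at most one consumer in a group."""
    return math.ceil(peak_msgs_per_sec * headroom / per_consumer_rate)

# The example from the text: 100k msgs/sec peak, ~1k msgs/sec per consumer
n = consumers_needed(100_000, 1_000)          # 125 with 25% headroom
bare = consumers_needed(100_000, 1_000, 1.0)  # 100 with no headroom
```

The per-channel split (50 email / 20 SMS / 30 push partitions) then just apportions this total by each channel's share of expected volume.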
Queue Semantics: The queue should guarantee at-least-once delivery of messages to the workers, meaning a message will be retried if a worker crashes mid-processing. Kafka by default provides at-least-once (consumers commit offsets after processing). We could also use a system like RabbitMQ which can requeue un-ACKed messages. In either case, if a worker fails after partially sending, we need idempotency to avoid duplicates. Alternatively, an exactly-once pipeline could be built with more complexity (e.g., Kafka transactions, idempotent producers/consumers). Many large systems choose at-least-once with de-duplication on the consumer side as needed. Deduplication could involve checking a cache or database of recently sent notification IDs.
No Prioritization: Since “no prioritization between channels” is required, we are not weighting the queue processing in favor of any channel. All channel topics are processed as fast as possible. Priorities can still be handled within a channel if needed (for example, maybe an SMS could have normal vs high priority messages in different queues), but the problem statement says no prioritization across channels, so we treat them equally.
Ordering Considerations: Generally, notifications to the same user via the same channel should be delivered in order of generation. Using a partitioning scheme where the partition key is the user ID (or some hash of it) can ensure that all notifications for a given user go to the same partition (and thus are processed in sequence). This prevents, say, an “Order Shipped” notification from overtaking the “Order Placed” notification for the same user. Cross-user ordering doesn’t matter. Partitioning by user also distributes load roughly evenly if user IDs are random. If certain users generate a lot of notifications (e.g. a very active user), that partition could see heavier load – in extreme cases we might partition by a composite key (user and maybe notification type) or just accept some imbalance for ordering guarantees. Another approach is partition by notification type or channel-specific logic (ensuring, for example, all promotional emails are spread out). The design can incorporate sharding at multiple levels: e.g., user-based sharding for preferences DB and partitioning by time for logs, but for the queue, user-based partitioning is a simple strategy to preserve order per recipient.
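The user-keyed partitioning can be sketched in a few lines (the function name is illustrative). The important detail is using a stable hash rather than Python's built-in `hash()`, which is randomized per process, so every producer maps the same user to the same partition:

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user ID to a partition index with a stable hash. All of a
    user's notifications land on one partition and are therefore
    processed in order; different users spread across partitions."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

p1 = partition_for("user-42", 50)
p2 = partition_for("user-42", 50)  # same user -> same partition, always
```

Kafka's default partitioner does essentially this when the message key is set to the user ID, so in practice setting the key is often all that's needed.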
Retry and Failure Handling
Retry Policies: Each channel processor will implement a retry mechanism for transient failures. A common strategy is exponential backoff – if a send fails, wait a short interval and try again, increasing the wait time on each failure. For example, after the 1st failure wait 1s, after the 2nd wait 5s, then 30s, etc. We will configure a maximum number of retries (say 3 attempts) per notification per channel. Some channels might have specific retry rules: e.g., for SMS, if the first attempt fails due to a carrier issue, it might be pointless to retry quickly – maybe we retry after a longer delay or use an alternate SMS provider if available. For email, a common pattern is to retry a couple of times over a few minutes for temporary SMTP issues. Push notifications typically either succeed or not; if a device token is invalid, a retry won’t help (the failure is permanent for that token). So the worker should classify errors into permanent vs transient. Permanent errors (invalid address, unregistered device, etc.) will not be retried; they are logged and dropped immediately. Transient ones (server timeouts, rate limit exceeded, etc.) trigger the retry logic.
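The classify-then-backoff decision can be sketched as a single function (the error codes and the 1s/5s/30s schedule mirror the examples above; both are illustrative, not a provider's actual codes):

```python
# Illustrative error taxonomies; real providers return their own codes.
TRANSIENT = {"timeout", "rate_limited", "server_error"}
PERMANENT = {"invalid_address", "unregistered_token", "hard_bounce"}

def next_retry_delay(error_code, attempt, schedule=(1, 5, 30)):
    """Return seconds to wait before the next attempt, or None if the
    notification should not be retried: permanent errors are dropped
    immediately, and transient errors go to the DLQ once the backoff
    schedule is exhausted."""
    if error_code in PERMANENT:
        return None                    # e.g. invalid device token: log and drop
    if attempt >= len(schedule):
        return None                    # retries exhausted -> dead letter queue
    return schedule[attempt]

delays = [next_retry_delay("timeout", a) for a in range(4)]
# backoff schedule for a transient error: 1s, 5s, 30s, then give up
```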
The system can implement retries in a few ways. One is in the worker process itself with an in-memory timer or loop. Another robust way at scale is to publish the failed message to a retry queue (or back onto the main queue with a delay). For instance, if using Kafka, we might have separate topics like Email_Retry or use a field in the message for attempt count and have workers requeue the message with an incremented attempt count and a timestamp to not process until a certain time. Some setups use a dead-letter queue pattern: after N failed attempts, the message goes to a DLQ instead of retrying further.
Deduplication: With at-least-once delivery and retries, deduplicating notifications is important so users don’t get spammed with duplicates. We handle this by using a unique notification ID for each notification request (the API should provide or the system generates one). This ID travels with all channel messages. Workers or the Notification Service can keep a cache of recently seen IDs or check in the database before sending. For example, the Notification Service could store a short-lived record of "notification request processed" keyed by that ID, so if the same request comes again (due to client retry or message duplication), it ignores it. On the consumer side, if a worker sees a message ID that was already processed (could check a Redis set or an entry in the logs), it will skip sending. Idempotent operations are also key: if we accidentally send the same email twice, many email providers have ways to detect duplicate IDs if we pass a message ID, but we shouldn’t rely solely on that. Within our system, a combination of careful commit handling in the queue and idempotency checks prevents duplicates from normal operation. The Duolingo example explicitly mentioned ensuring no duplicate messages even if two triggers fired, highlighting the need for idempotency in the face of concurrent events.
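A consumer-side dedup check can be sketched with an in-memory TTL cache (a stand-in for the Redis set mentioned above; the class and its parameters are illustrative). The worker calls `seen()` before sending and skips the message if it comes back `True`:

```python
import time

class DedupCache:
    """Tracks recently processed notification IDs — an in-memory stand-in
    for a Redis key with a TTL. seen() returns True if the ID was already
    processed within the window, so the worker can skip a duplicate send
    caused by at-least-once redelivery."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._seen = {}  # notification_id -> first-seen timestamp

    def seen(self, notification_id, now=None):
        now = time.time() if now is None else now
        # evict entries older than the TTL window
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.ttl}
        if notification_id in self._seen:
            return True
        self._seen[notification_id] = now
        return False

cache = DedupCache(ttl_seconds=60)
first = cache.seen("n-1", now=0)   # False: first delivery, proceed to send
dup = cache.seen("n-1", now=10)    # True: redelivered message, skip
```

In production the shared Redis version matters because any worker in the group may receive the redelivered message, not just the one that sent the original.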
Failure Handling and Fallbacks: Not all failures are temporary. If a channel is down or a third-party service is having an outage, we may want to failover or fallback. For instance, if our primary SMS provider fails, a secondary provider could be used. The system can be configured with multiple integrations per channel, ranked by preference. On failure to deliver via the first, it can try the second. This improves reliability at the cost of complexity and possibly higher cost. Large systems often integrate multiple vendors for critical channels (e.g., two email service providers) and implement logic to route around outages. Similarly, if sending push notifications, if one region of FCM is unresponsive, we might try another region’s endpoint. These fallbacks should still obey the no-duplication rule (only one should ultimately succeed).
If a notification ultimately fails after retries (e.g., email bounced, or we exhausted retries for a transient error), the system should record that failure in the log and possibly move the notification to a Dead Letter Queue for manual review. Operators can inspect DLQ messages and decide to reprocess them later or alert the source service that the notification was not delivered. For example, if an email keeps bouncing, perhaps the user’s email is invalid – the system could flag that user’s email contact in the preferences as invalid to avoid future attempts.
Graceful Degradation: Under extreme load or partial failures, the system should degrade gracefully. For instance, if the notification database is down, the system might still attempt to send notifications but skip logging to the DB (or log to an in-memory buffer or fallback store to flush later). If the queue is backed up (indicating downstream slowness), the rate of accepting new requests might be throttled via the API gateway or using backpressure signals. The idea is to prevent total crashes – e.g., shed non-critical load if needed (by requirement we aren’t prioritizing channels, but an extension could prioritize notification types in an overload scenario, dropping or delaying the less critical ones).
Additional Features
Rate Limiting per User/Channel: To prevent a user from being overwhelmed by notifications (especially promotional), the system should enforce limits. For example, not more than X promotional notifications per day per user. This requires counting notifications sent (which could be done via the logs or a separate counter service) and checking before sending. If a limit is exceeded, the system can defer or drop some notifications (or combine them into a digest). This is typically part of the business logic in the Notification Service or User Preferences evaluation.
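The per-user counting can be sketched as follows (the class is illustrative; in production a Redis `INCR` on a key like `promo:{user}:{date}` with an `EXPIRE` would play the same role across all service instances):

```python
from collections import defaultdict

class DailyLimiter:
    """Per-user, per-category daily counter. allow() increments the
    counter and returns False once the day's quota is spent, at which
    point the caller can defer, drop, or fold the message into a digest."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # (user, category, day) -> count

    def allow(self, user_id, category, day):
        key = (user_id, category, day)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True

limiter = DailyLimiter(limit=2)
results = [limiter.allow("user-42", "promo", "2024-06-01") for _ in range(3)]
# third promotional send of the day is refused
```

This check naturally lives next to the user-preference evaluation, since both gate a notification before it is fanned out to channels.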
Notification Format and Personalization: The system design allows personalization by using templates and data. It also should handle localization (different languages for notifications, possibly choosing template based on user locale). Templates and content might be stored or managed through a CMS that the Notification Service can query, but at design time it suffices to note templates exist and are fetched when composing messages.
Security Considerations: Ensure that the Notification Service APIs are protected (e.g., with OAuth tokens or API keys) so that only authorized systems (like the company’s own backends) can send notifications – otherwise spammers could abuse it. Also, validate content to prevent injection (especially if content might be HTML for email, or if we allow some rich content in notifications). The system should also not leak data – e.g., one user shouldn’t be able to query another’s notifications due to proper auth checks.
7. Scalability and Performance Considerations
Designing for 3B notifications/day requires careful attention to scalability. Key strategies include distributing load, parallelizing work, and eliminating bottlenecks:
- Horizontal Scaling & Load Balancing: All components are horizontally scalable. We will run multiple instances of the Notification Service behind a load balancer, so incoming API requests are distributed evenly. Similarly, we scale out channel worker instances – if more throughput is needed for email sending, we add more email worker processes or machines, each consuming from the queue partitions. The queue system (Kafka, for example) itself runs as a cluster on multiple brokers and can partition data across nodes, allowing it to handle more messages. By scaling horizontally, we can linearly increase capacity by adding more machines rather than needing any single machine to be extremely powerful. The system should be designed such that there is no single point where all traffic funnels through one instance (to avoid overload).
- Efficient Queueing and Parallelism: Using a high-throughput distributed queue like Kafka ensures we can ingest and distribute messages quickly. Kafka can handle millions of messages per second in large clusters, so 100k/sec is feasible with a proper setup. We will tune partition counts to achieve the desired parallelism. Additionally, queue producers and consumers should use batching where possible – for example, the Notification Service might send messages to Kafka in batches to reduce overhead, and consumers might commit offsets in batches. The asynchronous design also means the front-end API isn’t waiting on the slowest part of delivery, improving perceived performance.
- Caching & Fast Data Access: To keep latency low, we use caching for frequently accessed data like user preferences and templates. A cache like Redis can store user preference records in memory, avoiding a database lookup for each notification. This reduces latency in the critical path (preferences check) to microseconds. Similarly, if templates are stored in a DB or fetched from a service, we can cache them in memory within the Notification Service. Another caching aspect is for the read path: when a user fetches notifications, we could cache their last N notifications in memory (or use an in-memory store as a speed layer) to serve the data quickly, especially if they check frequently.
- Database Partitioning and Sharding: The data stores will be split to spread load. For example, we might shard the notification logs by user ID or by region. User-based sharding could put users with certain ID ranges or certain geographic regions into separate databases. This prevents any single DB from becoming a write hotspot for all 3B daily notifications. Time-based partitioning of logs is also useful – e.g., have a separate partition/table per day or per month, so that writes for the current day go into a hot partition while old partitions are mostly static (and can even be moved to cheaper storage). This improves insert performance and makes purging old data easier (just drop old partitions). For user preferences, sharding might not be necessary if using a scalable NoSQL store, but if using SQL, we could partition that by user as well. The key is to divide and conquer the data such that no single node or small cluster handles the full 3B/day load.
- Load Balancing at External Integrations: Ensure that calls to external providers (email/SMS gateways, push services) are also load balanced. For example, use multiple SMTP endpoints or multiple connections to the provider so that one slow connection doesn’t bottleneck the worker. If using cloud provider services (SES, FCM), they typically handle scaling on their side, but we must be mindful of their rate limits and possibly request increases in quota. For SMS, if using Twilio, we might use multiple phone numbers or messaging service pools to send in parallel at scale. If any single provider cannot handle our peak (e.g., an SMS vendor might throttle), we should integrate additional providers and split traffic, as mentioned earlier for reliability.