Sending emails is fundamental to every online business, and as with any system, there are several parameters we must keep in mind when designing one:
Satisfying all these parameters is an important prerequisite to delivering a world-class email system that can send millions of emails concurrently on behalf of millions of users, with high reliability and assurance of timely delivery.
Here, we will talk about how we at Atlogys revamped our email system design and architecture to create an ideal system, and how we paved our way to success.
Let’s take a tour of the old email system, which took 6 hours to send 1 lakh (100,000) emails and thus caused much upheaval.
Nothing fancy is happening here. It is a big monolithic structure working in a synchronous manner: a cron job picks up data from MySQL and MongoDB and sends out the emails. On top of that, the cron does all of this sequentially, one email at a time.
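The old flow can be sketched roughly like this (the function names, the fetch callable, and the per-email latency are illustrative assumptions, not the actual Atlogys code):

```python
import time

def send_email(recipient, body):
    # Stand-in for a real SMTP/provider call; assume roughly 0.2 s per email.
    time.sleep(0.2)

def old_cron(fetch_pending):
    # fetch_pending() stands in for the MySQL/MongoDB reads the cron performed.
    sent = 0
    for email in fetch_pending():               # sequential: one email at a time
        send_email(email["to"], email["body"])  # synchronous: blocks until done
        sent += 1
    return sent
```

At a latency like 0.2 s per email, 100,000 sequential synchronous sends take on the order of hours, which is consistent with the 6-hour figure above.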
Main problems with this approach:
- Email data is built in a sequential way
- The email-sending mechanism is synchronous
- A long queue of emails forces new emails to wait until the queue clears, further delaying delivery
- Heavy load on MySQL until the queue is empty
Requirements the new design must account for:
- Throughput – via parallelism; the heaviest mechanism should be asynchronous
- Fault tolerance – via a supervisor
- Availability – a multi-node system
- Latency – divide emails into groups depending on their type
- Reliability – no in-memory storage; a disk-based solution is needed
A perfect fit for these requirements was a RabbitMQ message bus with a pub-sub model.
We divided the email system into these parts:
- RabbitMQ High Availability Cluster
- Mongo Populator Cronjob
- Middle-Level Publisher Cronjob (MLP)
- Consumers – 3 for announcements & 1 for non-announcements
- Retry & Dequeue mechanism
- Healthcheck System
And our email system looks like this:
1. RabbitMQ High Availability Cluster
- Cluster: For this cluster setup we need at least 2 RabbitMQ nodes, all connected via a High Availability policy. This policy ensures that if one node dies, the remaining nodes elect a new master.
- Replication: The first node becomes the primary and the other node becomes the secondary, and both stay in sync at any given point. All exchanges, channels, and queues are kept in sync at all times; in case of lag in any exchange, channel, or queue, we are notified via the healthcheck system.
- Cluster health: A monitoring cron checks the lag between these nodes every minute, and if any lag is found, the tech team is notified via email. The monitoring crons do a number of other things as well.
- Durability: There are 2 options for storing data in RabbitMQ: on disk or in RAM. We store data on disk on both nodes, which increases durability, though it impacts performance slightly.
- Fault tolerance: Clustering itself decreases the chances of failure. If the primary node fails, the secondary becomes the primary and the healthcheck system notifies us by email about the failure. However, human intervention is needed to investigate the error and restart the failed RabbitMQ node.
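As a sketch, a classic mirrored-queue policy of this kind can be declared with rabbitmqctl (the policy name `ha-all` and the catch-all pattern are illustrative; the exact policy used in production may differ):

```shell
# Mirror every queue across all nodes in the cluster, and automatically
# synchronise a mirror that (re)joins after a failure.
rabbitmqctl set_policy ha-all ".*" \
  '{"ha-mode":"all","ha-sync-mode":"automatic"}' \
  --apply-to queues
```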
2. Mongo Populator Cronjob
This cronjob builds the data for emails by fetching records from MySQL and MongoDB and inserts it into the Mongo collection cron_emails. This cron is very fast and has a very high throughput.
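A simplified sketch of the kind of document this populator might assemble before inserting it into cron_emails (the field names and the `build_cron_email` helper are illustrative assumptions, not the actual schema):

```python
from datetime import datetime, timezone

def build_cron_email(user_row, template_doc):
    # user_row: a record fetched from MySQL; template_doc: fetched from MongoDB.
    # The merged document is what would be inserted into cron_emails.
    return {
        "to": user_row["email"],
        "subject": template_doc["subject"],
        "body": template_doc["body"].replace("{name}", user_row["name"]),
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
    }
```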
3. Middle-Level Publisher Cron (MLP)
To get optimum improvement on both the publisher and consumer side, we needed to optimize the 5 MySQL read queries we were executing per email: for 1 million emails, we were firing 5 million MySQL read queries. We created a new cron to optimize these MySQL queries.
This cron reuses existing code, but the MySQL queries are fired in chunks of 1,000, so for 1 million emails we fire 1,000 queries. The throughput of this cron is good, somewhere between 20k and 25k emails per minute.
Once this middle-level cron builds the data, it pushes it to RabbitMQ in the same 1,000-record chunks, and there are 2 options for publishing data to RabbitMQ.
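The chunked read-and-publish loop can be sketched like this (`load_batch` and `publish` are illustrative stand-ins for the real batched MySQL query and the RabbitMQ publish call):

```python
def chunked(ids, size=1000):
    # Yield successive chunks of at most `size` ids.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def publish_in_chunks(ids, load_batch, publish, size=1000):
    # load_batch(chunk) stands in for one batched MySQL query per chunk;
    # publish(docs) stands in for one RabbitMQ publish per chunk.
    queries = 0
    for chunk in chunked(ids, size):
        docs = load_batch(chunk)  # batched reads instead of per-email queries
        publish(docs)
        queries += 1
    return queries
```

With 1,000,000 ids and size=1000, this loop issues 1,000 batched queries, which is the figure quoted above.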
4. Consumers
Consumers pick up data in FIFO order, so the oldest data published to RabbitMQ gets picked first. A consumer receives a MongoID from RabbitMQ, fetches the corresponding JSON document from MongoDB, and sends the email. A consumer should not run any MySQL query if we want real performance.
So the idea is to avoid any MySQL queries in the consumers; we shifted them from the consumers to the Middle-Level Publisher (MLP).
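A consumer callback in this design touches only MongoDB and the mailer, never MySQL. Here is a sketch with the Mongo lookup and the send call injected as plain callables (the names and return values are assumptions for illustration):

```python
def handle_message(mongo_id, fetch_doc, send_email):
    # fetch_doc(mongo_id) stands in for a Mongo findOne on cron_emails;
    # send_email(doc) stands in for the actual SMTP/provider call.
    # Note: no MySQL access here; all relational reads happened in the MLP.
    doc = fetch_doc(mongo_id)
    if doc is None:
        return "missing"   # nothing to send; ack so it is not redelivered
    send_email(doc)
    return "sent"
```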
5. Retry & Dequeue Mechanism
If a consumer cron fails to send an email to some user, there will be retries for that user. We attempt delivery up to 5 times, and if the email is delivered successfully during a retry, we treat that entry as successful and mark its status as success in Mongo.
Dequeue: If an email does not go out even after 5 attempts, we save its JSON into the Mongo collection “dequed_mails”. A cronjob checks for dequeued emails every 2 hours; if it finds any record in this collection, the dev team is notified and manual intervention is needed.
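The retry-then-dequeue policy can be sketched as follows (`send`, `mark_success`, and `dequeue` are illustrative stand-ins for the provider call and the two Mongo writes):

```python
def deliver_with_retry(email, send, mark_success, dequeue, max_attempts=5):
    # send(email) raises on failure; mark_success(email) writes status=success
    # in Mongo; dequeue(email) saves the JSON into the dequed_mails collection.
    for attempt in range(1, max_attempts + 1):
        try:
            send(email)
        except Exception:
            continue            # retry, up to max_attempts times in total
        mark_success(email)
        return attempt          # which attempt succeeded
    dequeue(email)              # after 5 failures, park for manual review
    return None
```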
6. Healthcheck System
To check whether the RabbitMQ cluster is working fine, we leverage the HTTP API provided by RabbitMQ.
We check these conditions of the system:
- Whether the node(s) are up or down
- Whether there is any lag between the two nodes
- Whether all queues are available and the queue counts match on both nodes
- Whether channels/exchanges are available on both nodes
- Publish a dummy message to a new queue and consume it immediately, to check that publish/subscribe operations are working fine
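These checks map onto RabbitMQ's management HTTP API (e.g. GET /api/nodes and /api/queues, and /api/aliveness-test/{vhost} for the publish/consume probe). A sketch of the decision logic, applied to already-fetched JSON so it can be shown without a live broker (the function name and the shape of the queue lists are assumptions):

```python
def cluster_problems(nodes, queues_a, queues_b):
    # nodes: parsed JSON from GET /api/nodes; queues_a / queues_b: queue-name
    # lists read from each node. Returns human-readable problem descriptions
    # that the healthcheck cron would email to the tech team.
    problems = []
    for node in nodes:
        if not node.get("running", False):
            problems.append(f"node down: {node['name']}")
    if set(queues_a) != set(queues_b):
        problems.append("queue lists differ between nodes")
    return problems
```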
We hope this level of technical detail sheds light on the attention to detail and the complexity of the architecture that was designed at Atlogys to deliver this system.
The system is used in a Lin type progressive web platform serving 6 million scientists all over the world.
For a consulting session with our core technologists who helped design and build this system at Atlogys, please reach out to us: schedule a 30-minute strategy consulting session with our IT experts.