blü

WhatsApp is the most popular method of communication in Brazil, with over 150 million active users. The app is used not only to communicate with friends and family, but also to communicate with businesses.

Because of this, a common problem for businesses is that they let users subscribe with their phone number to hear about events and promotions, but then have no easy way to broadcast messages to that user base on a personal level, since WhatsApp doesn't offer this functionality out of the box.

To solve this challenge, I partnered with a business owner in the health insurance sector and a UX designer to define, create, launch, and maintain blü – a platform focused on empowering Brazilian businesses to communicate with their customers through WhatsApp.

Since launching in September 2022, blü has been adopted by over 100 businesses, with more than 10,000 monthly active users communicating with over 1 million customers a day on average.


problem overview

WhatsApp can be used in two main ways:

  • Using the mobile app
  • Using the WhatsApp Web interface

To extend the WhatsApp experience with the features we had in mind, we needed to interact with one of these platforms. Summarized:

Given the higher complexity of interacting with the mobile app, I opted for the web interface, which itself offered several possible approaches. After weighing the pros and cons, I ruled out a local solution in favor of a cloud-based one that users could easily access from their phones. The chosen approach is similar to the one recently used by Nothing to offer iMessage access:

"...it’s literally signing in on some Mac Mini in a server farm somewhere, and that Mac Mini will then do all of the routing for you to make this happen." source

From there, browser automation was the clear choice for a first iteration, given its simplicity and extensive documentation.

The decision: a cloud service using Node that interacts with WhatsApp via Puppeteer.
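A minimal sketch of that idea, using Puppeteer's standard API; the selector and profile handling are illustrative assumptions, not blü's actual implementation:

```ts
import puppeteer from "puppeteer";

// Open WhatsApp Web in headless Chromium and wait for the QR code,
// so a user can link their account to this cloud session.
async function startSession(userDataDir: string) {
  const browser = await puppeteer.launch({
    headless: true,
    // Persisting the Chromium profile keeps the WhatsApp session
    // alive across restarts, so the QR code is scanned only once.
    userDataDir,
  });

  const page = await browser.newPage();
  await page.goto("https://web.whatsapp.com", { waitUntil: "networkidle2" });

  // Illustrative selector: WhatsApp renders the QR code on a canvas,
  // though its markup changes frequently.
  await page.waitForSelector("canvas", { timeout: 60_000 });

  return { browser, page };
}
```

Each connected account gets its own browser session like this one, which is where the resource costs discussed below come from.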

first iterations - v0

Our first client, a healthcare clinic owned by one of the co-founders on our team, provided a real-world testing ground from the start. We built the frontend with Angular and used Firebase for hosting, database, and authentication.
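For context, the Firebase side of that stack is a standard modular-SDK setup; a minimal sketch with placeholder config values, not blü's actual code:

```ts
import { initializeApp } from "firebase/app";
import { getAuth } from "firebase/auth";
import { getFirestore } from "firebase/firestore";

// Placeholder config: real values come from the Firebase console.
const app = initializeApp({
  apiKey: "<api-key>",
  authDomain: "<project>.firebaseapp.com",
  projectId: "<project>",
});

// One SDK covers both concerns mentioned above:
export const auth = getAuth(app);    // authentication
export const db = getFirestore(app); // Firestore database
```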

Puppeteer's reliance on a full Chromium instance added significant CPU and RAM overhead. To address potential scaling issues, I explored serverless options like Google Cloud Functions and Google Cloud Run.

However, after experimenting with Cloud Functions for a few days, their short lifespan (9 minutes on 1st gen) made me discard them as an option. Cloud Run looked more promising, but its 1-hour timeout eventually became a bottleneck as well; that, combined with its high costs, led me to seek a dedicated server solution.

second iteration - v1

Cost-effective hosting was crucial. After comparing GCP, AWS EC2, and other providers, I chose Hetzner for its affordability and excellent customer support. This move to a dedicated server significantly improved our service stability.

As user numbers grew, we scaled vertically with Hetzner's higher-end machines. Eventually we reached their maximum tier and started expanding horizontally, ending up with four big servers behind a simple client-side load-balancing algorithm (sketched after the diagram below) that, surprisingly, met our needs for a considerable period.

Final system design during v1.
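That client-side algorithm can be as simple as hashing a stable identifier to one of the known servers, so every client independently arrives at the same answer with no coordination. A minimal sketch; the hash choice and server list are hypothetical, not blü's exact algorithm:

```ts
// Hypothetical server list; in practice this ships with the client.
const SERVERS = [
  "https://s1.example.com",
  "https://s2.example.com",
  "https://s3.example.com",
  "https://s4.example.com",
];

// Deterministic djb2-style string hash: the same account always maps
// to the same server, which is what keeps sessions stable.
function pickServer(accountId: string): string {
  let hash = 5381;
  for (let i = 0; i < accountId.length; i++) {
    hash = ((hash << 5) + hash + accountId.charCodeAt(i)) >>> 0;
  }
  return SERVERS[hash % SERVERS.length];
}
```

A known trade-off of naive modulo hashing is that changing the server list reshuffles most assignments; consistent hashing is the usual fix once servers come and go often.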

third iteration pt. 1 - v2

With a growing user base, I aimed to solve our scaling issues by switching from Puppeteer to a direct WebSocket connection to WhatsApp Web, removing the headless browser entirely (see the sketch below). I also planned to migrate from Firestore to a more cost-effective database, ultimately choosing AWS RDS Aurora for its familiarity.
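In outline, the idea is to speak to WhatsApp Web's socket directly instead of driving a browser that does it for you. The endpoint and framing below are placeholders (the real protocol involves its own binary encoding and encryption), so this is only a shape sketch using the ws library:

```ts
import WebSocket from "ws";

// Placeholder endpoint: the real WhatsApp Web socket address and its
// handshake are deliberately not reproduced here.
const socket = new WebSocket("wss://example.com/whatsapp-socket");

socket.on("open", () => {
  // In the real protocol, session negotiation and key exchange
  // happen here before any messages can be sent.
  socket.send("hello");
});

socket.on("message", (data) => {
  // Incoming frames get decoded and routed to the owning account.
  console.log("frame:", data.toString());
});
```

A single Node process can hold thousands of plain sockets like this, versus one full Chromium instance per account under Puppeteer.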

Given the small code base and my deep knowledge of it, I chose to rewrite the backend service from scratch and then transition users over gradually, rather than migrating the existing code piece by piece.

The new database and server setup cut our total costs by over 90%.

The number of users grew significantly over time - the month of November is still in progress.

third iteration pt. 2 - v2

As growth continued, I addressed new scaling challenges by running the Node.js backend in cluster mode with PM2, multiplying each server's capacity. To manage session persistence across processes, I put NGINX in front as a load balancer with sticky sessions, ensuring users stayed connected to the same process (a configuration sketch follows).
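A minimal sketch of what such an NGINX setup can look like, assuming each Node process listens on its own port and using ip_hash as the stickiness mechanism; ports, names, and the exact stickiness method are illustrative assumptions, not blü's actual config:

```nginx
# One upstream entry per backend process.
upstream blu_backend {
    ip_hash;  # same client IP -> same upstream, keeping sessions sticky
    server 127.0.0.1:3001;
    server 127.0.0.1:3002;
    server 127.0.0.1:3003;
    server 127.0.0.1:3004;
}

server {
    listen 80;

    location / {
        proxy_pass http://blu_backend;
        # Needed so browser WebSocket connections survive the proxy.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```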

This solution is holding up well, and we're prepared to scale horizontally if needed by adding more servers and utilizing the same load-balancing strategy.

Current system design.