How GoDaddy Implemented a Multi-Region Event-Driven Platform at Scale

GoDaddy, a leading global provider of domain registration and web hosting services, has served over 84 million domains and 22 million customers since its establishment in 1997. Among its various internal systems, the Customer Signal Platform provides tooling to capture, analyze, and act on customer and product data to drive better business outcomes. With this platform, GoDaddy can track user visits and interactions on its website and use meaningful event data to improve its customer experience and overall business performance.

Nowadays, the Customer Signal Platform processes 400 million events every day. As GoDaddy expands its integrations, it aims to increase this number to 2 billion events per day in the near future.

When building the Customer Signal Platform, GoDaddy had three main requirements for the system architecture:

Minimize their operational load.
Scale automatically as traffic changes.
Provide high availability and ensure that all the customer signals are captured.

Amazon EventBridge Event Bus
After evaluating many options against their requirements, GoDaddy decided to implement the customer signal platform using Amazon EventBridge Event Bus. EventBridge Event Bus is a serverless event bus that helps you receive, filter, transform, route, and deliver events. Because EventBridge is serverless, it requires minimal configuration to get started and scales automatically—GoDaddy’s first two requirements were checked.

To comply with the third requirement, the solution needed to provide business continuity and ensure that no event is lost from the moment the client produces it until it gets to the platform to be analyzed. EventBridge Event Bus comes with many features that helped GoDaddy build their application with this requirement in mind.

The main feature that GoDaddy took advantage of was global endpoints. EventBridge global endpoints provide a reliable and simple way to improve the business continuity of event-driven applications. This new feature, added in 2022, allows customers to build a multi-Region event-driven application.

EventBridge Global Endpoints
Global endpoints allow you to configure a managed DNS endpoint in EventBridge, to which your applications will send events. Then you need to configure two custom event buses in two distinct AWS Regions. One is the primary Region, and the other is the failover, or secondary Region. The failover of events is decided based on the health indicated by an Amazon Route 53 health check. When the health check is healthy, the events are routed from the global endpoint to the custom event bus in the primary Region. And if the health check is unhealthy, then the global endpoint will send the events to the event bus in the secondary Region.

The simplest configuration for global endpoints is the active/archive configuration. This configuration provides business continuity and simplicity at the same time. The active/archive configuration defines two different Regions. The primary Region is where the application is deployed and all the business processes are happening. The archive Region is where only a custom bus is deployed and all the events are archived.

In addition, there is a bidirectional replication rule between the buses in separate Regions. In the normal case, when there are no errors, whenever an event arrives at the custom bus in the primary Region, the event is automatically replicated to the archive custom bus in the secondary Region.

In the case of failover, the global endpoint redirects the events to the secondary Region, where they get archived for processing at another time.

GoDaddy Implementation of Global Endpoints
GoDaddy was looking for a solution that minimized their operations load while still providing business continuity, and that is why they adopted global endpoints and the active/archive configuration. In this way, they could have the event processing logic in their primary Region and have a secondary Region in case of any issues.

In their configuration, events are archived in the secondary Region for 30 days, after which the events expire. In the case of a failover, because they don’t need to process the events in real time, they collect them in the archive. If the issue is resolved within 24 hours, the retention period for the replication rule, the events are sent automatically to the primary Region. If the issue is solved in more than 24 hours the events need to be replayed to the primary Region.

The following image shows what their current solution looks like. They are working with two Regions. US West (Oregon) is their primary Region and is the location of the data lake, which is the primary consumer of the events. US East (N. Virginia) is the secondary Region. Events are being produced in different clients; from the clients, they are sent to Amazon API Gateway. GoDaddy deployed two API Gateways in their two Regions. The events are sent to the API Gateway with the smallest latency from the client. To do that, they use latency-based routing provided by Amazon Route 53. Then events are sent to an AWS Lambda function that validates the events and forwards them to the EventBridge global endpoint at the DNS level.

The global endpoint is configured with the active/archive setup, and the failover is configured to be triggered via a Route 53 health check that monitors an Amazon CloudWatch alarm. That alarm observes the IngestionToInvocationStartLatency metric in the primary Region.

IngestionToInvocationStartLatency is a service-level metric that exposes the time to process events from the point at which they are ingested by EventBridge to the point the first invocation of a target in the configured rules is made. This metric is measured across all the rules in your bus and provides an indication of the health of the EventBridge service. Any extended periods of high latency over 30 seconds indicate a service disruption.

When the system is in the normal state, the events are forwarded from the global endpoint to the custom ingress event bus in the primary Region. That custom event bus has replication enabled; this means that all the events that arrive at the bus get replicated automatically in the secondary Region custom ingress event bus.

All the events received by the ingress event bus are sent to the enrichment function. This function performs basic validation and authentication, and it enriches the event data to make sure that all the events from different clients are standard.

From there, the events are forwarded to the data platform event bus to be sent to the different consumer targets. The main target is their data lake solution, which analyzes all the events.

What Was the Impact?
For GoDaddy, business continuity is important, and their customer signals are not getting lost due to any issue with their platform. This makes them confident that they can expand their customer signal platforms from 400 million events per day to 2 billion events per day without introducing any additional operations overhead.

Now, they can confidently process hundreds of millions of events per day to their system, and they can keep on growing. The following image shows the number of events ingested by global endpoints in a normal day.

While GoDaddy’s use of the active/archive pattern enables them to ensure they never lose any events, they’re already starting to see certain use cases where they want to minimize any delays in processing their events, even when service disruptions occur. Because they’re already replicating their events to a secondary Region, they can deploy their most critical consumers to both Regions and enable an active/active configuration for their mission-critical systems. Active/active configuration allows you to process parallel events in both the primary and secondary Regions, simplifying the processing of events even during disruptions and enabling business continuity.

The vision when building the Customer Signal Platform was to align with GoDaddy’s high bar for reliability, scalability, and maintainability and, at the same time, keep the platform self-service so that developers can focus on business needs. This led GoDaddy to choose Amazon EventBridge global endpoints and serverless technologies to build this solution.

GoDaddy Customer Signal Platform is an excellent example of what serverless technologies enable. By leveraging the cloud to handle as much of the undifferentiated heavy lifting as possible, GoDaddy has reduced the operational complexity of setting up an event bus for a multi-Region strategy, implemented failover mechanisms in the case of Regional distruptions, and ensured that events are not lost by enabling replication. Global endpoints active/archive configuration improves the availability of customer applications with the least amount of configuration changes.

If you want to get started with EventBridge global endpoints, you can check out this talk on event-driven applications. For a working demo on how to use EventBridge global endpoints for failover events, check out this Serverless Land repository.

— Marcia

Source