Engineering for Regional Resilience
Written by Zach Probst and published on March 27th, 2021
At Resilient Vitality, we believe in resilience. Not just resilience of mind and body, but resilience of software architecture. Read on to learn about the design of our services and how we keep Resilient Vitality up and running in our first tech blog post.
One is None, Two is One
Most people have heard some variation on the expression "One is None, Two is One". The idea is simple. Things break, get lost, or degrade. Some of those things are not easy to replace or cannot be out of commission for long. So what do you do? Have another one. Have you ever taken two or more pens to a meeting? Ever taken a spare shirt to a BBQ? It's the same thing with software.
In modern cloud-based applications, software is deployed into logical cloud "regions" that represent one or more data centers located in the same approximate geographic area. Here at Resilient Vitality, we use Amazon Web Services and its us-east-1 and us-west-2 regions.
Front End Experiences
We deliver our front end assets and compute "at edge". This means that we use Amazon's CloudFront content delivery network and Lambda@Edge technologies to leverage more than 100 points of presence around the world. Whenever a request is made to resilientvitality.com, it is routed to the location closest to that user. From there we deliver the images, styles, and text that make up this very webpage.
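If you're curious what "compute at edge" can look like in practice, here's a minimal sketch of a Lambda@Edge viewer-request handler. It assumes a single-page app where any path without a file extension should be answered with index.html; the handler and the rewrite rule are illustrative, not our actual edge code.

```typescript
// Minimal sketch of a Lambda@Edge "viewer-request" handler.
// Assumption: a single-page app whose routes (e.g. /dashboard) should all be
// served the same index.html from the nearest CloudFront point of presence.
import type { CloudFrontRequestEvent, CloudFrontRequest } from "aws-lambda";

export const handler = async (
  event: CloudFrontRequestEvent
): Promise<CloudFrontRequest> => {
  const request = event.Records[0].cf.request;

  // Requests for real static assets (styles, images, scripts) pass through
  // untouched; anything without a file extension is treated as an app route.
  const looksLikeAsset = /\.[a-zA-Z0-9]+$/.test(request.uri);
  if (!looksLikeAsset) {
    request.uri = "/index.html";
  }

  // Returning the (possibly rewritten) request lets CloudFront continue on to
  // its cache and origin as usual.
  return request;
};
```

Because this function runs at the point of presence itself, the rewrite happens before the request ever crosses the country to an origin server.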
Routing Backend Traffic
Those front end experiences need to get data from the Resilient Vitality API to power our products. Here is where you run into a lot of issues when it comes to making sure that user data is always accessible. What happens if one region goes down? If your application is only in one region, you have nowhere to go. You've put all your eggs (or bits, in this case) in one basket. So what's the solution? Multiple baskets, of course.
You might think to yourself: why can't you just do what you did before and have everything distributed to 100+ locations around the world? It's because the data we deliver from those locations, such as images and text, is relatively static. If you refresh the Resilient Vitality homepage 100 times, it's going to look the same for every one of those requests. However, if you were to refresh a Vela Community or Carina Practice Dashboard 100 times, chances are people have posted or changed something that makes those pages different for each request. It's not practical to replicate all of that data to hundreds of locations.
So it seems like we're pretty much stuck, right? Well, hundreds of locations were too many, but one isn't enough either. As we mentioned before, we run our services out of two AWS regions: us-east-1 and us-west-2. When a request comes in, it goes to our backend in the region that is fastest for you. Remember those 100+ points of presence we talked about before? We measure the time it takes to make a round trip from each point of presence to each backend region. When a user makes a query against our backend, we use that information to route them to the fastest region possible.
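One common way to implement "fastest region" routing on AWS is Route 53 latency-based routing: you create one DNS record per region under the same name, and Route 53 answers each query with the record for the lowest-latency region. We haven't spelled out our exact mechanism here, so treat the sketch below as one way to do it; the hosted zone ID, record name, and backend hostnames are placeholders, not our real values.

```typescript
// Sketch of latency-based DNS routing with Route 53 (AWS SDK for JavaScript v3).
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

const route53 = new Route53Client({});

// Two latency records sharing one name. For each DNS query, Route 53 answers
// with the record whose region has the lowest measured latency to the caller.
export async function upsertLatencyRecords(): Promise<void> {
  await route53.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z0000000EXAMPLE", // placeholder hosted zone
      ChangeBatch: {
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "api.resilientvitality.com",
              Type: "CNAME",
              SetIdentifier: "us-east-1", // distinguishes the two records
              Region: "us-east-1",        // opts this record into latency routing
              TTL: 60,
              ResourceRecords: [{ Value: "api-east.example.com" }], // placeholder backend
            },
          },
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "api.resilientvitality.com",
              Type: "CNAME",
              SetIdentifier: "us-west-2",
              Region: "us-west-2",
              TTL: 60,
              ResourceRecords: [{ Value: "api-west.example.com" }], // placeholder backend
            },
          },
        ],
      },
    })
  );
}
```

The SetIdentifier is what lets two records share one name, and the Region field is what turns them into latency records rather than ordinary CNAMEs.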
You might ask: so what happens if one region is down? That was the whole point, after all. Extra speed is nice, but it's not really what we are talking about here. And you are right. A region can go down for multiple reasons. First, a deployment for one component (or service) of our backend could have gone wrong. Another, although rare, is that AWS could be having a bad day. Either way, if our backend is having trouble in one region or another, we simply take that region out of consideration in the routing procedure we just described. So the description of what happens evolves from "we use that information to route them to the fastest region possible" to "we use that information to route them to the fastest healthy region possible."
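The "healthy" part is typically expressed as a health check attached to each regional record: if the check fails, that region simply stops being handed out as an answer. Here's a minimal sketch, again with the AWS SDK v3. The /healthz path and the function name are hypothetical.

```typescript
// Sketch of a per-region Route 53 health check.
import {
  Route53Client,
  CreateHealthCheckCommand,
} from "@aws-sdk/client-route-53";

const route53 = new Route53Client({});

// Route 53 probes the endpoint from multiple locations. If the check fails,
// any latency record referencing its id is withdrawn from DNS answers, so
// traffic only flows to healthy regions.
export async function createRegionHealthCheck(regionalEndpoint: string) {
  const response = await route53.send(
    new CreateHealthCheckCommand({
      CallerReference: `health-${regionalEndpoint}-${Date.now()}`, // must be unique per request
      HealthCheckConfig: {
        Type: "HTTPS",
        FullyQualifiedDomainName: regionalEndpoint, // per-region API hostname
        ResourcePath: "/healthz",                   // hypothetical health endpoint
        RequestInterval: 30,                        // seconds between probes
        FailureThreshold: 3,                        // consecutive failures before "unhealthy"
      },
    })
  );
  return response.HealthCheck?.Id;
}
```

To wire it up, the returned id would be set as HealthCheckId on the matching latency record from the earlier sketch; Route 53 then stops answering with a record whose health check is failing.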
One final piece of the puzzle: how do we get the data from one region to the other in the event of a failure? The answer is simple in concept and quite tricky in practice, so we'll leave the details for another time. The short version is that we constantly replicate data from one region to the other. In the event of a regional failure, we have all of the data ready to go in the remaining region. This same behavior handles another tricky problem. As you might have guessed, us-east-1 and us-west-2 are on the east and west coasts respectively. So what happens if you are in the middle of the country? Couldn't you kind of bounce between one region and the other? The answer is: probably, yes. Having the data replicated in both regions means that case is handled too.
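As a concrete illustration of what "constantly replicate data from one region to the other" can look like, here's a sketch that turns a DynamoDB table into a global table with a replica in the second region. We haven't said which data stores we actually use, so treat the choice of DynamoDB and the table name purely as assumptions for the example.

```typescript
// Sketch: add a cross-region replica to a DynamoDB table (global tables).
import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";

// Run against the region that already owns the table. The table needs
// DynamoDB Streams (new and old images) enabled before a replica can be added.
const dynamo = new DynamoDBClient({ region: "us-east-1" });

export async function addWestCoastReplica(): Promise<void> {
  await dynamo.send(
    new UpdateTableCommand({
      TableName: "community-posts", // hypothetical table name
      ReplicaUpdates: [
        // Adding a replica makes the table a global table: writes in either
        // region are replicated to the other, usually within seconds, so a
        // failover (or a user bouncing between regions) still sees their data.
        { Create: { RegionName: "us-west-2" } },
      ],
    })
  );
}
```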