Monday, February 23, 2015

Edge80 Automated Scaling

A central feature of Edge80 is its ability to scale rapidly to meet demand.

Edge80 servers run in data centers around the world. Traffic to Edge80 proxied websites is directed by latency-based routing to the nearest regional data center.

What happens when a website experiences rapidly rising demand, perhaps due to an advertising campaign reaching many people in a short time frame, a special event, or simply the start of a peak period in that region? Edge80 protects the original website and reduces latency at the same time by caching popular resources at the edge.

How do we prevent our own servers from becoming overloaded?

Basically, we must detect rising demand and launch more servers to handle it.


Who or what is responsible for scaling?

Alongside each Edge80 server we run a fairly general-purpose scheduler called the 'governor'. It periodically runs 'tasks' (implemented by dynamically imported Python classes).
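As an illustration (the names and details here are ours, not the actual Edge80 code), a governor-style scheduler can be tiny:

    import importlib
    import time

    def load_task(dotted_path):
        # e.g. 'governor.tasks.ArchiveLogs' -> an instance of that class
        module_name, class_name = dotted_path.rsplit('.', 1)
        return getattr(importlib.import_module(module_name), class_name)()

    def governor(task_paths, interval=10):
        state = {}  # shared between all the tasks (more on this below)
        tasks = [load_task(path) for path in task_paths]
        while True:
            for task in tasks:
                task.run(state)  # each task reads and may update the state
            time.sleep(interval)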

Could the governor have been implemented as a collection of 'cron' jobs? Perhaps, but one advantage of governor tasks is that they can share state (without complications such as constantly reading and writing to disk) and they can invoke each other (without having to check PID files to avoid running the same task more than once in parallel). Ad-hoc scheduling is also easier than it would be with cron: some tasks (such as doing a controlled shut-down) have multiple phases and timeouts.

One role fulfilled by the governor is to enable certain administrative tasks. An operator can remotely trigger a controlled shut-down. However, as a general philosophy,

Edge80 attempts to minimize centralized control. 

Yes, on some level there needs to be a big red button to say Stop! (or Go!), but the more day-to-day decisions such as scaling to match load can be taken automatically, the less work there is for an operator. And what, after all, is the point of using computers if not to automate things?

Automated scaling decisions could still be taken centrally but there is a clear advantage to decentralizing them: any central point of control would be a single point of failure, due to loss of connectivity, host failure, programmer error, etc. So the governor, running on each instance, is where we want to put the control.

We don't really want to control the governor. We want it to make decisions based on rules which are easy to modify; ultimately, we may even give it machine learning capabilities.

One nice idea in the world of knowledge-based systems is the Blackboard System. The governor's 'state' is nothing more than a Python dictionary shared between all the tasks. Tasks don't all need to have actions (like shutting down an instance, launching one, archiving logs, etc.). Some tasks can simply check for certain patterns in the state and add new entries if they are found.

Using the state dictionary as the means of communication between tasks also makes things very testable: complex rules can be implemented in tasks without actions, and tasks with actions, which might be highly inconvenient to test, can be simplified to the point where they hardly ever need to be changed.
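A sketch of the pattern (the key names and classes are hypothetical): a rule task inspects the state and records a conclusion, while the action task that consumes it stays almost trivially simple.

    class HighLoadRule(object):
        # No actions: just looks for a pattern in the state and adds an entry.
        def run(self, state):
            if state.get('reactor_busy', 0.0) > 0.65:
                state['scale_up_wanted'] = True

    class ScaleUpAction(object):
        # Has an action, but is simple enough to rarely need changing.
        def run(self, state):
            if state.pop('scale_up_wanted', False):
                launch_clone()  # e.g. the boto launch sketched in the next section

    # Testing the rule requires nothing more than a dictionary:
    state = {'reactor_busy': 0.7}
    HighLoadRule().run(state)
    assert state['scale_up_wanted']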

How do we scale?

At present Edge80 runs in AWS. Scaling down simply means doing a controlled shutdown of one or more instances. Scaling up is done by way of AWS API calls through the boto Python package. When we scale up, we launch a new instance from the AMI of the current instance, keeping most parameters the same. Essentially, this means the server replicates itself to better handle demand.
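For flavour, here is a minimal boto sketch (ours, not Edge80's actual code; a real launch would also copy security groups, key pair, placement, and so on, and the region is a placeholder):

    import boto.ec2
    import boto.utils

    def launch_clone(region='us-east-1'):
        # Ask the metadata service what this instance was launched from,
        # then launch a sibling from the same AMI at the same size.
        meta = boto.utils.get_instance_metadata()
        conn = boto.ec2.connect_to_region(region)
        reservation = conn.run_instances(
            image_id=meta['ami-id'],
            instance_type=meta['instance-type'],
        )
        return reservation.instances[0]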

Once you can create and destroy virtual hosts from code, you start to see that a virtual host is really just another object. The virtual-host-as-object perspective is interesting. One tends to think of hosts as things that need maintenance. Configuration management systems are great because they turn that maintenance into code and (simple) data and allow you to apply all kinds of useful simplifying abstractions. But

nobody 'maintains' objects in a programming language!

In fact, encapsulation is one of the key properties of object-orientation, and it has an important contribution to make to integrity and thus security. Conversely, virtual hosts create security problems for us when we keep them open for maintenance. Security is best served by opening only the ports required to do our work and closing the rest, including SSH.

Creating and destroying an entire multi-user operating system is really very inefficient when all we really want are a few processes with network access and a bit of disk space. There are many signs now that folks are aware of this issue and would like to change it.

The cloud is a kind of super-computer providing CPU, memory, storage and networking. Hypervisor technology and cloud APIs are where the capabilities and permissions that matter most now reside. Perhaps unikernels will soon take over from processes, and hopefully they will be complemented by flexible yet rigorous security models. I'm a fan of the so-called capability model but there seems to be progress in security now on many fronts.

When do we scale?

The obvious answer, using CPU thresholds, has some shortcomings. For one thing, any bug which causes a busy loop should be dealt with by killing the process, not by spawning a new instance. Some things, such as scheduled tasks, also cause temporarily high CPU usage, and these should not necessarily trigger scaling up either. If high load is caused by a DoS attack we need to apply a different remedy, such as blocking certain IPs or request types. This is where the flexible rule-based approach we have taken comes into its own.

Edge80 uses a Twisted reactor to process incoming requests. Each request or timeout causes the reactor's event loop to run. Consequently, with a little instrumentation, we are able to get accurate stats on how much time the reactor loop is busy versus how much time it is idle waiting for something to happen.
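A rough sketch of one way to get at this (again ours, not the production code; note that doIteration also dispatches ready I/O events, so a real measurement has to be more careful about what counts as idle):

    import time
    from twisted.internet import reactor

    busy_stats = {'idle': 0.0, 'since': time.time()}

    _doIteration = reactor.doIteration

    def timed_doIteration(timeout):
        # Time spent blocked here, waiting for an event, counts as idle.
        before = time.time()
        _doIteration(timeout)
        busy_stats['idle'] += time.time() - before

    reactor.doIteration = timed_doIteration

    def busy_fraction():
        # A real version would snapshot and reset per reporting interval.
        elapsed = time.time() - busy_stats['since']
        return 1.0 - (busy_stats['idle'] / elapsed) if elapsed else 0.0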

We graph and collect statistics on a whole bunch of different things: minimum request times, average request times, I/O, CPU load, etc. But currently our key statistic for scaling decisions is the percentage of time the reactor is busy. Experiments, and a little bit of queueing theory, suggest that running the reactor too close to 100% busy will result in dramatic increases in the time to process requests.
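For illustration, even the simplest textbook queue (M/M/1) shows the blow-up: mean response time is the mean service time divided by (1 - utilization). Assuming a 10 ms mean service time:

    service_time = 0.010
    for rho in (0.50, 0.65, 0.90, 0.99):
        t = service_time / (1.0 - rho)
        print('%2.0f%% busy -> %6.1f ms mean response' % (rho * 100, t * 1000))

    # 50% busy ->   20.0 ms mean response
    # 65% busy ->   28.6 ms mean response
    # 90% busy ->  100.0 ms mean response
    # 99% busy -> 1000.0 ms mean response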

We need to anticipate the need to scale before the situation is critical.

Scaling is done by region, because website traffic varies independently by region. Each governor instance determines how many other instances are running in the same region. To prevent the launching of too many instances in the event of a DoS attack, we set a reasonable upper limit on the number of instances per region. As our confidence in detecting DoS attacks grows, we will be able to increase that limit or perhaps remove it.

Decentralized scaling implies that we must keep at least one instance running in each region or there would be nothing to start from. When scaling down we always scale down the longest-running instance and wait a while before scaling down again: it is important not to accidentally shut down the last instance in a region! A deterministic rule to identify which server to scale down next allows the servers in a region to effectively coordinate their decisions without the overheads of direct coordination.
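A sketch of that rule (the tag filter is hypothetical): every governor lists the running instances in its region and, since they all sort the same data the same way, they agree on which instance should go first.

    import boto.ec2

    def i_am_next_to_scale_down(region, my_instance_id):
        conn = boto.ec2.connect_to_region(region)
        reservations = conn.get_all_instances(
            filters={'instance-state-name': 'running',
                     'tag:role': 'edge80'})
        instances = [i for r in reservations for i in r.instances]
        # launch_time is ISO 8601 text, so sorting it sorts chronologically
        instances.sort(key=lambda i: i.launch_time)
        # never volunteer if we are the last instance in the region
        return len(instances) > 1 and instances[0].id == my_instance_id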

Currently we use rules like "scale up when busy time goes over 50% for 5 minutes, or over 65% for a minute". Scaling up is spaced out a little to ensure that all servers are properly sharing load before another instance is spawned.
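As a sketch, the first of those rules could be a time-windowed refinement of the HighLoadRule above (again, the key names and the sampling task are assumptions of ours):

    import time

    class ScaleUpRule(object):
        # Assumes another task appends (timestamp, busy_fraction) samples
        # to state['busy_history'] at regular intervals, so a non-empty
        # window covers the whole period.
        def run(self, state):
            now = time.time()
            history = state.get('busy_history', [])

            def sustained(threshold, seconds):
                window = [b for (t, b) in history if now - t <= seconds]
                return bool(window) and min(window) > threshold

            if sustained(0.50, 300) or sustained(0.65, 60):
                state['scale_up_wanted'] = True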

We monitor all the vital statistics to ensure that this is sufficient to meet both normal and unusual demand and still guarantee good response times for all end users.