Asynchronous task execution at scale

November 12, 2020

Where asynchronous task execution fits into a larger system is something I’ve been pondering for a while. It is an interesting type of problem, which exists in an unusually strong intersection of operability, guarantees, and fundamental system design principles. Much like system design itself, two extremes bookend the spectrum of common solutions.

On one end, we treat asynchronous task execution as a purely local concern. In this model, every framework, subdomain, or even service implements its own asynchronous task execution that is just a part of how the subsystem runs. Operations equally become a local concern.

On the other end, asynchronous execution is a fundamental property of the entire system. We tightly integrate frameworks and use them as first-tier hosts of the system. If that description is too abstract, Celery is the Python world’s canonical example. This design is familiar to most as the de facto approach in most monoliths¹. While tightly coupled, it does offer some benefits as compared with the localized design. Most prominently, it makes centralizing operations and tooling natural².

In between these two extremes lies a myriad of possible designs. In reality, though, trade-offs become very nuanced. This work is therefore rarely released, and little has been published about this topic. That only makes it altogether more exciting that Dropbox yesterday published a design paper on their solution to this problem at scale. More exciting still, they went with my current personal local optima: a loosely coupled service that relies on well-defined boundaries rather than implementations to execute tasks as if they were any other type of invocation.

This design beautifully centralizes operations of scheduling and invocation while leaving decisions about and operations of execution localized — in other words, localizing operational expertise and system impact. Specific implementation details, such as the polling-heavy design, clearly result from a particular set of trade-offs. But, at a high level, I genuinely believe this design offers a clear improvement over most approaches I’ve seen in the wild, namely by separating concerns, both technically and organizationally. While nuances remain nuances, it would be truly exciting to see this shape of solution generalized well and open-sourced as a component in our shared toolbox.

Regardless, it’s incredibly exciting to see companies publish their solutions to very big design problems.

At times, this is the implementation approach in a system where asynchronous task execution is solved locally. The big difference is obviously that the implementation is not centralized. When blowing out a monolith, this is a piece that is often kept stable and just localized instead. Dozens of Celery clusters are a special kind of scary to me. ↩
Centralized along with everything else, of course. ↩