Distributed System Design Patterns
Most software services built today (e.g. web services) follow a similar logical pattern:
- They receive a request from a client.
- They fetch external data from other systems as necessary to fulfill the request.
- They combine this external data with any persistent state the service may maintain to form the request context.
- They compute some function of the request context and output the result, either as a web response or as a write to some storage.
This same pattern could describe a class in an object-oriented program, as well as a web API. This pattern is also inherently stateful. Indeed one of the main points of object-oriented programming and service-oriented architectures is that object or system state is hidden behind an API, and the client should not know or care about this hidden state. This pattern is evolving with the increased use of serverless architectures, where explicit handling of state in a persistent external cache is becoming more standard [1].
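The four steps above can be sketched as a single handler function. This is a minimal illustration, not any real service's code; all names (`handle_request`, `orders_db`, `fetch_profile`) are hypothetical.

```python
# Illustrative sketch of the common service pattern described above.

def handle_request(request, orders_db, fetch_profile):
    # 1. Receive a request from a client.
    user_id = request["user_id"]
    # 2. Fetch external data from other systems as necessary.
    profile = fetch_profile(user_id)
    # 3. Combine it with the service's hidden persistent state
    #    (orders_db here) to form the request context.
    context = {"user": profile, "orders": orders_db.get(user_id, [])}
    # 4. Compute a function of the context and output the result.
    return {"name": context["user"]["name"],
            "order_count": len(context["orders"])}

response = handle_request({"user_id": "u1"},
                          {"u1": ["o1", "o2"]},
                          lambda uid: {"name": "Ada"})
```

Note that the caller never sees `orders_db` directly; the state stays behind the function's interface, which is the point made above about hidden state.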
Batch data-processing pipelines typically follow a different pattern, where the system starts with no state and builds up state implicitly by performing a series of stateful operations (e.g. joins and aggregations) on input data. When the pipeline finishes, any state that might be necessary for the next execution is saved externally. System state is not hidden, instead it is explicitly an input and an output of the system.
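The contrast with the service pattern can be made concrete: here state is an explicit input and output of each run, saved externally between executions. The record shapes are made up for illustration.

```python
# Sketch of the batch pattern: state is passed in, built up through a
# stateful operation (an aggregation), and returned to be saved.

def run_batch(prev_state, input_records):
    # Start from the state saved by the previous execution.
    counts = dict(prev_state)
    for record in input_records:
        key = record["key"]
        counts[key] = counts.get(key, 0) + record["value"]
    # Any state needed by the next execution is an explicit output.
    return counts

# Two "executions" of the pipeline, with state carried between them.
state = run_batch({}, [{"key": "a", "value": 1}, {"key": "a", "value": 2}])
state = run_batch(state, [{"key": "b", "value": 5}])
```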
Continuous streaming applications (like those built using Flink, Kafka Streams, Spark Streaming, etc.) are a hybrid of these patterns. They start with an empty state and build up their state by consuming data, but they don’t explicitly write their state back onto streams; instead state storage is handled by the streaming compute infrastructure. The streaming platform manages operator state and uses it to recover failed nodes or restore the application from a previous state.
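A toy operator can illustrate the hybrid: the program manipulates its state directly, but snapshotting and restoring that state is the platform's job, not the program's. The `snapshot`/`restore` methods below stand in for what a streaming runtime does internally; they are not any real framework's API.

```python
# Sketch of the streaming pattern: operator state is built up by
# consuming events, and a (toy) platform checkpoints it for recovery.

class CountOperator:
    def __init__(self):
        self.state = {}  # operator state, managed by the platform

    def on_event(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]

    def snapshot(self):
        # Called by the platform to checkpoint state.
        return dict(self.state)

    def restore(self, snap):
        # Called by the platform to recover a failed node.
        self.state = dict(snap)

op = CountOperator()
op.on_event("a")
op.on_event("a")
checkpoint = op.snapshot()       # taken by the platform, not the program

restored = CountOperator()       # a "recovered" node
restored.restore(checkpoint)
third = restored.on_event("a")   # continues from the checkpointed count
```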
These three different types of systems use very different programming models. For services, programmers generally write imperative-style code in an OO language like Java or C#. For batch data pipelines, programmers generally use a declarative language like SQL or a map-reduce framework like Hadoop. Streaming applications use a hybrid approach: the streaming graph is declared, but the operations themselves are implemented using imperative-style functions that directly manipulate operator state.
There are good explanations for why these different types of systems have different programming models.
Services are usually concerned with latency and reliability. They pull in additional information from heterogeneous sources like other services or databases, and employ explicit caching mechanisms on top of these external data sources. Fully optimizing latency and reliability in such an environment requires fine control over the internal logic. The types of things that programmers worry about include “making sure my clients still get a response when I have a service outage in one region,” and “ensuring I meet my SLA of 100ms response time 99.9% of the time.”
Batch data pipelines are generally concerned with throughput and scale. Data sources are generally homogeneous (from the point of view of the programmer) and the possible operations are constrained by the language. These constrained operations have been highly optimized over many decades, so the vast majority of programmers don’t need to bother with the implementation details of these operations, only with their behavior. The types of things that programmers worry about include “ensuring I don’t have data skew across my join keys,” and “partitioning and indexing my input data to improve parallelism.”
Streaming applications are concerned with a mix of latency, reliability, throughput, and scale. They generally don’t have to worry about heterogeneous data sources within the program itself. While they do have to explicitly manage their internal state, the streaming platform handles recovery and restoration from a previously saved state.
But do all of these different types of systems really require different programming models?
All of the system types above perform the same high-level operations:
- They have an incoming source of items that require some action.
- They pull in additional data required to perform a given action.
- They combine this data with their internal state to create the action context, and perform the action.
- They output the result of the action.
It’s not a huge leap to reason that the behavior of these different types of systems could be described using a single language. One programming paradigm that would work as a descriptor of the behavior of these systems is data-oriented programming [2]. Data-oriented programming builds on top of the dataflow programming [3] concept which has been around for decades, and appears more recently in the programming models for Apache Beam and Apache Flink. The difference between dataflow programming and the data-oriented programming paradigm I am describing is that in dataflow programming data relationships are implemented, whereas in pure data-oriented programming data relationships are declared.
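The implemented/declared distinction can be shown in miniature. Below, the same relationship between an input and an output appears twice: once as an opaque host-language function (the dataflow style), and once as a plain data structure that tools can inspect and interpret (the declared style). The expression encoding is invented for this sketch and has nothing to do with Milan IL's actual form.

```python
# Dataflow style: the graph may record that two streams are connected,
# but the semantics of the connection live only in host-language code.
def implemented_relation(x):
    return x * 2 + 1  # opaque to any tool inspecting the program

# Declared style: the same relationship as an inspectable expression tree.
declared_relation = ("add", ("mul", "input", 2), 1)

def evaluate(expr, value):
    # A tiny interpreter for the declared form.
    if expr == "input":
        return value
    if isinstance(expr, int):
        return expr
    op, a, b = expr
    a, b = evaluate(a, value), evaluate(b, value)
    return a + b if op == "add" else a * b
```

Because `declared_relation` is just data, a tool can answer questions like "does this output depend on the input?" without running anything; no such analysis is possible on `implemented_relation` short of inspecting Python bytecode.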
To help explain the difference and why it can be important, I’ll start by introducing Milan.
Milan (https://github.com/amzn/milan) is a data-oriented programming language and runtime infrastructure. Using Milan, programmers build up their program by declaring relationships between data streams. Data streams in Milan simply represent potentially unbounded, somewhat ordered sets of objects; they do not impose any further structure. The programmer does not need to know the source of the data when writing a program, only the type of the objects on the stream.
Milan has three components:
- A general-purpose stream algebra that encodes relationships between data streams (the Milan Intermediate Language, or Milan IL).
- A Scala library for building programs in that algebra.
- A compiler that takes programs expressed in Milan IL and produces a Flink application that executes the program.
Component (2) can be extended to support interfaces in additional languages, and component (3) can be extended to support additional runtime targets. Considering just the multiple interfaces and the multiple runtimes, Milan looks a lot like the much more mature Apache Beam. The difference lies in (1), Milan’s general-purpose stream algebra [4].
A Milan program is represented by an intermediate language that is independent from the host language used to create the program, independent from the execution platform, and independent from the physical source of the input data streams. This means that the same Milan program can be executed against true data streams like those from Kafka or Kinesis, against static tables, by listening to web requests, or a combination of all of these. It also means that a Milan program can be manipulated from a language other than the host language used to create it. For example one could write a Milan program in Scala and then modify it later in Python.
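To see why source independence follows from having a pure-data intermediate representation, consider a toy IL: the program below never mentions where its records come from, so the same program runs unchanged against a static table or a stream. The operation encoding is invented for this sketch and is not Milan IL.

```python
# A "program" as pure data: keep records whose fare exceeds 10,
# then project out the fare field.
program = [("filter", "fare", ">", 10), ("project", "fare")]

def run(program, records):
    # `records` can be any iterable: a static table, a generator fed by
    # a socket, etc. The program itself never names the source.
    for rec in records:
        keep = True
        for op in program:
            if op[0] == "filter":
                _, field, cmp, threshold = op
                if cmp == ">" and not rec[field] > threshold:
                    keep = False
                    break
            elif op[0] == "project":
                rec = rec[op[1]]
        if keep:
            yield rec

table = [{"fare": 5}, {"fare": 12}]   # a static "table"

def stream_source():                  # the same program over a "stream"
    yield {"fare": 20}

batch_result = list(run(program, table))
stream_result = list(run(program, stream_source()))
```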
The Milan IL representation of a “pure” [5] Milan program contains all of the information necessary to understand the exact relationships between data objects being described. This is in contrast to Beam or Flink where the pipeline or graph may tell you that two streams are somehow connected, but the exact semantics of that connection are only described in Java, Python, or Go code.
Furthermore the stream operations in Milan IL are all derived from a few fundamental operations. This makes it theoretically possible to develop algorithms to reason about and prove things about privacy or security of a program in Milan IL, or the provenance of its output data. These proofs can then carry over to the (much messier) runtime compiled version of the same program. Indeed Milan already supports a barebones implementation of automated lineage tracking for any data object it produces.
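Lineage tracking of the kind mentioned above falls out naturally once every operation is declared: each output value can carry the set of input identifiers it was derived from. The sketch below illustrates the idea only; it is not Milan's actual mechanism.

```python
# Toy lineage tracking: values travel as (value, lineage) pairs,
# where lineage is a set of source-record ids.

def tracked_map(f, items):
    # A map preserves each item's lineage unchanged.
    return [(f(v), lineage) for v, lineage in items]

def tracked_join(left, right):
    # A joined record's lineage is the union of its parents' lineages.
    return [(lv + rv, llin | rlin)
            for lv, llin in left
            for rv, rlin in right]

left = [(1, {"L0"}), (2, {"L1"})]
right = [(10, {"R0"})]
joined = tracked_join(tracked_map(lambda x: x * 2, left), right)
```

Because the operations themselves propagate lineage, every output's provenance is available without the programmer writing any tracking code.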
The big questions are then:
- Is this distinction important in practice?
- Is it possible to create an abstraction like Milan that is actually useful in enough real-world situations?
I think the answer to both questions is yes, but we won’t know unless we try, and that’s what Milan is all about.
The Boda Boda Sample
We provide a few sample applications with Milan, the most full-featured of which is the Boda Boda Taxi App. This application imagines a ride-hailing app for boda boda motorcycle taxis which are prevalent in East Africa [6].
The sample implements the logic of:
- Monitoring the location and status/availability of drivers,
- Consuming ride requests,
- Matching ride requests to an available driver,
- Tracking the progress of rides, and
- Computing a simple “waiting time” statistic for each ride.
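The last item in the list, the waiting-time statistic, can be sketched as a fold over ride events: the wait is the gap between a ride's request and its pickup. The event shapes here are made up for illustration and do not match the sample's actual record types.

```python
# Sketch of the "waiting time" statistic: time from ride request
# to the matched driver's pickup, per ride.

def waiting_times(events):
    requested = {}  # ride_id -> request timestamp
    waits = {}      # ride_id -> waiting time
    for ev in events:
        if ev["type"] == "request":
            requested[ev["ride_id"]] = ev["time"]
        elif ev["type"] == "pickup":
            waits[ev["ride_id"]] = ev["time"] - requested[ev["ride_id"]]
    return waits

events = [
    {"type": "request", "ride_id": "r1", "time": 100},
    {"type": "pickup",  "ride_id": "r1", "time": 130},
]
waits = waiting_times(events)
```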
There are a few fundamental unsolved issues in Milan that are surfaced in this sample, including:
- How to implement atomic resource allocation (in this case allocating drivers to rides) in an asynchronous streaming architecture. This is more straightforward in Flink (and now Beam) because programmers can directly manipulate program state.
- How to represent cycles (such as a driver allocation event being fed back into the driver status handler) in the streaming graph, which is currently a DAG.
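To make the first issue above concrete, here is the allocation problem in the explicit-state style the text attributes to Flink: a single operator owns the set of available drivers, so allocation is trivially atomic. Expressing the same guarantee declaratively, without an operator that owns mutable state, is the open question. All names below are hypothetical.

```python
# Sketch of atomic driver allocation via explicit operator state.

class DriverAllocator:
    def __init__(self):
        self.available = set()  # state owned solely by this operator

    def on_driver_available(self, driver_id):
        self.available.add(driver_id)

    def on_ride_request(self, ride_id):
        # Atomic because this operator is the sole owner of `available`:
        # no other code can allocate the same driver concurrently.
        if self.available:
            driver = self.available.pop()
            return (ride_id, driver)
        return None  # no driver free; the request must wait

alloc = DriverAllocator()
alloc.on_driver_available("d1")
match = alloc.on_ride_request("r1")
unmatched = alloc.on_ride_request("r2")
```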
The boda boda app will provide a platform for us to explore future directions for the language and infrastructure, such as:
- Hybrid streaming/real-time API applications, where some data inputs are streams or tables and others arrive as web requests.
- How to declaratively represent decision-making processes such as the resource allocation problem mentioned above.
- How to represent predictive models (e.g. machine learning) in the same declarative language, so that it’s possible to automatically reason about their inputs and their behavior.
- Automatic differentiation of programs, based on linearization of operator behavior at a given operating point.
Notes
[1] AWS Lambda, Flink’s Stateful Functions, and their ilk are essentially service-oriented architecture on steroids. They don’t provide an abstraction for how to compose functions together into an application, nor do they abstract away handling of system state from the programmer. They can be useful as the compilation target of a data-oriented programming language, though.
[2] The term data-oriented programming has been around for a while, but I’m repurposing it here. Previously it has referred to a programming paradigm used in high-performance applications such as games, where functions are written to read and manipulate shared long-lived objects in order to eliminate the overhead of creating many small or short-lived objects.
[4] The stream algebra in the current version of Milan is not final. We are actively working to give it a more consistent theoretical basis, and expect it to evolve somewhat over the coming months.
[5] Milan does allow “impure” programs that call into external library functions, for example from inside a Map or FlatMap operation. These external functions should be stateless, but Milan can’t enforce this.
[6] There’s nothing boda boda-specific in the application itself; we simply don’t want to appear to be moving in on the business of any existing ride-hailing apps.