I first read about self-directed messages in David A. Chappell’s book Enterprise Service Bus. In it, these types of messages are illustrated using itinerary-based routing, where the message sender knows what steps the message needs to take to accomplish a complete business process (an instance of the Routing Slip pattern described by Hohpe et al.). At the same time, I was doing some security research on PKI, which led me to grid computing. Since then, I’ve been convinced that the key to scaling applications lies in the Internet model of highly-distributed computing. By combining grid computing and the Internet model of scalability with a generalization of itinerary-based routing, you end up with self-directed messages. This concept is the basis of the W3C’s WS Choreography Model.
What’s the Problem?
In real life, when you try to do too many things at the same time, a bottleneck develops. We’ve all been there. You have a list of things you’re supposed to do each week that always seems to get longer rather than shorter. The net effect is that the things other people are depending on you to do just don’t get turned around in a timely fashion. You become the one everyone is waiting on before they can continue with their own ever-increasing list of tasks.
This same problem is common in software systems. In fact, well-proven design patterns like Model-View-Controller and even the nature of the main() function in most programming languages can tend to encourage this behavior. The control of what your program is trying to accomplish needs to go somewhere, so it’s only natural to manage that control in a central place. This is why programs start at main() and why the Controller of MVC exists.
However, the history of software and hardware development has shown that eventually, you’re going to need to do more than one thing at a time. To do that, you need to split what you’re trying to do into parts that can be replicated. This is the reason for multi-CPU hardware, time-sharing operating systems and multi-threading. However, the problem still remains one of control: something needs to direct what happens when.
Who’s in Charge Here?
The traditional answer to handling this control is that something, whether a human, a process or a block of code, directs all of the other parts of the program. This type of control is called orchestration because, as in a symphony orchestra, the musicians are supposed to follow the conductor. They aren’t supposed to do anything on their own. Orchestration requires central control somewhere, but the activities can then be spread across multiple CPUs, processes or threads.
While this approach helps, a lot of time is still spent trying to coordinate the different pieces. The more pieces there are to control, the more effort it takes to control them. Picture trying to keep five kittens in the same box. The easy answer to the problem is vertical scaling: adding more resources so the controller can manage the additional tasks. This can only be a short-term solution, because it addresses the symptom rather than the core problem: how do I get more things accomplished more efficiently?
In software systems, this question is often asked, but no one likes the best answer. The best answer is to rethink how the solution works, but by this time, there’s normally a lot of time and money invested in building what’s already there. Ripping it apart to make it more efficient is generally not an option, so we’re back to growing more arms so we can keep more kittens in the box.
Lessons from Life
As most of us can’t grow more arms, one of two things generally happens:
- You change the rules of the game (put a lid on the box)
- You ask for help
Most of the time you can’t change the rules, so you end up asking for help. In the “real world” this is called delegation, and it’s a hard thing for people to do–especially for things where they’re very concerned about the outcome. Delegation only works if who or what you delegate to has the autonomy to work independently from you without excessive interruptions or micro-management. They can only do this if you provide them enough information for them to do what you want them to do.
In software systems, the core enabler of this type of delegation is building in support for asynchronous interactions. However, the use of asynchronous communication between components also implies event-driven behavior, which is an example of inversion of control (as explained by Martin Fowler). If you don’t have an event handling protocol, you don’t have an effective way to communicate with those you’re delegating to.
Unfortunately, lots of systems are still implemented based on a synchronous model. While sometimes it is desirable and necessary to use the Request-Reply pattern, it is also an easy way to kill your system performance by trying to apply synchronous thinking to an asynchronous environment. If all you are doing is RPC-style distributed computing, then you aren’t really delegating control anywhere else. You’re still directing the orchestra.
If you can combine delegation and event handling, you can spread the work around and only act when you need to. The controller in your system is no longer the bottleneck, because everyone participating in the system has an assigned task and a protocol that allows them to report their results efficiently.
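The combination of delegation and event handling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `worker` and `delegate` functions and the task names are hypothetical, and `asyncio.sleep` stands in for real work done by an independent participant.

```python
import asyncio


async def worker(name: str, delay: float) -> str:
    """A delegate that works on its own and reports a result when finished."""
    await asyncio.sleep(delay)  # placeholder for real work (an assumption)
    return f"{name} done"


async def delegate(tasks: dict[str, float]) -> list[str]:
    """Hand out every task up front, then act only when a result arrives."""
    pending = [asyncio.create_task(worker(name, delay))
               for name, delay in tasks.items()]
    results = []
    # as_completed yields each task as it finishes, so the delegator
    # reacts to completion events instead of waiting on tasks in order.
    for finished in asyncio.as_completed(pending):
        results.append(await finished)
    return results


results = asyncio.run(delegate({"a": 0.05, "b": 0.01}))
```

The delegator never blocks on one particular task, which is the difference from the Request-Reply style described above.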
Task description + asynchronous communication + event handling = choreography
In dance, the choreographer is the person who has the big picture in mind. They achieve a desired response from the audience by coordinating the movements of the dancers on the stage. Unlike the orchestra conductor, they don’t participate in the show. They have already done their work by planning it out and holding the rehearsals so that everyone knows their steps. When the curtain goes up, they’re watching from the wings.
Once the choreography is set, it can be executed many times without any interaction from the choreographer. This feature enables horizontal scaling because each instance of the choreography can be executed in parallel without putting any extra demands on the choreographer. By applying the lessons of the Internet and grid computing on how to build agents or processes that can execute these choreographies, and encoding the choreographies into the message, you get massive scalability.
It is important to note that the choreographies embedded within the message are not the process to be executed by the agent. Instead, they simply say what happens based on the outcome of the process. They are really state machines which need only relate a result code from a source to a destination. For example, it could be as simple as the following table specifying the URL for an agent, the response code and next destination:
- http://foobar.com/task1 : HTTP 200 ⇒ http://foobar.com/task2
- http://foobar.com/task1 : HTTP 500 ⇒ http://foobar.com/start
Of course, all participants need to share a common processing model of what these entries mean, what is supposed to be sent, and so on, but this is the basic idea.
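As a rough sketch, the table above could travel with the message as a simple lookup from (agent URL, result code) to the next destination. The `Message` class, the `next_destination` helper and the `order_id` payload are hypothetical names used for illustration only, and treating a missing entry as the end of the choreography is an assumption rather than part of any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    payload: dict
    # Embedded choreography: (agent URL, result code) -> next agent URL.
    choreography: dict[tuple[str, int], str]


def next_destination(message: Message, agent_url: str,
                     result_code: int) -> Optional[str]:
    """Look up where the message goes after an agent returns a result code.

    Returns None when no transition is listed (assumed to mean "finished").
    """
    return message.choreography.get((agent_url, result_code))


message = Message(
    payload={"order_id": 42},  # illustrative payload
    choreography={
        ("http://foobar.com/task1", 200): "http://foobar.com/task2",
        ("http://foobar.com/task1", 500): "http://foobar.com/start",
    },
)

on_success = next_destination(message, "http://foobar.com/task1", 200)
on_failure = next_destination(message, "http://foobar.com/task1", 500)
finished = next_destination(message, "http://foobar.com/task2", 200)
```

Each agent would perform this lookup itself after finishing its own work and forward the message, so no central party has to be consulted between steps.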
While this approach provides massive scalability because there is no point of central control, it also has several security concerns:
- How do you ensure the integrity of the choreography?
- How do you manage the accountability of the choreography?
- How do you track the progress of the choreography?
These are tough questions, but they need to be sufficiently considered before choreography-style services can be deployed. Item #1 (integrity) can be addressed through the use of digital signatures. Item #2 (accountability) is a people and organizational issue that needs to be solved outside of any particular technology.
Item #3 (progress tracking) has a number of solutions, potentially related to tracking requests for a remote choreography signature, but it is important not to sacrifice the efficiencies gained from using choreographies by sending status messages equivalent to “I’m now going to the next step; Ok, I’m here”. That approach will indirectly reintroduce all the overhead of an orchestrated service that you were trying to leave behind.
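To make the integrity point concrete, here is a minimal sketch of how an agent might detect a tampered choreography. Note the substitution: it uses an HMAC with a shared secret from the Python standard library as a stand-in, which is a message authentication code and not a true digital signature. A real deployment would more likely use an asymmetric signature scheme so that agents can verify a choreography without being able to forge one. The key, the helper names and the nested table layout are all assumptions.

```python
import hashlib
import hmac
import json


def canonical_bytes(choreography: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(choreography, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sign(choreography: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the choreography (signature stand-in)."""
    return hmac.new(key, canonical_bytes(choreography),
                    hashlib.sha256).hexdigest()


def verify(choreography: dict, key: bytes, tag: str) -> bool:
    """Check the tag in constant time before trusting the routing table."""
    return hmac.compare_digest(sign(choreography, key), tag)


key = b"demo-shared-secret"  # illustrative only; never hard-code real keys
choreography = {
    "http://foobar.com/task1": {
        "200": "http://foobar.com/task2",
        "500": "http://foobar.com/start",
    }
}
tag = sign(choreography, key)

# Simulate an attacker redirecting the success path to another host.
tampered = json.loads(json.dumps(choreography))
tampered["http://foobar.com/task1"]["200"] = "http://evil.example/steal"

untouched_ok = verify(choreography, key, tag)
tampered_ok = verify(tampered, key, tag)
```

An agent that receives a message would run the verification before following any transition and refuse to forward the message if it fails.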
I fully believe that the only way to achieve Internet-scale SOA is to follow the lessons available from implementing the Internet and World-Wide Web. The model is that there is no central point of control, there are agreed-upon communication protocols and data formats, and lots and lots of redundancy exists to provide both reliability and efficiency. Most of these end up being either completely or nearly transparent to the end user, regardless of whether they’re using ssh to access a server in Germany or using a Web browser to access Google.
The work being done by the W3C with WS-Choreography is a step in the right direction, but there are a number of hurdles before this model is going to materialize. The first and foremost is the mind shift that needs to take place so that architects and developers start thinking asynchronously. Until that happens, the way services are designed will produce more symphonies than ballets, and the conductors are going to need lots more arms to keep everything on track.