Tuesday, 18 September 2012

Understanding ESB Performance & Benchmarking

ESB performance is a hot (and disputed topic). In this post I don't want to talk about different vendors or different benchmarks. I'm simply trying to help people understand some of the general aspects of benchmarking ESBs and what to look out for in the results.

The general ESB model is that you have some service consumer, an ESB in the middle and a service provider (target service) that the ESB is calling. To benchmark this, you usually have a load driver client, an ESB, and a dummy service.

+-------------+      +---------+      +---------------+
| Load Driver |------|   ESB   |------| Dummy Service |
+-------------+      +---------+      +---------------+

Firstly, we want the Load Driver (LD), the ESB and the Dummy Service (DS) to be on different hardware. Why? Because we want to understand the ESB performance, not the performance of the DS or LD.

The second thing to be aware of is that the performance results are completely dependent on the hardware, memory, network, etc used. So never compare different results from different hardware.

Now there are three things we could look at:
A) Same LD, same DS, different vendors ESBs doing the same thing (e.g. content-based routing)
B) Same LD, same DS, different ESB configs for the same ESB, doing different things (e.g. static routing vs content-based routing)
C) Going via ESB compared to going Direct (e.g. LD--->DS without ESB)

Each of these provides useful data but each also needs to be understood.

Metrics
Before looking at the scenarios, lets look at how to measure the performance. The two metrics that are always a starting point in any benchmark of an ESB here are the throughput (requests/second) and the latency (how long each request takes). With latency we can consider overall latency - the time taken for a completed request observed at the LD, and the ESB latency, which is the time taken by the message in the ESB. The ESB latency can be hard to work out. A well designed ESB will already be sending bytes to the DS before its finished reading the bytes the LD has sent it. This is called pipelining. Some ESBs attempt to measure the ESB latency inside the ESB using clever calculations. Alternatively scenario C (comparing via ESB vs Direct) can give an idea of ESB Latency. 

But before we look at the metrics we need to understand the load driver.

There are two different models to doing Load Driving:
1) Do a realistic load test based on your requirements. For example if you know you want to support up to 50 concurrent clients each making a call every 5 seconds on average, you can simulate this.
2) Saturation! Have a large number of clients, each making a call as soon as the last one finishes.

The first one is aimed at testing what the ESB does before its fully CPU loaded. In other words, if you are looking to see the effect of adding an ESB, or the comparison of one ESB to another under realistic load, then #1 is the right approach. In this approach, looking at throughput may not be useful, because all the different approaches have similar results. If I'm only putting in 300 requests a sec on a modern system, I'm likely to see 300 request a sec. Nothing exciting. But the latency is revealing here. If one ESB responds in less time than another ESB thats a very good sign, because with the same DS the average time per request is very telling.

On the other hand the saturation test is where the throughput is interesting. Before you look at the throughput though, check three things:
1) Is the LD CPU running close to 100%?
2) Is the DS CPU running close to 100%?
3) Is the network bandwidth running close to 100%?

If any of these are true, you aren't doing a good test of the ESB throughput. Because if you are looking at throughput then you want the ESB to be the bottleneck. If something else is the bottleneck then the ESB is not providing its max throughput and you aren't giving it a fair chance. For this reason, most benchmarks use a very very lightweight LD or a clustered LD, and similarly use a DS that is superfast and not a realistic DS. Sometimes the DS is coded to do some real work or sleep the thread while its executing to provide a more realistic load test. In this case you probably want to look at latency more than throughput.

Finally you are looking to see a particular behaviour for throughput testing as you increase load.
Throughput vs Load
The shape of this graph shows an ideal scenario. As the LD puts more work through the ESB it responds linearly. At some point the CPU of the ESB hits maximum, and then the throughput stabilizes.  What we don't want to see is the line drooping at the far right. That would mean that the ESB is crumpling under the extra load, and its failing to manage the extra load effectively. This is like the office worker whose efficiency increases as you give them more work but eventually they start spending all their time re-organizing their todo lists and less work overall gets done.

Under the saturation test you really want to see the CPU of the ESB close to 100% utilised. Why? This is a sign that its doing as much as possible. Why would it not be 100%? Two reasons: I/O, multi-processing and thread locks: either the network card or disk or other I/O is holding it up, the code is not efficiently using the available cores, or there are thread contention issues.

Finally its worth noting that you expect the latency to increase a lot under the saturation test. A classic result is this: I do static routing for different size messages with 100 clients LD. For message sizes up to 100k maybe I see a constant 2ms overhead for using the ESB. Suddenly as the message size grows from 100k to 200k I see the overhead growing in proportion to the message size.


Is this such a bad thing? No, in fact this is what you would expect. Before 100K message size, the ESB is underloaded. The straight line up to this point is a great sign that the ESB is pipelining properly. Once the CPU becomes loaded, each request is taking longer because its being made to wait its turn at the ESB while the ESB deals with the increased load.

A big hint here: When you look at this graph, the most interesting latency numbers occur before the CPU is fully loaded. The latency after the CPU is fully loaded is not that interesting, because its simply a function of the number of queued requests.

Now we understand the metrics, lets look at the actual scenarios.

A. Different Vendors, Same Workload
For the first comparison (different vendors) the first thing to be careful of is that the scenario is implemented in the best way possible in each ESB. There are usually a number of ways of implementing the same scenario. For example the same ESB may offer two different HTTP transports (or more!). For example blocking vs non-blocking, servlet vs library, etc. There may be an optimum approach and its worth reading the docs and talking to the vendor to understand the performance tradeoffs of each approach.

Another thing to be careful of in this scenario is the tuning parameters. Each ESB has various tuning aspects that may affect the performance depending on the available hardware. For example, setting the number of threads and memory based on the number of cores and physical memory may make a big difference.

Once you have your results, assuming everything we've already looked at is tickety-boo, then both latency and throughput are interesting and valid comparisons here. 

B. Different Workloads, Same Vendor
What this is measuring is what it costs you to do different activities with the same ESB. For example, doing a static routing is likely to be faster than a content-based routing, which in turn is faster than a transformation. The data from this tells you the cost of doing different functions with the ESB. For example you might want to do a security authentication/authorization check. You should see a constant bump in latency for the security check, irrespective of message size. But if you were doing complex transformation, you would expect to see higher latency for larger messages, because they take more time to transform. 

C. Direct vs ESB
This is an interesting one. Usually this is done for a simple static routing/passthrough scenario. In other words, we are testing the ESB doing its minimum possible. Why bother? Well there are two different reasons. Firstly ESB vendors usually do this for their own benefit as a baseline test. In other words, once you understand the passthrough performance you can then see the cost of doing more work (e.g. logging a header, validating security, transforming the message). 

Remember the two testing methodologies (realistic load vs saturation)? You will see very very different results in each for this, and the data may seem surprising. For the realistic test, remember we want to look at latency. This is a good comparison for the ESB. How much extra time is spent going through the ESB per request under normal conditions. For example, if the average request to the backend takes 18ms and the average request via the ESB takes 19ms, we have an average ESB latency of 1ms. This is a good result - the client is not going to notice much difference - less than 5% extra. 

The saturation test here is a good test to compare different ESBs. For example, suppose I can get 5000 reqs/sec direct. Via ESB_A the number is 3000 reqs/sec and via ESB_B the number is 2000 reqs/sec, I can say that ESB_A is providing better throughput than ESB_B. 

What is not  a good metric here is comparing throughput in saturation mode for direct vs ESB. 


Why not? The reason here is a little complex to explain. Remember how we coded DS to be as fast as possible so as not to be a bottleneck? So what is DS doing? Its really just reading bytes and sending bytes as fast as it can. Assuming the DS code is written efficiently using something really fast (e.g. just a servlet), what this is testing is how fast the hardware (CPU plus Network Card) can read and write through user space in the operating system. On a modern server hardware box you might get a very high number of transactions/sec. Maybe 5000req/s with each message in and out being 1k in size.

So we have 1k in and 1k out = 2k IO.
2k IO x 5000 reqs/sec x 8bits gives us the total network bandwidth of 80Mbits/sec (excluding ethernet headers and overhead).

Now lets look at the ESB. Imagine it can handle 100% of the direct load. There is no slowdown in throughput for the ESB. For each request it has to read the message in from LD and send it out to DS. Even if its doing this in pipelining mode, there is still a CPU cost and an IO cost for this. So the ESB latency of the ESB maybe 1ms, but the CPU and IO cost is much higher. Now, for each response it also has to read it in from DS and write it out to LD. So if the DS is doing 80Mbits/second, the ESB must be doing 160Mbits/second. 

Here is a picture.

Now if the LD is good enough, it will have loaded the DS to the max. CPU or IO capacity or both will be maxed out. Suppose the ESB is running on the same hardware platform as the DS. If the DS machine can do 80Mbit/s flat out, there is no way that the same hardware running as an ESB can do 160Mbit/s! In fact, if the ESB and DS code are both as efficient as possible, then the throughput via ESB will always be 50% of the throughput direct to the DS. Now there is a possible way for the ESB to do better: it can be better coded than the DS. For example, if the ESB did transfers in kernel space instead of user space then it might make a difference. The real answer here is to look at the latency. What is the overhead of adding the ESB to each request. If the ESB latency is small, then we can solve this problem by clustering the ESB. In this case we would put two ESBs in and then get back to full throughput.

The real point of this discussion is that this is not a useful comparison. In reality backend target services are usually pretty slow. If the same dual core server is actually doing some real work - e.g. database lookups, calculations, business logic - then its much more likely to be doing 500 requests a second or even less. 

The following chart shows real data to demonstrate this. The X-Axis shows increasing complexity of work at the backend (DS). As the effort taken by the backend becomes more realistic, the loss in throughput of having an ESB in the way reduces. So with a blindingly fast backend, we see the ESB struggling to provide just 55% of the throughput of the direct case. But as the backend becomes more realistic, we see much better numbers. So at 2000 requests a second there is barely a difference (around 10% reduction in throughput). 


In real life, what we actually see is that often you have many fewer ESBs than backend servers. For example, if we took the scenario of a backend server that can handle 500 reqs/sec, then we might end up with a cluster of two ESBs handling a cluster of 8 backends. 

Conclusion
I hope this blog has given a good overview of ESB performance and benchmarking. In particular, when is a good idea to look at latency and when to use throughput. 


3 comments:

  1. Good post. Do you know of any real-world case studies that demonstrate the use of ESB in latency-sensitive scenarious such as trading? I think people in general are comfortable that an ESB should be able to scale fairly linearly, and hence handle massive throughput requirements, but are worried that all the abstraction an ESB adds will introduced unacceptable latency overhead.

    Robert
    http://soa-probe.com

    ReplyDelete
  2. Something else you've not really talked about is what the ESB is doing in the test scenario. Showing that an ESB performs well during a trivial orchestration proves very little imho. The key is taking a real-world orchestration that involves some mediation, transformation, internal messaging, etc. Ideally this would be the equivalent of an existing (tightly-coupled, non ESB) real scenario that the ESB is hoping to replace.

    ReplyDelete