Friday, 28 September 2012

Understanding Logging in the Cloud

I recently read an interesting pair of articles about Application Logging in OpenShift. While these are great articles on how to use log4j and Apache Commons Logging, they don't address the cloud logging issue at all.

What is the cloud logging issue?

Suppose I have an application I want to deploy in the cloud. I also want this app to scale automatically and elastically. In fact I'm hoping that this app will succeed - and then I'm going to want to deploy it in different geos. I'm using EC2 for starters, but I might need to move it later. Ok, so that sounds a bit YAGNI. Let's cut back the requirements: I'm running my app in the cloud, on a single server in a single geo.

I do not want to log to the local filesystem.

Why not? Well firstly, if this is say EC2, then the server might get terminated, and I'm going to lose my logs. If it doesn't get terminated, then the logs are going to grow and fill up my local filesystem. Either way, I'm in a mess.

I need to send my logs somewhere that is:
1) designed to collect logs from multiple places - e.g. whichever EC2 or other instance my server happens to be hosted on today
2) separate from my worker instances, so the logs survive when an instance is stopped and started
3) able to do proper log rotation, etc.

If I have this, it solves my initial problem, but it also supports my bigger requirements around auto-scaling and geos.

Stratos is an open source Platform-as-a-Service foundation that we've created at WSO2. In Stratos we had to deal with this early on because we support elastic auto-scaling by default.

In Stratos 1.x we built a model based on syslog-ng. Basically, applications log using log4j, so just as with any normal log4j logging you would write something like:


import org.apache.log4j.Logger;

Logger logger = Logger.getLogger("org.fremantle.myApp");
logger.warn("This is a warning");


We automatically set up the log appenders in the Stratos services to use the log4j syslog appender. When we start an instance, we automatically configure it under the covers to pipe the syslog output to syslog-ng. Then we collate these logs and make them available.
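
To make that concrete, here is roughly what such an appender configuration might look like in log4j properties form. This is a minimal sketch: Stratos generates the real configuration automatically, and the host name here is an assumption.

# Minimal sketch - not the generated Stratos configuration.
# Assumes a syslog/syslog-ng daemon listening on localhost.
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=localhost
log4j.appender.SYSLOG.Facility=USER
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%-5p [%c] %m%n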

In Stratos 2.x we have improved on this. The syslog-ng model was not as efficient as we needed, and we also needed a better way of slicing and dicing the resulting log files.

In the Stratos PaaS we also have another key requirement - multi-tenancy. We have lots of server instances, some of which run one instance per tenant/domain, and some of which are shared between tenants. In both cases we need to split out the logs so that each tenant only sees their own logs.
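
Stratos takes care of this for you, but to give a feel for how tenant context can travel with a log event, log4j's mapped diagnostic context (MDC) is the usual mechanism. This is purely illustrative - the key name is an assumption, not necessarily what Stratos uses:

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

public class TenantLogging {
    public static void main(String[] args) {
        // Tag every subsequent log event on this thread with a tenant id;
        // a layout pattern such as "%X{tenantId} %m%n" will then emit it.
        // The "tenantId" key is an assumption for this sketch.
        MDC.put("tenantId", "tenant1");
        Logger.getLogger("org.fremantle.myApp").warn("This is a warning");
    }
}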

So in Stratos 2.x (due in the next couple of months) we have a simple Apache Thrift interface (and a JSON/REST one too). We already have a log4j appender that pushes to this, so exactly the same code as above works in Stratos 2.x with no changes.
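
The real appender speaks Thrift, but to make the idea concrete, here is a minimal sketch of a log4j appender that publishes each event to a remote JSON/REST endpoint. The class name, endpoint URL and JSON format are all invented for illustration - this is not the actual Stratos appender:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

public class RemoteLogAppender extends AppenderSkeleton {

    // Assumed endpoint; configurable like any other appender property.
    private String endpoint = "http://localhost:8280/logs";

    public void setEndpoint(String endpoint) { this.endpoint = endpoint; }

    protected void append(LoggingEvent event) {
        try {
            // Naive JSON encoding - a real appender would escape the message.
            String json = "{\"level\":\"" + event.getLevel()
                    + "\",\"logger\":\"" + event.getLoggerName()
                    + "\",\"message\":\"" + event.getRenderedMessage() + "\"}";
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(json.getBytes("UTF-8"));
            out.close();
            conn.getResponseCode(); // force the request to be sent
        } catch (Exception e) {
            errorHandler.error("Could not publish log event", e, 0);
        }
    }

    public void close() { closed = true; }

    public boolean requiresLayout() { return false; }
}

Because the appender is wired in through configuration, the application code itself doesn't change at all.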

We are also going to add support for non-Java logging (e.g. syslog, log4php, etc.).

Now what happens next? The local agent on each cloud instance is automatically set up to publish to the central log server. This takes the logs and publishes them to an Apache Cassandra database. We then run Apache Hive scripts that slice the logs per tenant and per application. The results are available to the user via our web interface and also via simple network calls.

Why this model? Because it is really scalable. I mean really, really scalable. Cassandra can scale to hundreds of nodes if necessary, and it's really fast: our benchmarks show that we can write >10k entries/second on a normal server.
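
As a sketch of what those "simple network calls" could look like from the client side - the URL scheme here is invented for illustration and is not the actual Stratos API:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchLogs {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: the per-tenant, per-application log slice.
        URL url = new URL("http://stratos.example.com/logs/tenant1/myApp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // each line is one log entry
        }
        in.close();
    }
}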

Summary

Logging in the cloud isn't just about logging to your local disk. That is not a robust or scalable answer. Logging in the cloud needs a proper cloud logging model, and in Stratos we have built one. You can use it from Java today, and from Stratos 2.0 we are adding support to publish log entries with a simple REST interface, or via a super-fast, highly scalable approach with Apache Thrift.
