Tuesday, 16 September 2008

An interesting problem in Event Driven Architecture

I spent last week hacking on Event Driven Architecture with some very smart people. We came across a problem, and I thought maybe somebody has already seen this and solved it! Comments are welcome.

We were using the following pattern. We have n existing systems. Let's call them L1...Ln. Each of these systems is basically not "touchable": we can only call the existing interfaces. Each system offers something similar to the following interface:
* update(resource r)
which allows you to update some object, resource or thing. 

We plan to use this interface to update all the systems using events to keep them in sync. Each system can subscribe to updates and keep in sync. 

This works fine if the model is master-slave. In other words, there is a single master system that publishes events to all the slaves. But in reality, that isn't how most architectures work. Each system may have updates which need to be propagated to the others.

To make this work, each system allows you to be notified when any updates happen. The problem is that you cannot distinguish the root cause of the update. Some updates come from within the existing system, and of course some updates come through the external interface.

So imagine I make a change to Resource R1 in system L1. An event (E1) is issued saying there is an update to R1. This is distributed to L2 and L3. These systems in turn update their core storage in response to event E1. Now, these systems issue events (E2 and E3) that reflect the updates to those systems. Ideally we would junk these updates since they are just "echoes" of E1. Without doing that, the system will start to resonate, and feedback will take over, just like in a badly set up PA.
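To make the resonance concrete, here is a minimal Python sketch of three systems that each re-announce every update they receive. The system names come from the example above; the queue, the `simulate` function and the event cap are assumptions made purely for illustration, not part of what we built.

```python
from collections import deque

SYSTEMS = ["L1", "L2", "L3"]

def simulate(max_events=20):
    """Deliver events with no echo suppression.

    Returns (events delivered, events still waiting in the queue).
    """
    queue = deque([("L1", "R1")])  # E1: L1 announces a change to R1
    delivered = 0
    while queue and delivered < max_events:
        source, resource = queue.popleft()
        delivered += 1
        for target in SYSTEMS:
            if target != source:
                # target calls update(resource), which in turn makes the
                # target emit its own event about the same resource
                queue.append((target, resource))
    return delivered, len(queue)

delivered, pending = simulate(20)
print(delivered, pending)  # 20 21
```

Every delivered event spawns two more, so the backlog grows by one for each event processed and never drains. The cap is the only thing that stops the loop.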

Spotting the echoes is actually pretty hard to do: we cannot distinguish between these echoes and E1. 

Ideally we would have a correlation between E1 and E2, but since the existing system can't be changed, we can't do this.

Suppose we try to do it by comparing E1 and E2. The problem is there may be different data in E1 and E2. Maybe system L2 adds a "last updated" timestamp. This will mean E2 is slightly different to E1.
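A small sketch of why the naive comparison fails. The payload shape and field names (`name`, `price`, `last_updated`) are invented for illustration; the real systems each have their own formats.

```python
# Hypothetical payloads; the field names are assumptions for illustration.
e1 = {"resource": "R1", "name": "Widget", "price": 10}

# L2 applies E1, then emits E2 with an extra transient field of its own.
e2 = dict(e1, last_updated="2008-09-16T10:00:01")

def is_echo_naive(original, candidate):
    """Treat the candidate as an echo only if it is field-for-field identical."""
    return original == candidate

print(is_echo_naive(e1, e1))  # True
print(is_echo_naive(e1, e2))  # False: the echo is not recognised
```

The comparison only works if something knows that `last_updated` should be ignored, and that knowledge has to live somewhere.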

What about using timing? Just drop all events that come back referencing R1 for the next 10 seconds after E1 is delivered? But what if someone edits the resource inside L2 during that window, so there is a genuine new event (E4) as well as the echo? The real event gets dropped.
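Here is a rough sketch of that timing filter, to show where it goes wrong. The class name, method names and explicit `now` parameter are assumptions for illustration; times are plain seconds so the example is deterministic.

```python
class TimeWindowFilter:
    """Drop any event about a resource that arrives within `window_seconds`
    of the last event we delivered for that resource."""

    def __init__(self, window_seconds=10.0):
        self.window = window_seconds
        self.last_delivered = {}  # resource id -> time of last delivery

    def record_delivery(self, resource_id, now):
        self.last_delivered[resource_id] = now

    def should_drop(self, resource_id, now):
        delivered_at = self.last_delivered.get(resource_id)
        return delivered_at is not None and now - delivered_at < self.window

f = TimeWindowFilter(window_seconds=10.0)
f.record_delivery("R1", now=0.0)        # E1 delivered to L2 at t=0
print(f.should_drop("R1", now=1.0))     # True: E2, the echo, is dropped
print(f.should_drop("R1", now=4.0))     # True: E4, a genuine edit, is also dropped
print(f.should_drop("R1", now=15.0))    # False: outside the window
```

The filter has no way to tell E2 from E4, because it only looks at the resource id and the clock.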

Is there any solution? The only one we found is that we have a "Master System" (M). All updates from L1...Ln must go through M. So L2 only hears about L1's events via M. And M keeps track of the master data, and M knows what information is important and what is transient (like the aforementioned Timestamp). 

Every time an update comes to the Master, it checks against its own store. If there is a genuine update (the important data for the resource has changed) then it distributes it to everyone (except maybe the source of the update). If the important data matches the existing state, it drops the message. Think of it as a feedback suppressor.

So in our case, E1 would get past (genuine update), but E2 and E3 would be dropped. But E4 would be distributed because it contains real updates to the state of R1.
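The walk-through above can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not our actual implementation: the `Master` class, the `handle` method, the in-memory store and the field names are all hypothetical, and message transformation between formats is left out.

```python
class Master:
    """Feedback suppressor: forward an update only if the important
    fields differ from what the Master already holds."""

    def __init__(self, important_fields):
        self.important_fields = important_fields
        self.store = {}  # resource id -> last known important data

    def _important(self, payload):
        # Project the payload onto the fields that matter, ignoring
        # transient ones such as timestamps.
        return {k: payload[k] for k in self.important_fields if k in payload}

    def handle(self, source, resource_id, payload, systems):
        """Return the systems to forward to (empty list means 'echo, dropped')."""
        core = self._important(payload)
        if self.store.get(resource_id) == core:
            return []
        self.store[resource_id] = core
        return [s for s in systems if s != source]

systems = ["L1", "L2", "L3"]
m = Master(important_fields=["name", "price"])

e1 = {"name": "Widget", "price": 10}
e2 = {"name": "Widget", "price": 10, "last_updated": "10:00:01"}
e3 = {"name": "Widget", "price": 10, "last_updated": "10:00:02"}
e4 = {"name": "Widget", "price": 12, "last_updated": "10:00:05"}

print(m.handle("L1", "R1", e1, systems))  # ['L2', 'L3']
print(m.handle("L2", "R1", e2, systems))  # []
print(m.handle("L3", "R1", e3, systems))  # []
print(m.handle("L2", "R1", e4, systems))  # ['L1', 'L3']
```

Note that forwarding E4 to L1 and L3 will produce fresh echoes from those systems, and the Master drops them in the same way because by then its store already holds the new price.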

This is a good pattern. It means we can always get the master state. But it's also considerably more complex, with more moving parts than the standard view of an Event Driven Architecture. For example, you need twice as many topics, because you have to have one topic for messages to flow up to the Master and a separate topic for messages to flow back out to the Ln systems.

Secondly, we need a complete copy of the state (or at least a cache of recent state), plus we need to know the shape of the data: what is important and what is transient.

All in all, this is an interesting issue when you try to bridge between a legacy system and an event driven architecture. If anyone has a better solution, please let me know!

5 comments:

  1. Indeed an interesting problem. The solution with the master immediately makes me think of an MDM solution, where one can also detect/resolve parallel changes and manage similar resources with different key identifiers and/or attributes.

  2. At first glance, I was like 'the distributed commit problem without a centralized coordinator has been solved many years ago'. But when I read it further, it made me realize the shape of the problem here is quite different with the assumptions of legacy systems and "untouchable" interface.

    You are making an assumption for even the master-slave based approach to work, aren't you? The solution assumes that the legacy L1,..,Ln can distinguish that a message comes from M and need not echo back to M once updated - otherwise M will keep on getting duplicates. If we can make this assumption, that is, knowing the origin of the update, can't L1,..,Ln take advantage of this fact and prevent echoing back even in the distributed approach? Am I missing something here?

  3. Assuming even the subscription logic is also untouchable, I do not see any solution but having a central switch as you have explained.

    You are talking about using timing to drop echoed events? I am not clear where that piece of logic would fit in if it were feasible.

    I wonder if it is possible to have an event proxying mechanism where an event proxy is deployed in front of each legacy system, as opposed to having a central switch (Master System), so that most of the issues in using a central switch could be solved.

  4. Guy, Nabeel, Danushka, thanks for the feedback. I'm going to try and answer your questions here.

    Guy - effectively the model we came up with is an MDM solution. I was surprised we needed to do that. We are using the MDM system to manage the key identifiers, attributes, etc.

    Nabeel, to solve the simple echo problem, we are not assuming that the L1 legacy system will not echo. Basically it is up to M to notice the echo and drop it. We cannot stop echoes because the underlying system will always produce them. We can stop echoes the other way - if L1 produces a message, then M (which we wrote fresh) is smart enough not to send that update back to L1.

    Danushka, the timing option was just something we considered and discarded. It doesn't work.

    As regards putting a proxy in front of each system to drop the echoes, we did consider this. The problem is that there are lots of different events in the system. In fact I also glossed over the fact that each system has its own message format and we are doing transformation. The result of all this is that to build each proxy, you have to encode a lot of information about which data fields it must use to identify the echoes and which ones it can ignore. Also each proxy has to cache messages or even store the data. So in fact this becomes a less-efficient and more complex solution than having a single central Master M.

  5. Interesting problem, but it would be tempting to say that with the assumptions and constraints in place (data sync problem, can't touch the legacy systems, no way to differentiate root cause from echo), an event-driven architecture is the wrong choice.

    (Not to mention that not having a "system of record" is probably a bad idea anyway!)

    EDA by its nature provides completely decoupled communications between entities, so if you're attempting to put state into the equation (e.g. the master or proxy model) then you're starting to leave EDA and get into SOA.

    OK, onto my ideas...

    Assuming there is no reason to send out echoes at all - that is, the resource being synchronised is pretty much the same in all systems, and it isn't a chain or hierarchy of related resources being updated - then the problem should be solvable. Otherwise I can't see how you can solve it without a master to control those relationships.

    Anyway, going down the "no echo required" path, if they're legacy systems then you presumably have some sort of adapter or wrapper to convert legacy notifications into EDA.
    - For outgoing "update" events, they would just get sent to all the subscribers.
    - For incoming "update" events, the adapter would have to do a *synchronous* call to the "update" function, then pick up the expected event that comes back out and discard it.

    If someone is updating one of the other systems at the same time, however, you could potentially get two events coming out - one root and one echo. But concurrent root updates across systems make this whole problem a whole lot worse anyway!

    I'm still tempted to say that if the answer seems too complex or difficult, perhaps the question is wrong...
