Jonathan Marsh writes about a scraping incident he recently had, and why web services (in any form) are good and scraping is bad. This reminds me of my first and last web scraping experiment. It was sometime between my MSc and joining the WebSphere team, so I place this around 1996 or 1997. The Irish Times (one of Ireland's national newspapers) had first started publishing its news on the web.
Now my father-in-law is a hibernophile (someone who is fond of Ireland) and since he lived outside London he couldn't readily get the Irish Times. If family was coming to visit, they were encouraged to buy a copy at Victoria Station and bring it down to him on the train.
In those days, he didn't have a PC, let alone an internet connection. Now he is a prolific blogger who puts me to shame. Anyhow, back to scraping. I had a cunning plan. I wrote a Perl script that browsed the Irish Times website, scraping the links to the "top 5" news stories. The script followed those links, grabbed the story data, and used Ghostscript (IIRC) to convert the text to a nicely printed page stored as a TIFF. Then I used tpc.int*, which offered a free email-to-fax service, to fax the page to his office fax machine. Finally I set it up as a cron job to run once a day.
So once a day he received the Irish news digest by fax. Oh yeah, I felt smart. Major brownie points with the f-i-l. Serious geek cred, pulling together Internet services and open source to create my first mashup. Until 3 days later, when they changed the page layout, and my script inadvertently picked up a GIF image, wrote that out as hex, converted it to a 107-page fax, and jammed up his office fax machine. Now you see why it was my last attempt at scraping. Long live RSS!
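With hindsight, a small sanity check between the "grab" step and the "fax" step would have spared the fax machine. Here is a minimal sketch of the idea, in Python rather than the original Perl. The names `LinkCollector`, `top_story_links` and `looks_like_text` are made up for illustration, and the size limit is an arbitrary assumption.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def top_story_links(html, limit=5):
    """Return the first `limit` links on the page (the fragile part)."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links[:limit]


def looks_like_text(payload, max_bytes=20000):
    """Refuse anything that is obviously binary or implausibly long."""
    if payload.startswith((b"GIF87a", b"GIF89a")):  # GIF magic numbers
        return False
    if b"\x00" in payload:  # NUL bytes rarely appear in story text
        return False
    return len(payload) <= max_bytes  # hypothetical limit, tune to taste
```

Only payloads that pass `looks_like_text` would go on to the print-and-fax step. The link extraction is still at the mercy of the page layout, which is exactly why a feed beats scraping.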
Wednesday, 28 May 2008
WSF/PHP 1.3.0
As Samisa points out, there is a new release of WSF/PHP out. This tool allows you to interact with .NET and J2EE systems using WS-Security, WSRM, WS-Addressing, MTOM.
The main update this time is WSDL2PHP support. Take a look here.
Wednesday, 14 May 2008
Tin Whistles
My blog title mentions tin whistles, but I haven't exactly written a lot about them. But I recently bought a fantastic tin whistle and I think the story is interesting.
Tin whistles generally come in two forms: mass produced for less than $10 each, or hand made for $100 and upwards. The mass produced kind are probably the cheapest "real" instrument you can buy. I know a lot of people probably think it isn't a real instrument, but just listen to some of the samples from Mary Bergin's Feadoga Stain album, or Carmel Gunning's Lakes of Sligo - all recorded on a $5 instrument.
Recently however, a new type of whistle has appeared - the third way! Known as the "tweaked" whistle, this is where someone takes a dirt-cheap mass produced whistle and improves it till it sounds great. Then they sell it on. This is a really interesting idea. It reminds me of how Apache started - by tweaking the original NCSA web server. Unfortunately tweaking in the virtual world is instantaneously replicable. Tweaking physical things is harder, but Erik at Vargas Whistles (who made my tweaked whistle) at least documents his tweaks very clearly. Worth a read if you are a budding whistlesmith.
Tuesday, 13 May 2008
Open Source versus Open Standards
There is an ongoing debate about Open Standards and Open Source at Artima. I have considerable experience in both activities. I ran the very first JCP group that was completely open - JSR110 Java APIs for WSDL. We had an open mailing list, an open source reference implementation and an open source TCK. Since then I've chaired the OASIS WSRX TC and also contribute to a number of Apache projects.
For me the core difference between Open Standards and Open Source is this:
- Open Standards enable companies to compete in a structured way
- Open Source projects enable people or companies to collaborate in a structured way
Many open standards groups consist of companies who are in strong competition. The aim of the standard is fundamentally to allow them to agree enough conformance to open up the market and grow the potential business through standardization. So usually the result is the minimum agreement required to create the open environment for competition. In my experience, Open Standards groups are not always effective at creating new stuff - instead they excel at tightening up already created stuff.
Open source projects are usually run by like-minded people who want to share the effort of developing code and share the results. The result is usually much more creative and expansive than Open Standards.
Of course there are exceptions. I believe for example that the Web and XML standards were built in a very collaborative way. And there are open source projects where the competition between contributors becomes an issue. However, the general landscape is defined by the objectives of the participants. If the fundamental objectives of the participants are open competition, then there will be those outcomes.
So how does this relate to the JCP discussion?
When companies compete, it is in all their best interests to strongly enforce conformance. However much it might be tempting to "get away" with not implementing a strong set of conformance tests, or not having to go through that testing, the fact remains that standards are worthless without conformance, and so the effort in creating them is wasted without conformance.
So it's my assertion that creating a strong set of conformance tests is in the best interests of all parties in a standards body. And I believe the best way to create those is to do it as a collaborative effort. Effectively this is the chance for the participants to get creative about competition. Try to create a test case that proves your competitor's system is non-conformant. Collaborate to break each other's systems!
Hence creating the test kit should be done as an open source project. And by that I mean open development as well as open publishing of the source. If you believe in my logic you will come to the conclusion that the JCP model is still in need of revision.
Monday, 12 May 2008
Esper 2.1 ships - with Axiom support
Esper 2.1 has shipped. Esper is a cool project that looks for patterns in streams of events. You can use a SQL-like language (EPL) to create queries against groups of events.
Esper is very useful in electronic trading, fraud detection, RFID processing, etc.
You will notice that I get a bit of credit in the 2.1 release note. I should note that Sanka Samaranyake helped me out with this code.
Esper supports different "event" types, including JavaBean, Map and XML. The existing XML support was built on DOM.
I added support for streaming XML events via Apache Axiom. Axiom is a tree view built on top of the StAX streaming XML pull parser, and it supports efficient XPath processing via Jaxen. What this means is that if an XPath expression matches something near the start of an event, the parser can evaluate it without parsing the complete event.
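To make the "without parsing the complete event" point concrete, here is a rough sketch of the same idea using the pull parser in Python's standard library. It is not Axiom, StAX or Jaxen: a plain tag match stands in for the XPath expression, and the function name `first_match` and the sample event are made up for illustration.

```python
import xml.etree.ElementTree as ET


def first_match(chunks, tag):
    """Feed XML to a pull parser chunk by chunk and stop at the first `tag`.

    Returns (element text, number of chunks consumed). The remaining
    chunks are never handed to the parser.
    """
    parser = ET.XMLPullParser(events=("end",))
    consumed = 0
    for consumed, chunk in enumerate(chunks, start=1):
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                return elem.text, consumed
    return None, consumed


# A hypothetical event: a small header followed by a large payload.
event = [
    "<event><symbol>IBM</symbol>",
    "<payload>" + "x" * 1000,
    "</payload></event>",
]
text, used = first_match(event, "symbol")
```

The value of `symbol` comes back before the whole event has been fed to the parser, whereas a DOM-style approach has to build the complete tree before any query can run.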
I was motivated to write this for two reasons. Firstly, Axiom is very simple to replace DOM with. So you can add streaming support to any DOM application very quickly. Secondly, Axiom is the message format that Apache Synapse uses, and as you might have seen from previous blog posts, I've been putting Synapse and Esper together for a while now.
Sanka did some performance tests with the pre-release code and it showed some significant performance benefits over the existing DOM parsing. And of course those benefits will be even more pronounced when using Esper and Synapse together - because there is now no need to convert the message to DOM. Look out for an updated Synapse/Esper mediator in the near future.
Finally - congratulations to Thomas and the team at Esper on the release.
Saturday, 10 May 2008
Why open standards and interoperability are subtly different
I've been pondering on my experience in Open Standards - especially in the WSRM/WSRX world. In particular I've been thinking about the difference between an Open Standard and an Interoperable standard. It seems to me that there is a general assumption out there that these two things are the same. In an ideal world they would be, but I think there is a gap:
- Independent implementations
- Specification completeness
- Testing
Let me give a simple example from the WSRM testing. In WSRM there is the option for a client to request an acknowledgement. An acknowledgement (ack) is designed to let the client know which messages have been received and which haven't. The ability to request acks is key to the working of the protocol. However, acks can also be sent by the server without a request. And we built in the option to allow the server to send a "nack" as well as an ack. For example, an ack might say "I have received messages 1-5 and 7-100", but the nack can specify: "I'm missing message 6 - send it now".
During the testing we came across the following situation: the client was asking for an acknowledgement and the server was sending a nack. The client needed to understand the whole sequence state (which the ack gives) but couldn't because it was only being told one data point (I don't have message 6). The result was a problem. Now, we could have simply said - this is a problem with the client - it needs to deal with it. However, the reality is that this - while legal behaviour of the spec - was not something the designers envisaged. The fact is that it is important for the client to be able to request a complete ack. So we solved this by improving the spec. If the server is initiating it can send a nack, but when responding it must always send the full ack.
Of course it's possible we might have caught this just by reviewing the spec. But we didn't. The original team who wrote the spec didn't think of this. The WSRX committee that reviewed it didn't spot it. None of the 5 implementers of the spec spotted it. It was only at the interop testing that we noticed it. And it was only because we had independent implementations and interoperability testing that we ended up with a complete specification.
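The revised rule is easy to state in code. This is only a sketch of the decision logic under simplifying assumptions: the tuples are a made-up representation rather than the WSRM wire format, the function names are hypothetical, and the server here always chooses a nack when it initiates and has gaps (the spec merely allows that).

```python
def ack_ranges(received):
    """Collapse received message numbers into inclusive (low, high) ranges."""
    ranges = []
    for n in sorted(set(received)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current range
        else:
            ranges.append((n, n))  # start a new range after a gap
    return ranges


def missing(received, highest):
    """Message numbers from 1 to `highest` that have not arrived."""
    have = set(received)
    return [n for n in range(1, highest + 1) if n not in have]


def server_reply(received, highest, ack_requested):
    """When the client asked, always send the full ack, never a bare nack."""
    gaps = missing(received, highest)
    if ack_requested or not gaps:
        return ("ack", ack_ranges(received))
    return ("nack", gaps)  # only allowed when the server initiates


received = [n for n in range(1, 101) if n != 6]
requested = server_reply(received, 100, ack_requested=True)
unsolicited = server_reply(received, 100, ack_requested=False)
```

With message 6 missing out of 100, the requested reply carries the whole sequence state as two ranges, while the unsolicited reply may still be the single data point "6".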
Friday, 9 May 2008
More Cool - GeoTwittering
If you want to see how incredible our mashup platform is, take a look at this example. Basically, one of our engineers saw TwitterVision. Just 5 hours later, the same idea was running on our open mashup platform. Sign up and start creating your own mashups! It's also pretty compelling viewing... for some reason I find twitter much more interesting on a map.
Labels:
twitter blog mashup
Thursday, 1 May 2008
Cool!
It's official - WSO2 is cool. Well - at least cool enough for Gartner. In their report "Cool Vendors in Web Technologies, 2008" (pay/subscription) they list WSO2 as one of 5 picks.
Obviously you will need to contact Gartner for the full information, but here is a little taster - talking about our Mashup Server:
"What is cool is the open-source aspect, the support for JavaScript-based mashups that can migrate from server to client, and support for lightweight but enterprise-oriented Web services."