Jonathan Marsh writes about a scraping incident he recently had, and why web services (in any form) are good and scraping is bad. This reminds me of my first and last web scraping experiment. It was sometime between my MSc and joining the WebSphere team, so I place this around 1996 or 1997. The Irish Times (one of Ireland's national newspapers) had first started publishing its news on the web.
Now my father-in-law is a hibernophile (someone who is fond of Ireland), and since he lived outside London he couldn't readily get the Irish Times. If family were coming to visit, they were encouraged to buy a copy at Victoria Station and bring it down to him on the train.
In those days he didn't have a PC, let alone an internet connection. (These days he is a prolific blogger who puts me to shame.) Anyhow, back to scraping. I had a cunning plan. I wrote a Perl script that browsed the Irish Times website, scraping the links to the "top 5" news stories. It followed those links, grabbed the story data, and used Ghostscript (IIRC) to convert the text to a nicely printed page stored as a TIFF. Then I used tpc.int*, which offered a free email-to-fax service, to fax the page to his office fax machine. Finally I set it up as a cron job to run once a day.
So once a day he received the Irish news digest by fax. Oh yeah, I felt smart. Major brownie points with the f-i-l. Serious geek cred, pulling together Internet services and open source to create my first mashup. Until three days later, when they changed the page layout, and my script inadvertently picked up a GIF image, wrote it out as hex, converted it to a 107-page fax, and jammed up his office fax machine. Now you see why it was my last attempt at scraping. Long live RSS!
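With hindsight, a sanity check before anything went near the fax machine would have saved a lot of paper. Here is a minimal sketch of the idea in Python. It is not the original Perl script, which isn't reproduced here. The names `LinkCollector`, `top_story_links` and `looks_like_text` are made up for illustration, and treating "the first five distinct links on the page" as the top stories is an assumption about the layout, which is exactly the kind of assumption that broke the original.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def top_story_links(html, base_url, limit=5):
    """Return up to `limit` distinct absolute URLs, in page order.

    Assumption: the top stories are simply the first links on the page.
    """
    collector = LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.links:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
        if len(result) == limit:
            break
    return result


def looks_like_text(content_type, body):
    """Guard: accept only responses declared as text that do not
    start with the GIF magic bytes."""
    if not content_type.lower().startswith("text/"):
        return False
    return not body.startswith((b"GIF87a", b"GIF89a"))
```

The script would call `looks_like_text` on each fetched story, passing the response's Content-Type header and raw bytes, and skip anything that fails rather than rendering it. It doesn't make scraping any less fragile, but a layout change would then produce a short fax instead of a 107-page one.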
Wednesday, 28 May 2008