Wednesday, May 20, 2009

Web-Harvest - An open source tool for extracting data from websites

Data can be extracted from websites using several approaches, such as the HttpWebRequest/HttpWebResponse objects of .NET, cURL in PHP, and the WebBrowser control of .NET.

In one of our projects we used Web-Harvest, an open source tool, for data extraction.

You can see the details of our project, the Yahoo news scraper, on our website.
Basically, the project collects news from many publishers' websites for a specified stock ticker and stores the extracted details in a file based on the published date and time.
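
As a rough illustration of storing output "based on published date and time", here is a minimal Java sketch. The class name, the directory layout and the file name pattern are all assumptions made for this example; they are not taken from the actual project.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: the naming scheme below is an assumption for illustration only.
class NewsFileNaming {

    // Assumed timestamp layout: date, underscore, 24-hour time.
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMdd_HHmm");

    // Builds <baseDir>/<TICKER>/<TICKER>_<yyyyMMdd_HHmm>.txt for one news item.
    static Path outputFile(String baseDir, String ticker, LocalDateTime published) {
        String symbol = ticker.toUpperCase(Locale.ROOT);
        String fileName = symbol + "_" + published.format(STAMP) + ".txt";
        return Paths.get(baseDir, symbol, fileName);
    }

    public static void main(String[] args) {
        // Example values only; "yhoo" and the "news" directory are placeholders.
        Path target = outputFile("news", "yhoo", LocalDateTime.of(2009, 5, 20, 14, 30));
        System.out.println(target);
    }
}
```

With the example values, the file name part of the result is YHOO_20090520_1430.txt. Grouping by ticker first keeps all items for one symbol together, but any scheme that includes the published timestamp would work as well.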

Web-Harvest made the work very easy. If you are familiar with HTML, JavaScript and XML, you can learn Web-Harvest in two or three days.

It is written in Java, so it can be used on many different platforms (Windows, Linux) without any issue.

Every web page is formatted differently, which makes data extraction difficult.

In Web-Harvest, we can use XPath and XQuery to extract data easily from many websites with different formats.
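
To give a feel for what an XPath expression does, here is a small sketch that uses the standard Java XPath API rather than Web-Harvest itself. The sample markup, the class name and the expressions are made up for this example; a real page would need its own expressions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: this uses the JDK's XPath API, not Web-Harvest.
class XPathSketch {

    // Hypothetical, already well-formed markup standing in for a news page.
    static final String SAMPLE_PAGE =
        "<html><body>"
        + "<div class=\"story\"><a href=\"/news/1\">First headline</a></div>"
        + "<div class=\"story\"><a href=\"/news/2\">Second headline</a></div>"
        + "<div class=\"ad\"><a href=\"/promo\">Sponsored link</a></div>"
        + "</body></html>";

    // Returns the text content of every node matched by the XPath expression.
    static List<String> extract(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(nodes.item(i).getTextContent());
            }
            return results;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select the links inside story blocks only, skipping the ad block.
        System.out.println(extract(SAMPLE_PAGE, "//div[@class='story']/a"));
        // Select the link targets instead of the link text.
        System.out.println(extract(SAMPLE_PAGE, "//div[@class='story']/a/@href"));
    }
}
```

The same kind of expression can be reused across pages that share a structure, so only the expression has to change when a site uses a different layout.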

Web-Harvest uses an XML-based configuration file to specify the extraction process.
Within this XML configuration file we can also use JavaScript code.

It seems to use ECMA-based JavaScript (ECMAScript), so JavaScript that works in Internet Explorer may not run in Web-Harvest. We may need to change the script slightly to make it work with Web-Harvest.

XPath can be used only with XML documents, so Web-Harvest has a cleaning processor that automatically converts any HTML page into an XML page.
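
The need for this cleaning step can be shown with a short sketch that uses only the JDK's XML parser, not Web-Harvest. The two markup strings and the class name are made up for this example, and the "cleaned" string is written by hand here to stand in for what a cleaning step might produce.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: checks whether markup is well-formed XML using the JDK parser.
class WellFormedSketch {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, so XPath could not be applied to it
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser could not be used", e);
        }
    }

    public static void main(String[] args) {
        // Typical browser-tolerated HTML: the <br> tag is never closed.
        String rawHtml = "<p>Price: 10<br>Change: +2</p>";
        // Hand-written stand-in for the output of a cleaning step.
        String cleaned = "<p>Price: 10<br/>Change: +2</p>";

        System.out.println("raw HTML well-formed: " + isWellFormedXml(rawHtml));
        System.out.println("cleaned well-formed:  " + isWellFormedXml(cleaned));
    }
}
```

Browsers accept the first string, but an XML parser rejects it, which is why the HTML has to be converted before XPath can be run on it.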

It provides a user-friendly GUI for developing, debugging and testing the code.

