Data can be extracted from websites using several approaches, such as the HttpWebRequest/HttpWebResponse classes in C#/.NET, cURL in PHP, or the WebBrowser control in .NET.
In one of our projects, we used the open source tool Webharvest for data extraction.
You can see the details of this project, Yahoo News Scraper, on our website.
The project collects news for a specified stock ticker from many publishers' websites and stores the extracted details in a file based on the published date and time.
Webharvest is written in Java, so it runs on any platform with a Java runtime, such as Windows or Linux.
Every web page is formatted differently, which makes data extraction difficult.
In Webharvest, we can use XPath and XQuery to extract data from websites whose pages use different formats.
Webharvest uses an XML-based configuration file to specify the extraction process.
XPath works only on XML documents, so Webharvest includes a cleaning processor that automatically converts an HTML page into well-formed XML.
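To illustrate why the cleaning step matters, here is a minimal sketch of an XPath query run against a page that is already well-formed XML. It uses only the JDK's built-in javax.xml.xpath classes, not Webharvest's own API. The class name XPathSketch, the helper extractHeadlines, the "headline" class attribute and the sample markup are all assumptions made up for this illustration, not part of our project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical helper: returns the trimmed text of every
    // <a class="headline"> element found in a well-formed XML string.
    static List<String> extractHeadlines(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Parse the XML and evaluate the expression in one call.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a[@class='headline']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> headlines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                headlines.add(nodes.item(i).getTextContent().trim());
            }
            return headlines;
        } catch (XPathExpressionException e) {
            // Raised for an invalid expression or for input that is not well-formed XML.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: what a cleaned news page might look like.
        String cleanedPage =
                "<html><body>"
              + "<div><a class=\"headline\" href=\"/n/1\">First story</a></div>"
              + "<div><a class=\"headline\" href=\"/n/2\">Second story</a></div>"
              + "<a class=\"other\" href=\"/about\">About us</a>"
              + "</body></html>";
        for (String headline : extractHeadlines(cleanedPage)) {
            System.out.println(headline);
        }
    }
}
```

Raw HTML with unclosed tags would make this helper throw, which is why a tool has to clean the page into XML before any XPath expression can be applied.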
Webharvest also provides a user-friendly GUI for developing, debugging, and testing the code.