Data can be extracted from websites in several ways, for example with the HttpWebRequest/HttpWebResponse classes in C#/.NET, cURL in PHP, or the WebBrowser control in .NET.
In one of our projects we used the open-source tool Web-Harvest for data extraction.
You can see the details of that project, the Yahoo news scraper, on our website.
The project collects news for a specified stock ticker from the websites of many publishers and stores the extracted details in a file organized by published date and time.
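As a rough illustration of storing output by published date and time, here is a minimal Java sketch that builds a file name from a ticker and a timestamp. The class name `NewsFileNamer`, the method `fileNameFor`, and the `TICKER_yyyyMMdd_HHmmss.txt` naming scheme are assumptions made for this example; they are not taken from the actual project.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: the naming scheme below is an assumption for illustration.
class NewsFileNamer {
    // Sortable timestamp pattern, e.g. 20120305_093000
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    // Builds a file name such as YHOO_20120305_093000.txt
    static String fileNameFor(String ticker, LocalDateTime published) {
        return ticker.toUpperCase(Locale.ROOT) + "_" + published.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        LocalDateTime published = LocalDateTime.of(2012, 3, 5, 9, 30, 0);
        System.out.println(fileNameFor("yhoo", published));
    }
}
```

Putting the year first in the timestamp keeps the files in chronological order when they are sorted by name.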
Web-Harvest made the work very easy. If you are familiar with HTML, JavaScript and XML, you can learn Web-Harvest in two or three days.
It is written in Java, so it runs on different platforms (Windows, Linux) without any issue.
Every web page is laid out differently, which makes data extraction difficult.
Web-Harvest lets us use XPath and XQuery expressions, so data can be extracted from many websites even though their formats differ.
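To show why XPath helps with differing page layouts, here is a small sketch that uses the XPath API from the standard Java library rather than Web-Harvest itself. The class name `XPathSketch`, the method `selectText`, and the two sample pages are made up for this example. The same method handles both layouts; only the expression changes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Web-Harvest.
class XPathSketch {
    // Returns the trimmed text of every node matched by the XPath expression.
    // The input must already be well-formed XML.
    static List<String> selectText(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Two made-up pages that hold their headlines in different structures.
        String pageA = "<html><body><div class=\"news\"><h2>Alpha</h2><h2>Beta</h2></div></body></html>";
        String pageB = "<html><body><ul id=\"items\"><li><a href=\"/x\">Gamma</a></li></ul></body></html>";

        System.out.println(selectText(pageA, "//div[@class='news']/h2")); // prints [Alpha, Beta]
        System.out.println(selectText(pageB, "//ul[@id='items']/li/a"));  // prints [Gamma]
    }
}
```

Both sample pages are already well-formed XML. Real HTML pages usually are not, and need to be cleaned before XPath can be applied to them.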
Web-Harvest uses an XML-based configuration file to specify the extraction process.
JavaScript code can also be used within this XML configuration file.
It appears to use standard ECMAScript, so JavaScript written for Internet Explorer may not run in Web-Harvest as it is. We may need to change the script slightly to make it work.
XPath works only on XML documents, so Web-Harvest includes a cleaning processor (html-to-xml) that automatically converts an HTML page into well-formed XML.
It also provides a user-friendly GUI for developing, debugging and testing configurations.

1 comment:
Interesting points on extracting data. I use Python for simple data extraction, which can be a time-consuming process,
but for larger projects involving documents, files, or the web I tried "extracting data", which worked great. They build quick custom screen scrapers, data extraction, and data parsing programs.