Data can be extracted from websites in several ways, for example with the HttpWebRequest/HttpWebResponse classes in C#.NET, cURL in PHP, or the WebBrowser control in .NET.
In one of our projects we used WebHarvest, an open source tool, for data extraction.
You can see the details of this project, Yahoo News Scraper, on our website.
Basically, the project collects news for a specified stock ticker from many publishers' websites and stores the extracted details in a file based on the published date and time.
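As a rough illustration of storing output "based on published date and time", the sketch below builds a file name from a ticker and a publication timestamp. The class name `NewsFileNamer`, the method `fileNameFor` and the `TICKER_yyyyMMdd_HHmmss.txt` pattern are assumptions made for this example; they are not the naming scheme of the actual project.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class NewsFileNamer {

    // Hypothetical timestamp layout; the real project may use a different one.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    // Builds a file name such as TICKER_20090514_093000.txt (assumed pattern).
    static String fileNameFor(String ticker, LocalDateTime publishedAt) {
        return ticker.toUpperCase(Locale.ROOT) + "_" + publishedAt.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        // Sample values chosen only for the demonstration.
        LocalDateTime published = LocalDateTime.of(2009, 5, 14, 9, 30, 0);
        System.out.println(fileNameFor("yhoo", published));
    }
}
```

Putting the timestamp in the file name keeps the files for one ticker sorted by publication time in a plain directory listing.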
WebHarvest made the work very easy. If you are familiar with HTML, JavaScript and XML, you can learn WebHarvest in two or three days.
It is written in Java, so it can be used on different platforms (Windows, Linux) without any issue.
Every website formats its pages differently, which makes data extraction difficult.
In WebHarvest, we can use XPath and XQuery to extract data from websites that use different formats.
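To give an idea of what XPath-based extraction looks like, here is a minimal sketch that uses only the XPath support built into the JDK, not WebHarvest itself. The sample page, the `story` class name and the class `XPathHeadlineDemo` are made up for the example, and the input is assumed to be well-formed XML already.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathHeadlineDemo {

    // Returns the trimmed text of every node matched by the XPath expression.
    static List<String> extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already well-formed page fragment.
        String page = "<html><body>"
                + "<div class=\"story\"><a href=\"/a\">First headline</a></div>"
                + "<div class=\"story\"><a href=\"/b\">Second headline</a></div>"
                + "<div class=\"ad\"><a href=\"/c\">Sponsored link</a></div>"
                + "</body></html>";
        // Select only the links inside "story" blocks and skip the ad.
        for (String headline : extract(page, "//div[@class='story']/a")) {
            System.out.println(headline);
        }
    }
}
```

When a publisher uses a different page layout, only the XPath expression needs to change; the surrounding code stays the same.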
WebHarvest uses an XML-based configuration file to specify the extraction processes.
Within this XML configuration file we can also use JavaScript code.
It seems to use ECMAScript-based JavaScript, so a script that works in Internet Explorer may not run in WebHarvest. We may need to change the script slightly to make it work with WebHarvest.
XPath can be used only with XML documents, so WebHarvest has a cleaning processor that automatically converts an HTML page into XML.
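The sketch below shows why such a cleaning step is needed: HTML that browsers accept is often not well-formed XML, so an XML parser rejects it before any XPath can run. It uses the JDK's own XML parser and does not call WebHarvest's cleaning processor; the class name `WellFormedCheck` and the sample strings are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true only if the markup can be parsed as an XML document.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Typical browser-tolerated HTML: the <br> tag is never closed.
        System.out.println(isWellFormed("<p>one<br>two</p>"));
        // The same content after cleaning: <br/> is self-closed.
        System.out.println(isWellFormed("<p>one<br/>two</p>"));
    }
}
```

Only the second form can be loaded into an XML document and queried with XPath, which is the gap the cleaning processor fills.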
It provides a user-friendly GUI for developing, debugging and testing the code.