Data can be extracted from websites in several ways, for example with the HttpWebRequest/HttpWebResponse classes in C#/.NET, cURL in PHP, or the WebBrowser control in .NET.
In one of our projects we used Webharvest, an open source tool, for data extraction.
You can see the details of this project, the Yahoo news scraper, on our website.
Basically, the project collects news for a specified stock ticker from many publishers' websites and stores the extracted details in files organized by published date and time.
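As a rough illustration of storing results by ticker and published date and time, here is a minimal Java sketch. The directory layout, the file name pattern and the names `NewsFileNaming` and `outputPath` are assumptions made for this example, not the scheme our project actually uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: builds an output file path from a ticker and a published timestamp.
class NewsFileNaming {
    // Assumed pattern; it sorts chronologically and avoids characters that are illegal in file names.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss");

    static Path outputPath(String baseDir, String ticker, LocalDateTime published) {
        String symbol = ticker.toUpperCase(Locale.ROOT);
        String fileName = symbol + "_" + published.format(STAMP) + ".txt";
        // One sub-directory per ticker, one file per published timestamp.
        return Paths.get(baseDir, symbol, fileName);
    }

    public static void main(String[] args) {
        Path path = outputPath("news", "yhoo", LocalDateTime.of(2009, 3, 14, 9, 30, 0));
        System.out.println(path.getFileName()); // prints YHOO_20090314_093000.txt
    }
}
```

Putting the timestamp in the file name keeps each ticker's files in chronological order when they are listed alphabetically.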
Webharvest made the work very easy. If you are familiar with HTML, JavaScript and XML, you can learn Webharvest in two or three days.
It is written in Java, so it can be used on different platforms (Windows, Linux) without any issue.
Every web page is formatted differently, which makes data extraction difficult.
In Webharvest, we can use XPath and XQuery to extract data easily from websites that use different formats.
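To give a feel for XPath-based extraction, here is a minimal sketch that uses the XPath API bundled with the JDK rather than Webharvest itself. The sample markup, the `story` class name and the names `XPathSketch` and `headlines` are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: plain JDK XPath, not the Webharvest API.
class XPathSketch {
    // Returns the link text of every <div class="story"> in a well-formed XML page.
    static List<String> headlines(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//div[@class='story']/a/text()",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class=\"story\"><a href=\"/a\">First headline</a></div>"
                + "<div class=\"story\"><a href=\"/b\">Second headline</a></div>"
                + "<div class=\"ad\"><a href=\"/x\">Sponsored</a></div>"
                + "</body></html>";
        System.out.println(headlines(page)); // prints [First headline, Second headline]
    }
}
```

When a publisher uses a different page layout, only the XPath expression has to change; the surrounding code stays the same.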
Webharvest uses an XML-based configuration file to specify the extraction process.
Within this XML configuration file we can also use JavaScript code.
It seems to use a standard ECMAScript-based JavaScript engine, so JavaScript that works in Internet Explorer may not run in Webharvest. We may need to change the script slightly to make it work with Webharvest.
XPath can be used only with XML documents, so Webharvest has a cleaning processor that automatically converts an HTML page into XML.
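The following sketch shows why the cleaning step is needed. It does not perform any cleaning; it only uses the JDK's XML parser to show that ordinary HTML is often not well-formed XML. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: checks well-formedness, it does not clean HTML.
class WellFormedCheck {
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal parse errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> has no closing tag, so an XML parser rejects it.
        System.out.println(isWellFormedXml("<p>Line one<br>Line two</p>"));  // prints false
        // The same content after a cleaning step has closed the tag.
        System.out.println(isWellFormedXml("<p>Line one<br/>Line two</p>")); // prints true
    }
}
```

A cleaning processor turns the first form into something like the second, after which XPath and XQuery can be applied.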
It provides a user-friendly GUI for developing, debugging and testing the code.