Jan 28, 2009: I just put this site up, so hopefully it will expand a bit each day.
(This page is under construction).
Building Web Crawlers, Spiders, Robots and Agents - Intelligent Search Engines for Research
Web crawlers (also known as spiders, robots, and agents) are simply programs that seek out internet pages and content according to a set of rules. Those rules can be anything you decide. For this reason, web crawlers make excellent research tools, hence my interest in them as a researcher.
Just to clear the air, web crawlers are *NOT* viruses, which is a common misconception. Crawlers never infect anything. You can think of a crawler as being no different from your web browser - you give it a URL address like www.mypage.com and it waits for the computer on the other end to deliver the web page back to you. The difference is that a web crawler does more than just show you the retrieved page. A web crawler is capable of interpreting the data on the page to determine:
- How is the DOM (Document Object Model) of the page implemented? How is it structured?
- What are the links on the page and what do they refer to?
- Are there tables and how is the data organized?
- What types of content exist, such as text, images, and other media like video and sound?
- Based on the content, should additional links be added for further crawling?
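The parsing step described above can be sketched with Python's standard HTML parser (the author works in Perl; this is an illustrative outline, and the class and attribute names are my own, not the author's):

```python
from html.parser import HTMLParser

class PageAnalyzer(HTMLParser):
    """Collects links, tables, and media references from one page."""
    def __init__(self):
        super().__init__()
        self.links = []    # href targets, candidates for further crawling
        self.tables = 0    # how many <table> elements were seen
        self.images = []   # img src attributes (media content)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "table":
            self.tables += 1
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

# A tiny made-up page fragment, just to exercise the parser:
html = '<a href="http://example.com/next">next</a><table><tr><td>1</td></tr></table>'
analyzer = PageAnalyzer()
analyzer.feed(html)
print(analyzer.links)   # links found on the page
print(analyzer.tables)  # number of tables found
```

A real crawler would walk the full DOM in this same event-driven fashion, deciding from each tag what to record and which links to queue for the next crawl.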
Once the structure has been determined, all of this information can be placed into a database for later analysis.
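As a rough sketch of that storage step, here is one possible two-table layout in SQLite (the schema is my own assumption for illustration; the author used Microsoft Access):

```python
import sqlite3

# In-memory database for illustration; a real crawler would use a file.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pages (
    url TEXT PRIMARY KEY,
    num_tables INTEGER,
    num_links INTEGER)""")
conn.execute("CREATE TABLE links (source_url TEXT, target_url TEXT)")

def store_page(url, links, num_tables):
    """Record one crawled page and the outgoing links found on it."""
    conn.execute("INSERT INTO pages VALUES (?, ?, ?)",
                 (url, num_tables, len(links)))
    conn.executemany("INSERT INTO links VALUES (?, ?)",
                     [(url, target) for target in links])
    conn.commit()

store_page("http://example.com/", ["http://example.com/a"], 2)
row = conn.execute("SELECT num_links FROM pages").fetchone()
print(row[0])
```

Keeping links in their own table makes later analysis easy - for instance, finding every page that points at a given URL is a single query.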
To return results that are far more specific and better organized, I began optimizing the database structure, initially using Microsoft Access so it could easily be modified, tested, and modified again. I have also found Perl to be best suited for the massive amounts of text processing needed to data mine and index related words and relationships. I should note that Perl is a common language available for different operating systems and a good candidate for this task, but Access was great for prototyping my original concepts and for resolving some challenges in the beginning.
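The kind of word indexing described above can be sketched in a few lines (shown in Python rather than the author's Perl; the tokenizing rule is a simplifying assumption):

```python
import re
from collections import Counter

def index_words(text):
    """Tokenize page text and count word frequencies - the first step
    toward indexing related words and relationships."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = index_words("Crawlers crawl pages; pages link to pages.")
print(counts["pages"])  # 3
```

Counts like these, stored per page, let you later ask which pages are most strongly related to a given research term.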
One of the most important issues I solved was the case of some web servers not serving the correct content unless my crawler's request headers - its User-Agent signature - matched those of a standard browser. Once I matched the browser signatures, the pages were served correctly to my crawler. I suspect many other crawlers do not do this, and therefore do not capture the proper content. As a side note, I have also encountered challenge-response scripts which test whether the client is a browser or a crawler. For these sites, some custom programming resolved the issue.
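Matching a browser signature amounts to sending browser-like request headers. A minimal sketch with Python's standard library (the User-Agent string below is a hypothetical example, not the author's actual signature):

```python
import urllib.request

# Hypothetical browser-like headers; some servers only serve the full
# page when the request resembles a standard browser's.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36",
    "Accept": "text/html",
}

def fetch(url):
    """Fetch a page while presenting browser-like headers."""
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

# Inspect what the crawler would send (no network needed for this check):
req = urllib.request.Request("http://example.com/", headers=BROWSER_HEADERS)
print(req.get_header("User-agent"))
```

Challenge-response checks are a separate problem: they typically require executing a script or answering a test, which is where the custom programming mentioned above comes in.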
If you would like to build a customized Web Crawler for your research needs, I can perform these services.
For those who wish to build their own, I would like to recommend the site ActiveState.com, which produces a Windows version of Perl under the product name ActivePerl; I have been using it in my spare time to test various "spidering" techniques. If you do install Perl, be sure to also install the LWP module, since it is invaluable for doing any type of network communication over the internet.