Content on a website’s pages can appear on a search engine’s results page only if it has been indexed, or “crawled”, by the search engine’s “web crawlers”, also called “web spiders”. A web crawler is a script that traverses the internet to index websites’ content. Conventional web crawlers can only index websites that belong to the surface web; they cannot index pages that require specific forms of input, e.g. filling in forms or authenticating with a passphrase. To address this problem, hidden web crawlers have been developed that rely on automatic result classification, authentication and navigation. Because these hidden web crawlers can crawl content that belongs to parts of the deep web, this content can be indexed by search engines and appear in search results.
A couple of researchers from Rochester Institute of Technology (RIT) published a paper that presents a new technique for crawling parts of the deep web. The paper also introduced a new feature that exhibited high efficiency in indexing hidden web content and classifying it according to search keywords.
Classification of Hidden Web Crawlers:
Because users can browse hidden content only by filling in special forms, passing authentication or issuing certain queries, conventional web crawlers cannot index it. By contrast, hidden web crawlers can index such content, including dynamic web content that depends on specific user input. Hidden web crawlers can be classified in two ways: by the crawling method and by the search keyword selection method.
Classification Based On The Crawling Method:
a- Breadth oriented crawling: deep web crawlers of this type rely largely on traversing a wide range of URL sources, rather than repeatedly crawling content that belongs to a limited set of URL sources.
b- Depth oriented crawling: deep web crawlers of this type rely on extracting the maximum possible amount of data from the same URL source.
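The difference between the two crawling methods can be illustrated with a minimal sketch. It assumes a toy in-memory map from each URL source to the pages it exposes; the function names, the `budget` parameter and the example hosts are hypothetical and are not taken from the paper.

```python
from itertools import chain, zip_longest

_SKIP = object()  # sentinel for sources that have run out of pages


def breadth_oriented(sources, budget):
    """Visit pages round-robin so that as many sources as possible are touched."""
    interleaved = chain.from_iterable(
        zip_longest(*sources.values(), fillvalue=_SKIP)
    )
    order = [page for page in interleaved if page is not _SKIP]
    return order[:budget]


def depth_oriented(sources, budget):
    """Exhaust one source completely before moving on to the next one."""
    order = list(chain.from_iterable(sources.values()))
    return order[:budget]


# Hypothetical sources; a real crawler would discover pages over the network.
sources = {
    "a.example": ["a1", "a2", "a3"],
    "b.example": ["b1", "b2"],
    "c.example": ["c1"],
}
breadth_order = breadth_oriented(sources, budget=4)
depth_order = depth_oriented(sources, budget=4)
```

With the same budget of four page visits, the breadth oriented order reaches all three sources, while the depth oriented order drains the first source before it touches the second.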
Classification Based On The Keyword Selection Method:
a- Random: this method relies on using a random dictionary to obtain the keyword(s) needed to fill forms. In some cases, the dictionary used is domain specific.
b- Generic frequency: this method depends on the generic frequency distribution of the keywords used in filling forms. This helps yield more matching results and reduces the time spent in form filling.
c- Adaptive method: crawlers that rely on this method analyze data derived from queries and shortlist the search keywords that yield the most content. Using these shortlisted keywords, the hidden web crawler can create queries that yield maximum results.
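The three keyword selection methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the generic frequency is approximated by word counts over a sample token list, and the adaptive step simply probes each candidate through a caller-supplied `run_query` function and keeps the keywords that return the most results.

```python
import random
from collections import Counter


def pick_random(dictionary, rng):
    """Random method: draw one keyword from a (possibly domain-specific) dictionary."""
    return rng.choice(dictionary)


def pick_by_generic_frequency(corpus_tokens, k):
    """Generic frequency method: prefer the k most frequent words in a corpus."""
    return [word for word, _ in Counter(corpus_tokens).most_common(k)]


def pick_adaptive(candidates, run_query, k):
    """Adaptive method: probe each candidate and shortlist the top k by result count."""
    yields = {kw: len(run_query(kw)) for kw in candidates}
    return sorted(yields, key=lambda kw: yields[kw], reverse=True)[:k]


# Hypothetical stand-in for a site's search form: keyword -> matching pages.
fake_index = {"flu": ["p1", "p2", "p3"], "diet": ["p4"], "sleep": ["p5", "p6"]}
tokens = ["flu", "flu", "flu", "diet", "diet", "sleep"]

random_choice = pick_random(list(fake_index), random.Random(0))
frequent = pick_by_generic_frequency(tokens, k=2)
shortlist = pick_adaptive(list(fake_index), lambda kw: fake_index.get(kw, []), k=2)
```

In a real crawler, `run_query` would submit the form and parse the result page, so the adaptive method costs extra requests up front in exchange for better keywords later.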
The Hidden Web Crawling Technique Proposed in the Paper:
The project described in the paper focused on crawling hidden web content by filling a single text field on website forms, which reduces the complexity of the form filling task. The researchers used health-related websites for testing purposes. The keywords used for filling the forms were obtained from websites similar to wordstream.com. The crawlers were designed to check the “robots.txt” files of websites before crawling their content. A GET/POST request was used to submit the forms; however, as most websites utilize an API key system for GET/POST requests, the project used the “Selenium WebDriver” tool to overcome this problem.
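The two steps described above, checking “robots.txt” and submitting a single text field through a browser, might look roughly like the sketch below. The paper's own code is not reproduced here: the function names, the user agent string, the example URLs and the choice of Firefox are assumptions, and the Selenium part requires the third-party `selenium` package plus a matching browser driver.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def submit_single_field(url, field_name, keyword):
    """Fill one text field and submit its form through a real browser session."""
    # Imported lazily so the robots.txt helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumption: any supported browser would do
    try:
        driver.get(url)
        field = driver.find_element(By.NAME, field_name)
        field.send_keys(keyword)
        field.submit()
        # A real crawler would likely wait for the result page to finish loading.
        return driver.page_source
    finally:
        driver.quit()


# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
can_search = allowed_by_robots(
    ROBOTS_TXT, "hidden-web-sketch", "https://health.example/search"
)
can_private = allowed_by_robots(
    ROBOTS_TXT, "hidden-web-sketch", "https://health.example/private/records"
)
```

A crawler built this way would call `submit_single_field` only for URLs where `allowed_by_robots` returns True.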
The results of implementing this new deep web crawling concept were interesting. The authors of the paper successfully tested their crawlers on three health-related websites. By applying keyword selection methods to classify search result pages, the keywords used in this project were classified as low coverage, medium coverage or high coverage, depending on the search results yielded by each keyword. The submission efficiency of the hidden web crawler reached an average of 63.6%, which is relatively promising when compared to the results of previous works.
The researchers stated that combining CALA, a web page classification method that is based on URLs, with their new hidden web crawling concept could yield even more promising results, as this would boost the performance of the web crawlers by generating more accurate patterns for link extraction. The authors stated that combining CALA with their new deep web crawler will be the subject of some of their future work.
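The coverage classes and the efficiency figure can be made concrete with a small sketch. The paper's exact thresholds and its definition of efficiency are not given in this article, so both are assumptions here: the cut-off values are hypothetical, and efficiency is taken to mean the percentage of form submissions that return at least one result.

```python
def coverage_class(result_count, low_max=10, high_min=100):
    """Label a keyword by the number of results it yields (thresholds are assumed)."""
    if result_count >= high_min:
        return "high"
    if result_count > low_max:
        return "medium"
    return "low"


def submission_efficiency(result_counts):
    """Percentage of submissions that returned at least one result (assumed definition)."""
    if not result_counts:
        return 0.0
    successful = sum(1 for count in result_counts if count > 0)
    return 100.0 * successful / len(result_counts)


# Hypothetical result counts for four form submissions.
counts = {"flu": 250, "diet": 42, "sleep": 3, "zzzz": 0}
labels = {kw: coverage_class(n) for kw, n in counts.items()}
efficiency = submission_efficiency(list(counts.values()))
```

Under these assumptions, three of the four sample submissions return results, so the sketch reports an efficiency of 75%.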