Web crawlers, also referred to as web spiders, have been studied extensively ever since the World Wide Web was launched. More recently, researchers have turned to crawlers that attempt to reach parts of the web that sit behind forms and therefore belong to the deep web.
According to recent research studies and statistics published by BrightPlanet in 2012, the deep web holds 400 to 500 times more data than can be found on the surface web. Automatic data harvesting from the deep web has become one of the most active research topics in the field of web crawlers.
Evaluation results have shown that AutoCrawler covers more dynamic interactions and fetches more valuable data from real-world web applications, which offers a new means of crawling large amounts of data from the deep web.
AutoCrawler's applications and the contributions of the research study:
Crawling modern AJAX-based web systems requires a different approach from the traditional one of extracting hypertext links from web pages and sending requests to the server. The study proposes an automated crawling technique for AJAX-based web applications, AutoCrawler, which is based on dynamic analysis of the client-side web user interface in embedded browsers.
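To make the contrast concrete, the following is a minimal sketch, not the tool's actual code. It uses only Python's standard `html.parser` to separate ordinary hyperlinks, which a traditional crawler would follow, from candidate clickable elements that only change the page through JavaScript. The sample page, the `ClickableFinder` class, the `find_targets` helper and the heuristic itself (any `onclick` attribute, any button, any anchor without a real URL) are all assumptions made for illustration. A crawler of the kind described here would fire events on such candidates inside an embedded browser rather than parse static HTML.

```python
from html.parser import HTMLParser


class ClickableFinder(HTMLParser):
    """Separates plain hyperlinks from candidate clickable elements.

    Hypothetical helper: the heuristic below is an illustrative
    assumption, not the rule set used by the tool in the study.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # URLs a traditional crawler would request
        self.candidates = []   # (tag, id) pairs worth clicking in a browser

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        is_plain_link = (
            tag == "a"
            and bool(href)
            and not href.startswith(("#", "javascript:"))
        )
        if is_plain_link:
            self.links.append(href)
        elif tag in ("a", "button") or "onclick" in attrs:
            self.candidates.append((tag, attrs.get("id")))


# Hypothetical page fragment mixing one real link with three
# elements whose behaviour is only visible by clicking them.
PAGE = """
<a href="/about.html">About</a>
<a href="#" onclick="loadNews()" id="news">News</a>
<div id="more" onclick="showMore()">More</div>
<button id="send">Send</button>
"""


def find_targets(html):
    """Return (plain links, candidate clickables) found in the markup."""
    finder = ClickableFinder()
    finder.feed(html)
    return finder.links, finder.candidates


if __name__ == "__main__":
    links, candidates = find_targets(PAGE)
    print("links:", links)
    print("candidate clickables:", candidates)
```

In this sample, link extraction alone reaches a single URL, while three elements remain that can only be explored by interacting with the page.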
The main contributions of this study are:
– An analysis of the main problems involved in crawling AJAX-based applications including pop-up windows and clickable elements.
– A systematic process and algorithm to drive an AJAX application and infer a state machine from the detected state changes and transitions. Challenges addressed include the identification of clickable elements, the detection of DOM changes, and the construction of the state machine.
– A concurrent multi-browser crawling algorithm, introduced via AutoCrawler, to improve runtime performance and increase the yield of material crawled from the deep web.
– The open-source tool CRAWLJAX, which implements the crawling algorithms used by AutoCrawler.
– Two case studies, covering seven AJAX applications, used to evaluate the effectiveness, performance, correctness, and scalability of the proposed crawling approach.
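The second contribution, inferring a state machine from detected state changes, can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the algorithm from the study: the browser is replaced by a hypothetical `TRANSITIONS` table, two DOMs count as the same state only if their text hashes match exactly, and the crawler is assumed to be able to click from any known state without backtracking through the application. All names in the block are made up for the example.

```python
import hashlib
from collections import deque

# Hypothetical stand-in for an embedded browser:
# (current DOM, clicked element) -> DOM after the click.
TRANSITIONS = {
    ("<p>home</p>", "news"): "<p>news</p>",
    ("<p>home</p>", "more"): "<p>more</p>",
    ("<p>news</p>", "home"): "<p>home</p>",
    ("<p>more</p>", "news"): "<p>news</p>",
}

# Assumed to be the clickable elements identified beforehand.
CLICKABLES = ["home", "news", "more"]


def click(dom, element):
    """Simulate firing a click; unknown clicks leave the DOM unchanged."""
    return TRANSITIONS.get((dom, element), dom)


def state_id(dom):
    """Identify a state by a hash of its DOM (an exact-match simplification)."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()[:8]


def infer_state_machine(initial_dom):
    """Breadth-first exploration that records states and transitions."""
    start = state_id(initial_dom)
    states = {start: initial_dom}   # state id -> DOM snapshot
    edges = []                      # (source id, clicked element, target id)
    queue = deque([initial_dom])

    while queue:
        dom = queue.popleft()
        src = state_id(dom)
        for element in CLICKABLES:
            new_dom = click(dom, element)
            dst = state_id(new_dom)
            if dst == src:
                continue            # no DOM change detected, so no transition
            edges.append((src, element, dst))
            if dst not in states:   # a state not seen before: explore it later
                states[dst] = new_dom
                queue.append(new_dom)
    return states, edges


if __name__ == "__main__":
    states, edges = infer_state_machine("<p>home</p>")
    for src, element, dst in edges:
        print(f"{src} --click {element}--> {dst}")
```

Exact hashing treats any difference in the DOM as a new state, which is one way the state explosion mentioned among the challenges can arise. A more tolerant comparison between DOM trees would merge near-identical states at the cost of possibly missing real ones.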
Although the study focuses on AJAX and associated web applications, the authors of the paper believe that the same approach could be applied to any DOM-based web application or website. Because the tool is open source and will soon be freely available for download, it should help to identify many interesting case studies in the near future.
The authors' future work will aim to strengthen the tool further by extending its functionality, improving its accuracy and performance, and refining the algorithms that deal with state explosion. This suggests that the yield from crawling AJAX-based web applications will increase considerably in the near future, which could substantially change the current definition of the deep web.