ACM TechNews
Google Aims to Penetrate the Deep Web With HTML Forms Crawling
Computerworld (04/11/08) Havenstein, Heather
Google has been using HTML forms, including elements such as drop-down boxes and select menus, to find Web pages in the "Deep Web," content that is otherwise invisible to search engines. The company sees HTML forms crawling as another way to improve its coverage of the Web, and says the ability to lead users to documents in the Deep Web will ultimately enhance the search experience. For text boxes, Google's computers automatically choose words from the site that hosts the form; for select menus, check boxes, and radio buttons, the crawling and indexing team chooses from among the values listed in the form's HTML. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made," say Google's Jayant Madhavan and Alon Halevy in a blog post. "If we ascertain that the Web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other Web page." The team does not crawl sites that include instructions against crawling, and it skips forms that require passwords or personal information. The method does not affect page ranking.
http://www.computerworld.com.au/index.php/id;1832852726
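The URL-generation step described above can be illustrated with a minimal Python sketch. It is not Google's implementation: the function name `candidate_urls`, the `limit` parameter, the `example.com` address, and the sample form values are all hypothetical, and the sketch assumes a form that submits with GET so that each combination of input values maps to one URL.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, inputs, limit=100):
    """Build GET URLs for combinations of the values chosen for each form input.

    `inputs` maps an input name to the list of values selected for it.
    `limit` caps the number of URLs so large forms do not explode (an assumption
    of this sketch, not something the article describes).
    """
    names = sorted(inputs)  # fixed order keeps the output deterministic
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        if len(urls) >= limit:
            break
        query = urlencode(list(zip(names, combo)))
        urls.append(action + "?" + query)
    return urls


# Hypothetical form with one select menu and one text box.
form_inputs = {
    "state": ["CA", "NY"],       # values read from the form's <option> tags
    "q": ["library", "museum"],  # words picked from the site's own text
}
urls = candidate_urls("http://example.com/search", form_inputs)
for url in urls:
    print(url)
```

A real crawler would still need to honor each site's crawling instructions, skip forms that ask for passwords or personal information, and check that each resulting page is valid and not already indexed before keeping it.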
© Copyright 2008 Information, Inc. This service may be reproduced for internal distribution.