By Cade Metz for The Register. This
story was reproduced with permission.
"In the past few months we have been exploring some HTML forms
to try to discover new web pages and URLs that we otherwise
couldn't find and index for users who search on Google," Googlers
Jayant Madhavan and Alon Halevy have told the world from the
company's Webmaster Central Blog. "Specifically, when we encounter a
'Form' element on a high-quality site, we might choose to do a
small number of queries using the form."
In essence, Googlebots are plugging data into such forms - much
as an ordinary web surfer would. This generates a new page, and
then the bots crawl that page. "For text boxes, our computers
automatically choose words from the site that has the form; for
select menus, check boxes, and radio buttons on the form, we choose
from among the values of the HTML," Madhavan and Halevy continue.
"Having chosen the values for each input, we generate and then try
to crawl URLs that correspond to a possible query a user may have
made.
"If we ascertain that the web page resulting from our query is
valid, interesting, and includes content not in our index, we may
include it in our index much as we would include any other web
page."
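
As a rough illustration of the process Madhavan and Halevy describe, the Python sketch below reads a form, takes candidate values for select menus, check boxes, and radio buttons from the HTML, fills text boxes with words from the site, and builds a small number of query URLs. Google has not published its implementation; the names `FormParser`, `candidate_urls`, `site_words`, and `limit` are hypothetical, and the sketch assumes a form submitted with GET.

```python
from html.parser import HTMLParser
from itertools import islice, product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect a form's action URL and the candidate values of each input."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}     # field name -> list of candidate values
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
        elif tag == "select":
            self._select = a.get("name")
            if self._select:
                self.fields.setdefault(self._select, [])
        elif tag == "option" and self._select and a.get("value"):
            self.fields[self._select].append(a["value"])
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in ("radio", "checkbox") and a.get("value"):
                self.fields.setdefault(a["name"], []).append(a["value"])
            elif kind == "text":
                # Text boxes carry no values; they are filled from site words.
                self.fields.setdefault(a["name"], [])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, html, site_words, limit=5):
    """Return at most `limit` URLs a user's form submission could produce."""
    parser = FormParser()
    parser.feed(html)
    names = list(parser.fields)
    if not names:
        return []
    # Fields without values in the HTML (text boxes) fall back to site words.
    choices = [parser.fields[name] or site_words for name in names]
    base = urljoin(page_url, parser.action)
    combos = islice(product(*choices), limit)  # keep the query count small
    return [base + "?" + urlencode(dict(zip(names, combo))) for combo in combos]
```

The `limit` argument stands in for the "small number of queries" in the blog post. Deciding whether a resulting page is valid, interesting, and new to the index is a separate step that this sketch does not attempt.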
Of course, they insist these bots would never do evil. "Needless
to say, this experiment follows good Internet citizenry practices.
Only a small number of particularly useful sites receive this
treatment, and our crawl agent, the ever-friendly Googlebot, always
adheres to robots.txt, nofollow, and noindex directives. That means
that if a search form is forbidden in robots.txt, we won't crawl
any of the URLs that a form would generate."
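
The robots.txt part of that promise can be sketched with Python's standard `urllib.robotparser` module, which checks a URL against a site's rules. This is only an illustration of the idea, not Googlebot's actual logic; the helper name `allowed_urls` and the sample rules are assumptions, and nofollow and noindex handling are not covered.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, agent="Googlebot"):
    """Keep only the form-generated URLs that robots.txt lets `agent` fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse rules from text, no network
    return [url for url in urls if rules.can_fetch(agent, url)]


# Hypothetical robots.txt that forbids the search form's path.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /search\n"
```

With these sample rules, a generated URL such as `http://example.com/search?q=jazz` is dropped because its path falls under the disallowed `/search` prefix, while a URL outside that path is kept.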
This is all part of Google's plan to index stuff it's never
indexed before. "HTML forms have long been thought to be the
gateway to large volumes of data beyond the normal scope of search
engines," the blog post concludes. "The terms Deep Web, Hidden Web,
or Invisible Web have been used collectively to refer to such
content that has so far been invisible to search engine users. By
crawling using HTML forms (and abiding by robots.txt), we are able
to lead search engine users to documents that would otherwise not
be easily found in search engines, and provide webmasters and users
alike with a better and more comprehensive search experience."
According to a blog post from a researcher who once worked with Halevy on similar
technology, Google's new form-happy bots grew out of work done by
Transformic, a company Google acquired back in 2005. "The
Transformic team have been working hard for the past two years
perfecting the technology and integrating it into the Google
crawler," writes Anand Rajaraman.
© The Register 2008