Google officially stated the other day that they’ll be crawling through HTML forms. It’s an interesting move — their stated goal is to increase their coverage of the web by adding this new aspect to their crawling. I’ll note right away that this is not an immediate general addition to their crawling practices:
“Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives.” (Google Webmaster Central Blog)
The key element is “only a small number of particularly useful sites.” One is bound to hope that they’ll be making these choices very carefully — but you never know. Will they tell you if they’re going to unleash their form-crawling ‘bot on your site?
So here’s the big question: is this going to benefit or harm your web site?
On the one hand, Google will be able to find more documents on your site. If you have pages which are only accessible via form submission (for example, behind entry pages which require you to select your country in order to continue), Google may now be able to locate them.
But on the other hand, Google may now be crawling pages which require agreement to terms of service before they can be visited. You may have assumed that these pages wouldn’t be indexed because they were hidden behind a form, but now these documents are popping up directly in Google’s results, and users are jumping straight to them without agreeing to the terms of use. Furthermore, Google may now be generating thousands of internal search results pages from your site’s own search form, which can cause duplicate content problems. Of course, Google may be able to detect this.
These scenarios are altogether hypothetical, of course, but they are both entirely possible.
One of the chief problems is that web developers and consultants have long based their choices at least partially on the assumption that Google and other search engines would not crawl forms. As such, using a form gateway was an easy way for reasonably knowledgeable developers to prevent the indexing of the content behind it. Certainly, there were other ways — but there was no reason not to do it like this.
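One of those other ways is an explicit robots.txt rule, which the Google statement quoted above says Googlebot adheres to. Here is a minimal sketch, using Python’s standard `urllib.robotparser`, of checking whether a rule set keeps a crawler out of the pages behind a form. The `/members/` path, the `example.com` URLs, and the helper name `crawler_may_fetch` are all made up for illustration; substitute your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of the area behind the form.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/members/report.html",
    ):
        print(url, crawler_may_fetch(ROBOTS_TXT, "Googlebot", url))
```

Keep in mind that robots.txt only asks well-behaved crawlers to stay away. It does nothing to stop a human visitor, so it is not a substitute for real access control.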
It’s a funny thing: the web is littered with web sites which are blocked from search engine access because of developer ignorance. These sites have blocked search engines through a carelessly created robots.txt file, through form-based entry pages, or through fully JavaScript-driven navigation menus. There’s a direct motivation for Google to develop methods to find this content. But there are also numerous examples of developers who have intentionally used the same methods to prevent content from being detected by search engines.
Is Google capable of differentiating by intention? I doubt it, at this point. In the long run, this is just another place where education of webmasters and developers will be critical. We’ll just have to hope that nobody has to learn from their mistakes the hard way, when Google deletes their entire website because they left their administrative area unsecured except by a form.
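To make the administrative-area worry concrete, here is a rough sketch of the kind of guard that keeps an automated visitor from triggering a delete. It assumes, and this is my assumption rather than anything Google has stated, that a crawler arrives unauthenticated and issues simple read-style requests. The function name `may_run` and its parameters are hypothetical and not tied to any particular framework.

```python
# HTTP methods that are meant to be read-only.
SAFE_METHODS = {"GET", "HEAD"}


def may_run(destructive: bool, method: str, authenticated: bool) -> bool:
    """Decide whether a request is allowed to perform its action.

    Read-only actions are always allowed. Destructive actions (delete,
    update) need both a non-safe HTTP method and a logged-in user, so
    merely following a link or submitting a read-style form cannot
    trigger them.
    """
    if not destructive:
        return True
    return method.upper() not in SAFE_METHODS and authenticated


if __name__ == "__main__":
    # An unauthenticated read-style request to a delete action is refused.
    print(may_run(destructive=True, method="GET", authenticated=False))
    # A logged-in administrator posting the same action is allowed.
    print(may_run(destructive=True, method="POST", authenticated=True))
```

The point of the design is that the form itself is never the lock: the check happens on the server, whatever the visitor did to get there.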
Whether the ability of search engines to crawl forms is a good thing or a bad thing depends a lot on the ability of search engines to choose whether they should submit a form. Forms are used for so many reasons online that it’s hard to imagine a logical means to reliably detect the search value of indexing the results of a form. Ultimately, I hope that when this comes into general use (if it ever does) Google will take some steps to warn webmasters of the potential results — or even create a new robots meta tag which allows them to prevent form crawling.



Overall I think it’s a good thing… If it helps to uncover more of the web then it’s going to be easier to find what we want. Matt Cutts’s post on the topic points out some great benefits too.
Comment by ineedhits Online Advertising — May 8, 2008 @ 10:02 pm