feedthebot.com How to follow the Google webmaster guidelines





This Google guideline states...


Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.


Definitions



search bots - Also known as "crawlers" or "spiders", search bots follow links to web pages and then read and retain the information they find. This information eventually becomes the "copy" of a website in a search engine index. This process is often referred to as "crawling" the web.

session ID - a unique identifier string, often appended to the end of a URL as a parameter, used to track a visitor. For example:

www.example.com/widget.php?PHPSESSID=01elm211kprftyhcliulbgekf5cbys6 (the "PHPSESSID=..." part after the question mark is the session ID)

arguments - parameters in a URL, for example:

www.example.com/something/lookat.php?OrderBy=avail&PropType=&PHPSESSID=

In the above example, "OrderBy=avail" is an argument / parameter, and "&PropType=" and "&PHPSESSID=" are also arguments / parameters. Each one is a part of the URL that passes a named input or value to the page, depending on what a visitor is looking for.
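To make the definition concrete, here is a minimal Python sketch that splits the example URL into its arguments using the standard library. The "http://" scheme and the function name `list_arguments` are assumptions added for illustration; they are not part of the guideline.

```python
from urllib.parse import urlsplit, parse_qsl

# Example URL from the definition above ("http://" added as an assumption).
url = "http://www.example.com/something/lookat.php?OrderBy=avail&PropType=&PHPSESSID="

def list_arguments(url):
    """Return the (name, value) pairs found in the query string of a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps arguments such as "PropType=" that carry no value.
    return parse_qsl(query, keep_blank_values=True)

for name, value in list_arguments(url):
    print(f"argument {name!r} has value {value!r}")
```

Everything after the question mark is the query string, and each "name=value" pair separated by "&" is one argument.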




Examples and Explanations


 

This guideline tells us that the use of session IDs and other tracking methods may cause your site to be indexed improperly or not at all. Bots may interpret the tracking arguments or session IDs of one page as several different pages. The phrase "bots may not be able to eliminate URLs" means that these techniques may leave duplicate copies of your web page in the index.

Most websites do not put session IDs in their URLs. The websites that do are typically made of dynamically generated pages.

If you do not use session IDs or arguments, this guideline does not apply to you.

This guideline is one of several guidelines that emphasize that your site needs to be search engine crawler friendly.

Session IDs can cause several problems with how a website is indexed and ranked in search engines. Here are a couple of examples:

 

Duplicate content -

 

One result of session IDs is duplicate content.

If you have a webpage about dogs, the URL might be...

www.mywebsite.com/dogs.php

If you are using session IDs, a user who goes to that page sees a URL such as...

www.mywebsite.com/dogs.php?PHPSESSID=01elm211kprftyhcliulbgekf5cbys6

This does not really affect the user, but to a search engine crawler it may (and usually does) appear to be a different page.

A search engine crawler might index a thousand different versions of your "dog" page - even though the only difference is the URL and the session ID at the end of it.

This can result in a search engine indexing multiple copies of what is in reality just one page.
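The effect can be sketched in a few lines of Python: once the session ID argument is removed, URLs that looked different collapse back into the single page they all point to. The parameter names in `SESSION_PARAMS`, the function name, the "http://" scheme, and the second session ID value are made up for illustration; this is only a sketch of the idea, not how any search engine actually works.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of argument names treated as session identifiers.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}

def strip_session_id(url):
    """Return the URL with any session ID arguments removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

# Three URLs a crawler might collect; the second session ID is invented.
crawled = [
    "http://www.mywebsite.com/dogs.php?PHPSESSID=01elm211kprftyhcliulbgekf5cbys6",
    "http://www.mywebsite.com/dogs.php?PHPSESSID=7qk2v9d0m4x8b1n6c3z5a0p2r7t9w1y4",
    "http://www.mywebsite.com/dogs.php",
]

# A set keeps only distinct values, so duplicates disappear after stripping.
unique_pages = {strip_session_id(u) for u in crawled}
print(unique_pages)
```

A crawler that cannot do this kind of clean-up reliably is left with three entries for what is really one page.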

Multiple copies of one page on your website do not follow the guidelines. They may look like an attempt to manipulate search engine results and can be deemed "spam".

 

URLs longer than 255 bytes

 

Session IDs can make a URL longer than 255 bytes. The HTTP/1.1 specification (RFC 2616) cautions against depending on URLs longer than 255 bytes, because some older clients and proxies may not handle them properly.

This can result in the Google crawl error "URLs not followed / Redirect URL too long".
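A quick way to check your own URLs against the 255 byte figure is to measure them. The sketch below is only an illustration: the function names are hypothetical, the "http://" scheme is assumed, and the long session ID is artificial padding.

```python
MAX_URL_BYTES = 255  # the length discussed in this section

def url_length_in_bytes(url):
    """Return the length of the URL in bytes (UTF-8 encoded)."""
    return len(url.encode("utf-8"))

def is_too_long(url, limit=MAX_URL_BYTES):
    """Return True when the URL is longer than the limit."""
    return url_length_in_bytes(url) > limit

short_url = "http://www.mywebsite.com/dogs.php"
# Artificially long session ID, used only to push the URL past the limit.
long_url = short_url + "?PHPSESSID=" + "a" * 300

for url in (short_url, long_url):
    print(url_length_in_bytes(url), "bytes, too long:", is_too_long(url))
```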

 

Solutions

The best way to know that your site is being crawled properly is to join the Google sitemap program in Google Webmaster Tools.

Once you have a Google sitemap, you will be told what problems (if any) Googlebot is having with your site.

 

 

How to determine if your website is following this guideline



Checking for a clear hierarchy


A good way to determine if your site is organized well is to make a site map page or review your current site map page. Since a site map page is like an outline of your website's pages, it gives a good organizational view of your site. If that outline is clear and makes sense, you probably have a clear hierarchy. Most small sites have no problem with this.

 

Checking for static text link navigation

Google recommends viewing your page in a text browser to see how a search engine spider may see your site.

A quick check to see what links are visible to a search engine spider can be done through a search engine spider simulator. There are many available on the web, or use our spider simulator.
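For a rough idea of what such a simulator does, the Python sketch below collects the plain text links from a piece of HTML using only the standard library. The class name, function name, and sample HTML are hypothetical, and a real crawler does far more than this; it is only meant to show that ordinary "a href" links are easy to find, while navigation built from scripts is not picked up this way.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def visible_links(html):
    """Return the list of link targets found in plain <a href> tags."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Made-up page fragment: two plain links and one script-driven element.
sample = (
    '<a href="/dogs.php">Dogs</a> '
    '<a href="/cats.php?PHPSESSID=abc123">Cats</a> '
    '<span onclick="go()">Birds</span>'
)

for link in visible_links(sample):
    print(link)
```

Running it on your own pages also makes session IDs easy to spot, since they show up in the printed link targets.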



Key Points

- The use of session IDs and excessive arguments may confuse bots and cause your site to be poorly indexed.

- Bots may not recognize that different URLs point to the same content, which may result in duplicate content.

 



 

