Preventing Duplicate Content in Google, Yahoo, and other Search Engines

Google, Yahoo, Bing, and other search engines have come a long way in terms of handling duplicate content. Although detected duplicate content doesn’t necessarily hurt how search engines perceive the content itself, it can hurt your site by splitting the users (and the link value) that reach it across multiple URLs. Search engines will most likely still index your duplicate content, so when a search is performed, visitors could be directed to any version of the duplicated page.

There are several ways to fix duplicate content issues. Let’s set up a scenario to see how.

We have an e-commerce site; let’s call it www.store.com.

The products within the store are displayed with several parameters because we’ve categorized them in multiple ways. As such, product 1 can be accessed via any of the URLs below:

www.store.com/product.php?product_id=1&product_group_id=1&product_family_id=1&tabno=1
www.store.com/product.php?product_group_id=1&product_family_id=1&product_id=1&tabno=1
www.store.com/product.php?product_family_id=1&product_group_id=1&product_id=1&tabno=1
www.store.com/product.php?product_id=1

Or any number of other combinations of parameter order.

Even though the URLs are all different, they serve up the same content: a page with the details of the product. Search engines index every version of that URL they discover via links or any other means; as long as they can crawl it, it will be indexed. As such, we end up with one page that could be indexed multiple times under separate URLs even though it has the same content.

The best way to avoid this problem is to rewrite the parameters into a more SEO-friendly URL, as sketched below. However, if your e-commerce platform or server does not allow for URL rewriting, there are other methods, which we describe below.
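For example, on an Apache server with mod_rewrite enabled (an assumption; other servers have equivalent modules), a couple of lines in your .htaccess can map a clean, hypothetical URL such as www.store.com/product/1 onto the real script:

# A minimal sketch, assuming Apache with mod_rewrite available
RewriteEngine On
# Internally map /product/1 to /product.php?product_id=1;
# [0-9]+ captures the numeric product id, and [L] stops further rewriting
RewriteRule ^product/([0-9]+)$ /product.php?product_id=$1 [L]

You would then link to the clean /product/1 form everywhere, so crawlers only ever encounter one URL per product.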

Canonical Tag

The most widely accepted way to make sure that search engines know the correct URL to point to is the canonical tag. The canonical tag is placed between the <head></head> tags in your HTML code and is written as follows:

<link rel="canonical" href="[URL]" />

For our example, we can write the following:

<link rel="canonical" href="http://www.store.com/product.php?product_id=1" />

We’ve chosen this URL because it’s the shortest, simplest version of the URL that still displays our content correctly.
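Since product.php is presumably PHP, the tag can be generated from the product_id query parameter instead of hard-coded on every page. Here’s a minimal sketch; the exact template structure is an assumption:

<?php
// A minimal sketch, assuming product.php receives product_id in the query string.
// Cast to int so arbitrary query input can't inject markup into the tag.
$productId = isset($_GET['product_id']) ? (int) $_GET['product_id'] : 0;
if ($productId > 0) {
    echo '<link rel="canonical" href="http://www.store.com/product.php?product_id='
        . $productId . '" />';
}
?>

Placed between the <head></head> tags, this emits the same canonical URL no matter which parameter combination the visitor arrived with.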

Robots.txt

Another method is to use robots.txt, a text file placed in the root of your site that directs what robots should and shouldn’t crawl within your site structure. With robots.txt, you can tell search engine robots to avoid URLs with specific parameters. For our scenario, the robots.txt could be written as follows:

User-agent: *
Disallow: /product.php?product_family_id=
Disallow: /product.php?product_group_id=
Disallow: /product.php?tabno=

The first line, User-agent: *, indicates that the rules apply to all search engine robots. The next few lines specifically tell these robots not to crawl anything with the product_family_id, product_group_id, or tabno parameters in the URL. We don’t want to list the product_id parameter because it is the essential parameter we’re using for the canonical URL of the product.

The caveat is that a Disallow rule is a simple prefix match. As such, it will only stop search engine robots from crawling URLs where the specified parameter is the first one listed in the query string. In other words, the following:

Disallow: /product.php?product_family_id=

only applies to URLs that look like the following:

www.store.com/product.php?product_family_id=1&product_group_id=1&product_id=1&tabno=1
www.store.com/product.php?product_family_id=1&product_id=1&product_group_id=1&tabno=1

Note that product_family_id is the first parameter that shows up in each URL.
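If the crawlers you care about support wildcards in robots.txt (Google and Bing do, as a non-standard extension, though not every robot does), you can match a parameter wherever it appears in the query string:

User-agent: *
# The * matches any run of characters, so these rules block the
# parameter no matter where it sits in the query string
Disallow: /product.php?*product_family_id=
Disallow: /product.php?*product_group_id=
Disallow: /product.php?*tabno=

Keep in mind that robots.txt prevents crawling rather than indexing, so a blocked URL can still surface if other sites link to it; the canonical tag remains the stronger signal.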

Google Webmaster Tools – Parameter Handling

If you have Google Webmaster Tools set up for your site, you can use a feature called Parameter Handling to combat duplicate content issues. This, of course, only works for Google search engine indexing.

To use Parameter Handling, within Google Webmaster Tools, navigate to the following:

Dashboard > Settings > Parameter Handling > Adjust Parameter Settings

Within that, you’ll want to set up something similar to the following:

[Screenshot: Parameter Handling settings in Google Webmaster Tools]

This will tell Google to ignore the product_family_id, product_group_id, and tabno parameters, but not the product_id parameter, so URLs that differ only by the ignored parameters are treated as the same page.

Yahoo Site Explorer – Dynamic URLs

If you have Yahoo Site Explorer set up for your site, you can use a feature called Dynamic URLs to combat duplicate content issues. Again, this only works for Yahoo search engine indexing.

To use Dynamic URLs, within Yahoo Site Explorer, navigate to the following:

My Sites > Actions > Dynamic URLs

Within that, you’ll want to set up something similar to the following:

[Screenshot: Dynamic URLs settings in Yahoo Site Explorer]

This will tell Yahoo to remove any URLs containing the listed parameters from its index.

These are just some of the methods for fighting duplicate content issues within search engines. If you have any other suggestions, feel free to leave us a comment.
