Locating Duplicate Content within Your Website
If achieving top search engine rankings in Google is important to you, make sure your site does not have problems with duplicate content. Below are some ways to identify duplicate content and keep it from diluting your website’s theme.
Duplicate Content – Blogs
Blogs are a great way to share information and interact with your web visitors. However, certain features within a blog can automatically generate multiple web pages with the same content, causing duplicate content problems.
Category pages, trackback URLs, archives, and RSS feeds are created automatically by blog programs such as WordPress and should be dealt with as soon as possible.
To keep these areas of your blog from causing duplicate content, you can tell the search engines not to index the specific directories where the duplicate content resides.
Keep in mind that you often will not find these directories on the server itself; they may be generated on the fly by a call to your database.
Add the following to the robots.txt file, under a User-agent line such as User-agent: *, to keep search engines away from the duplicate content that WordPress creates:
- Disallow: /category/
- Disallow: /trackback/
- Disallow: /feed/
The Disallow directives listed above tell Google not to crawl any pages within these folders. This lets you control, at the folder level, what Google does and does not index within your website. If you do not want specific files indexed, you will also need to use the meta robots tag at the page level.
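As a rough way to see how these folder-level rules behave, the sketch below feeds them to Python's standard-library urllib.robotparser module. The example.com URLs, the is_crawlable helper, and the User-agent: * line are assumptions added for illustration, and this parser only approximates how Googlebot itself reads the file. Note that the leading dashes in the list above are list formatting and do not belong in the actual file.

```python
from urllib.robotparser import RobotFileParser

# The rules from the list above, placed under a catch-all User-agent line.
# (The User-agent line is an assumption; the list shows only the rules.)
ROBOTS_TXT = """\
User-agent: *
Disallow: /category/
Disallow: /trackback/
Disallow: /feed/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URLs: two inside blocked folders, one ordinary post.
for url in (
    "https://example.com/category/seo/",
    "https://example.com/feed/",
    "https://example.com/2009/05/my-post/",
):
    print(url, "->", "allowed" if is_crawlable(url) else "blocked")
```

Because the rules match by path prefix, everything under /category/ and /feed/ is reported as blocked, while the post URL stays allowed.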
Duplicate Content – Content Management Systems
A content management system (CMS) is one of the most convenient ways to add copy to your website without needing a web designer each time a change has to be made.
These systems are easy to use and built so that almost anyone can get started without a lot of training.
Often, however, content management systems create duplicate content by serving the same page in several versions for visitors.
Two of the biggest culprits are:
- Printer Friendly Versions
- Downloadable Versions (Word Docs / PDF files)
There is nothing wrong with having printer-friendly and multi-format versions on your website. However, they offer no benefit to the search engines, so it is in your best interest to disallow them in the robots.txt file. Below is an example of how you can prevent Google from indexing these types of duplicate pages:
- Disallow: /printer-friendly/
- Disallow: /pdf/
- Disallow: /word/
Keep in mind that the rules shown above are only examples. You will need to find the actual location of these folders on your own site and adjust the robots.txt file to match.
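Once you know which folders hold the duplicate versions on your own site, the rules can be generated instead of typed by hand. The following is a minimal sketch, not a definitive tool: build_robots_txt is a hypothetical helper, the folder names are the example ones from the list above, and the User-agent: * line is an assumption.

```python
def build_robots_txt(folders, user_agent="*"):
    """Build robots.txt text that disallows each folder in `folders`."""
    lines = [f"User-agent: {user_agent}"]
    for folder in folders:
        name = folder.strip("/")
        if not name:
            # An empty name would produce a rule for the site root,
            # which would block the whole site, so refuse it.
            raise ValueError("refusing to disallow the site root")
        lines.append(f"Disallow: /{name}/")
    return "\n".join(lines) + "\n"


# Example folder names only; replace them with the real ones on your site.
print(build_robots_txt(["printer-friendly", "pdf", "word"]))
```

The helper normalizes leading and trailing slashes so that "pdf", "/pdf", and "pdf/" all produce the same folder-level rule.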
If you wish to check how your changes affect your website, you can use the robots.txt tool in the Google Webmaster Console, which shows you which folders Googlebot is and is not allowed to crawl.
One last note regarding the robots.txt file.
Never place the following in your robots.txt file:
- Disallow: /
Essentially, that means disallow everything within the root folder, which is your entire site. We have actually had people who couldn’t get any of their pages indexed within the search engines, only to find that this rule was in their robots.txt file. A Disallow line with no path after it means disallow “nothing,” but Disallow: / still blocks the whole site, and the two are easy to confuse, so I don’t recommend taking the chance.
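The difference between the two forms can be checked locally. This is a small sketch using Python's standard-library urllib.robotparser; the allows helper and the example.com URL are hypothetical, and the result reflects how this parser reads the rules, not a guarantee about any particular search engine.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, url="https://example.com/any-page.html"):
    """Return True if the robots.txt text lets Googlebot fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# A slash after Disallow matches every path on the site.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
# An empty Disallow matches nothing, so every path stays open.
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"

print("Disallow: /  ->", allows(BLOCK_EVERYTHING))
print("Disallow:    ->", allows(BLOCK_NOTHING))
```

With this parser, the first rule set refuses every URL and the second permits every URL, which is why a single stray slash can keep an entire site out of the index.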