You have duplicate content when:
- you have more than one version of any page
- you reference any page with more than one URL
- someone plagiarizes your content
- you syndicate content
And it’s a problem for two reasons:
- Duplicate content filter – Let’s say there are two pages of identical content out there on the Web. Google doesn’t want to list both in the SERPs, because it’s after variety for searchers. The duplicate content filter identifies the pages, then Google applies intelligence to decide which is the original. It then lists only that one in the SERPs. The other one misses out. Problem is, Google may choose the wrong version to display in the SERPs. (There’s no such thing as a duplicate content penalty.)
- PageRank dilution – Some webmasters will link to one page/URL and some will link to another, so your PageRank is spread across multiple pages, instead of being focused on one. Note, however, that Google claims that they handle this pretty well, by consolidating the PageRank of all the links.
Below are some examples of duplicate content and how to resolve them.
Multiple versions of the same page
Multiple versions of the same page is clearly duplicate content. (e.g. A print-friendly version and the regular display version.) The risk is that Google may choose the wrong one to display in the SERPs.
Use a nofollow link to the print-friendly version. This tells Google’s bots not to follow the link, which helps keep the print-friendly version out of the index (though it’s no guarantee if other pages link to it). The HTML of a nofollow link looks like this:
<a href="page1.htm" rel="nofollow">go to page 1</a>
Or use your robots.txt file to tell the search bots not to crawl the print friendly version.
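For example, if your print-friendly pages all sit in a single directory (the /print/ path here is just a hypothetical layout – adjust it to match your site), a couple of lines in robots.txt will keep the bots out:

```
User-agent: *
Disallow: /print/
```

This blocks every well-behaved bot from crawling anything under /print/.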
Multiple URLs for a single page
Even though there’s really only one page, the search engines interpret each discrete URL as a different page. There are two common reasons for this problem:
- No canonical URL specified
- Referral tracking & visitor tracking
No canonical URL specified
A canonical URL is the master URL of your home page – the one that should display whenever your home page displays. For most sites, it would be http://www.yourdomain.com.
Test if your site has a canonical URL specified. Open your browser and visit each of the following URLs (substituting your domain name, of course).
- http://www.yourdomain.com/index.html (or index.htm)
- http://yourdomain.com/index.html (or index.htm)
If your home page displays, but the URL stays exactly as you typed it, you have not specified a canonical URL, and you have duplicate content.
- Choose one of the above as your canonical URL. It doesn’t really matter which one. Then redirect the others to it with 301 redirects. (Your web developer should know how to set up a 301 redirect, but just in case, here’s a 301 Redirect How to…)
- Specify your preferred domain in Google Webmaster Tools (you have to register first). To do this, at the Dashboard, click your site, then click Settings and choose an option under Preferred domain. This is the equivalent of a 301 redirect for Google. But it has no impact on the other search engines, so you should still set up proper 301 redirects.
- Create and submit an open format (Google) sitemap and ensure that it uses only the appropriate (‘canonical’) URLs. (See ‘Create an open format / Google sitemap’.)
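On an Apache server, those 301 redirects are typically set up with mod_rewrite in an .htaccess file. This is just a sketch assuming you’ve chosen http://www.yourdomain.com as your canonical URL (substitute your own domain):

```
RewriteEngine On
# Send the non-www version to the www version, permanently (301)
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
# Send /index.html (or /index.htm) to the root URL
RewriteRule ^index\.html?$ http://www.yourdomain.com/ [R=301,L]
```

Each variant now answers with a 301 pointing at the one canonical URL, so both bots and linking webmasters settle on a single address.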
Referral tracking & visitor tracking
If you’re storing referrer and/or user data in your URLs, your URLs will vary depending on who the visitor is and/or what link they clicked to arrive at your page. This may be the case if you manage a forum (e.g. a phpBB 2.x forum) or you participate in an affiliate program.
In addition to the duplicate content filter and PageRank dilution problems, this sort of duplicate content makes your site a ‘bot-trap’, significantly increasing the time it takes search engine bots to crawl your site. In Google’s words:
“Duplicated content can lead to inefficient crawling: when Googlebot discovers ten URLs on your site, it has to crawl each of those URLs before it knows whether they contain the same content (and thus before we can group them as described above). The more time and resources that Googlebot spends crawling duplicate content across multiple URLs, the less time it has to get to the rest of your content.” (From the Google Webmaster Central Blog)
- If you host a forum on your site, find out if upgrading to the most recent version will resolve the problem. (e.g. phpBB 3.0 handles dynamic URLs in a search-friendly way.)
- Devise an appropriate strategy for referrer/visitor tracking. This is well beyond the scope of this book (and my expertise). Please see Nathan Buggia’s URL Referrer Tracking for more information.
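To see why tracking parameters create duplicate content, here’s a small Python sketch – the parameter names (sid, ref, affid) are hypothetical, so adjust them to whatever your forum or affiliate software actually appends – that strips the tracking parameters so all the URL variants collapse back to one canonical URL:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical tracking parameters; adjust to match your own URLs
TRACKING_PARAMS = {"sid", "ref", "affid"}

def canonicalize(url):
    """Drop tracking parameters so URL variants collapse to one canonical URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

variants = [
    "http://www.example.com/widgets.php?id=7&sid=a1b2c3",
    "http://www.example.com/widgets.php?id=7&ref=homepage",
    "http://www.example.com/widgets.php?id=7&affid=42",
]
# All three tracking variants collapse to the same canonical URL
assert {canonicalize(u) for u in variants} == {"http://www.example.com/widgets.php?id=7"}
```

Three URLs a bot would otherwise crawl separately turn out to be one page – which is exactly the wasted crawling Google describes above.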
WordPress
WordPress causes a lot of duplicate content issues by naturally pointing to the same content with multiple different URLs. (e.g. A single post can be accessed through the blog’s home page, search results, date archives, author archives, category archives, etc. And each of these access points has a different URL.)
Solution: For advice on overcoming duplicate content issues on WordPress blogs, see ‘’ on p.77.
Someone has plagiarized your content
If someone has plagiarized your content, Google may mistakenly identify their plagiarized version as the original. This is unlikely, however, because most webmasters who plagiarize content are unlikely to have a very credible, authoritative site.
Solution: You can contact the offender and ask that they remove the content, and you can also report the plagiarism to Google (http://www.google.com/dmca.html). You can also proactively monitor who’s plagiarizing your content using Copyscape.
You syndicate content
If you publish content on your site and also syndicate it, your site’s version may not appear in the SERPs. If one of the sites that has reprinted your article has more domain authority than yours, their syndicated version may appear in the SERPs instead of yours. Also, other webmasters may link to the syndicated version instead of yours.
Solution: One way to avoid this situation is to always publish the article on your site a day or two before you syndicate it. Another is to always link back to the original from the syndicated version. Whatever the case, the backlink from the syndicated article still contributes to your ranking. You just may not get as much direct search-driven traffic to the article (which really isn’t the point of content syndication, anyway).
Be very wary of Flash
Google CAN read Flash (SWF files). A bit. But you should still be very wary of Flash if you want a high ranking. Below is a quick explanation of why.
- Links in Flash may not be ‘follow-able’ – Tim Nash conducted 30 tests over 4 domains to see what information the search engines saw from a Flash file. His results suggest that links in Flash are stripped of anchor text and appear not to be followed.
- Other search engines can’t read Flash at all – Google isn’t the only search engine out there. It may be the biggest, but the others are still important. Yahoo claims the ability to read Flash, but at the time of writing, it appears they still don’t. None of the others do.
- Many mobile phones don’t handle Flash well – Over the last year – thanks mostly to the release of the iPhone – mobile search has become very popular. But it’s still very much in its infancy, and many mobile phones don’t yet handle Flash properly.
- More technical reasons to avoid Flash – If you’re still not convinced, read Rand Fishkin’s SEOmoz blog post, Flash and SEO – Compelling Reasons Why Search Engines & Flash Still Don’t Mix. Or read Vanessa Fox’s (ex-Google Webmaster blogger) blog post Search-Friendly Flash?
But if you insist on using Flash…
If you really, really, REALLY want to use Flash, despite all of the potential problems above, then at least make an underlying text version of its content available, complete with keyword-rich links. Just to be sure. And if it’s a whole page or a whole site, make sure you deliver it at the same URL, so there are no duplicate content issues.
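One common way to do this is SWFObject-style replacement: publish the full text content in plain HTML at the page’s URL, then let JavaScript swap in the Flash movie for visitors who have the plugin. (The file names below are placeholders, and swfobject.js refers to the free SWFObject library.)

```html
<div id="flash-content">
  <!-- Bots and visitors without Flash see this keyword-rich text version -->
  <h1>Blue Widgets</h1>
  <p>Our <a href="blue-widgets.htm">blue widgets</a> come in three sizes...</p>
</div>
<script type="text/javascript" src="swfobject.js"></script>
<script type="text/javascript">
  // Replaces the div above with movie.swf only when Flash 9+ is available
  swfobject.embedSWF("movie.swf", "flash-content", "550", "400", "9.0.0");
</script>
```

Because the text version lives at the same URL as the movie, there’s no duplicate content issue either.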
For some more technical advice on optimizing Flash for search, read Jonathan Hochman’s article, How to SEO Flash, first.
Be careful with AJAX
AJAX enhanced sites can deliver a rich visitor experience, but they can also be very difficult for search engine bots to crawl. Following is a checklist to help you develop AJAX pages that are visitor AND search engine friendly.
- Ensure your static links don’t contain a #, as search engines typically won’t read past it.
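In practice, that means giving every AJAX-loaded view a normal link with a real, hash-free URL, and layering the AJAX behavior on top (the page name and loadSection function here are hypothetical):

```html
<!-- Bots follow the real href; script-enabled browsers get the AJAX version -->
<a href="products.htm" onclick="loadSection('products.htm'); return false;">Products</a>
```

Visitors without JavaScript (and the bots) simply load products.htm as an ordinary page.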
For more information…
DON’T use text within images
Search engines can’t read words that are presented in image files. So if you embed your copy in images, you’ll find it a lot harder (if not impossible) to get indexed for the right searches.
So don’t have your graphic designer lay out a beautiful page of copy and save it as a .gif or .jpg file, then upload it to your site. For all its beauty, it’ll be completely wasted on the search engines.
Your best bet is to present all important text as straight HTML text. You can get fancy with sIFR text replacement if you want, but that starts getting fairly complicated, so you’d want to have a pretty good reason.
DON’T rely too heavily on footer links for navigation
Most visitors don’t pay too much attention to footer links. Not surprisingly, the search engines are following suit. Yahoo already ignores them:
“The irrelevant links at the bottom of a page, which will not be as valuable for a user, don’t add to the quality of the user experience, so we don’t account for those in our ranking.” (Priyank Garg, director of product management for Yahoo! Search Technology (YST)).
What’s more, Google has filed a patent application for “Systems and methods for analyzing boilerplate…” Although it may not actually use that technology to discount the impact of footer links, it’s certainly not out of the question. Remember, Google tends to ignore the things visitors ignore, and to place great emphasis on the things they value.
I’m not saying don’t put nav links down there; I’m saying don’t use them as your main form of navigation.
DON’T use empty hyperlinks with deferred hyperlink behavior
Make sure the targets of all your hyperlinks are real. Don’t use empty hyperlinks that have some sort of deferred behavior.
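Here’s the difference in the HTML (the showPage function is hypothetical):

```html
<!-- Bad: an empty link whose behavior is deferred to JavaScript.
     Bots find nothing to follow. -->
<a href="#" onclick="showPage('about')">About us</a>

<!-- Better: a real target, with the script layered on top -->
<a href="about.htm" onclick="showPage('about'); return false;">About us</a>
```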
DON’T use Silverlight
Just don’t do it. Perhaps things will change, but for now, if you want your pages to rank, steer clear of Silverlight. Straight from the horse’s (Google’s) mouth:
“…we still have problems accessing the content of other rich media formats such as Silverlight… In other words, even if we can crawl your content and it is in our index, it might be missing some text, content, or links.”
DON’T use Frames or iFrames
Pages that use frames aren’t really single pages at all. Each frame on the page actually displays the content from an entirely different page. The frames and their content are all blended and arranged on the page you see according to the instructions found on another page entirely, called the ‘frameset’ page.
The problem with this is that search engine bots only see the ‘frameset’ page. They don’t see the page you see at all, nor the individual pages that make it up. Those individual pages may have lots of really helpful, keyword-rich content and links, and the search engines don’t see it.
Although you can use the “NoFrames” tag to provide alternate content that the search engines can read, you’ll still be undermining your SEO because the links and content within a frame aren’t considered part of the page they display on. This means there’s no alignment between backlinks pointing to that page and the content that page displays. In other words, those backlinks won’t seem as relevant to the bots. Likewise, the links on the display page won’t pass on any PageRank to the pages they point to, because they actually exist on a page that doesn’t have a public URL, and which therefore doesn’t attract any backlinks.
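If you’re stuck with an existing frameset, the “NoFrames” fallback looks like this (the file names are placeholders) – keeping in mind that it only softens the problem, it doesn’t solve it:

```html
<frameset cols="25%,75%">
  <frame src="menu.htm" name="menu">
  <frame src="content.htm" name="content">
  <noframes>
    <body>
      <!-- Roughly the only content the bots will associate with this URL -->
      <p>This site uses frames. You can go straight to the
         <a href="content.htm">main content</a>.</p>
    </body>
  </noframes>
</frameset>
```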
DON’T use spamming techniques
Before discussing what sorts of spam you should make sure you’re not engaging in, let me first say this: it’s almost impossible to spam unintentionally. Search engine spamming usually involves quite a bit of work and knowledge.
But just to be sure, here’s a quick look at what you shouldn’t be doing.
What is search engine spam?
A website is considered search engine spam if it violates a specific set of rules in an attempt to seem like a better or more relevant website. In other words, if it tries to trick the search engines into thinking that it’s something it’s not.
On-page spam is deceptive stuff that appears on your website. According to Aaron D’Souza of Google, speaking at the October 2008 SMX East Search Marketing Conference, in New York City, the following are considered on-page spam:
- Cloaking – Showing one thing to search engines and something completely different to visitors.
- Hidden content – Some webmasters repeat their keywords again and again, on every page, then hide them from visitors. These keywords aren’t in sentences; they’re just words, and they provide no value. That’s why they’re hidden, and that’s why it’s considered spam. The intent is to trick the search engines into thinking that the site contains lots of keyword-rich, helpful content, when, in fact, it’s just keywords and nothing more. These spammers hide their keywords by using tiny type (e.g. a 1pt font), or by using a font color that’s the same as the background color.
- Keyword stuffing – Severely overdoing your keyword density. Try to stick to around 3% keyword density. This is the most reader-friendly density. Usually anything over 5% starts to seem very contrived. (See ‘How many times should I use a keyword?’ for more information on keyword density.)
- Doorway pages – Page after page of almost identical pages intended to simply provide lots and lots of keyword-rich content and links, without providing any genuine value to readers.
- Scraping – Spammers who are too lazy or incapable of creating their own content will steal it from other sites, blogs, articles and forums, then re-use it on their own site without permission, and without attributing it to its original author. The intent is to create lots of keyword rich content on their website, and trick the search engines into thinking their site is valuable, without actually doing any of the work themselves.
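For the record, here’s what the hidden content trick typically looks like in a page’s source – shown so you can recognize it (e.g. in a site you’ve inherited), not copy it:

```html
<!-- Spam: a keyword list hidden by matching the background color
     and shrinking the type -->
<p style="color:#ffffff; background-color:#ffffff; font-size:1pt;">
  cheap widgets blue widgets buy widgets widgets online widget store
</p>
```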
Off-page (link) spam
According to Google, the following link schemes are considered spam:
- Links intended to manipulate PageRank*
- Links to web spammers or bad neighborhoods on the web
- Excessive reciprocal links or excessive link exchanging
- Buying or selling links that pass PageRank
According to Sean Suchter of Yahoo (now with Microsoft), the search engines are always on the lookout for websites that:
…get a LOT of really bad links, really fast.” (Speaking at the 2008 SMX East search marketing conference.)
They also look out for links out to bad sites.
But if “Links intended to manipulate PageRank” are spam, then every webmaster who follows Google’s own advice for improving their site’s ranking is spamming:
“In general, webmasters can improve the rank of their sites by increasing the number of high-quality sites that link to their pages.”
Clearly, in point one above, Google is referring to people who are out-and-out spamming: creating undeserved links that offer absolutely no value to visitors.
- Choose the right web host.
- Use HTML text copy, links & breadcrumb trails.
- Position your content toward the top of your HTML code.
- Create lots of content and regularly update your site.
- Optimize your HTML meta tags.
- Use image captions.
- Cluster pages around keywords.
- Optimize your internal links.
- Add titles to internal links.
- Link to related sites.
- Add a sitemap page.
- Create a Google sitemap.
- Check for broken links.
- Use permanent 301 redirects for changed URLs.
- Create a 404 error handling page.
- Create a robots.txt file.
- Use either subfolders or subdomains – they’re both OK.
- Consider making dynamic URLs static.
- Avoid duplicate content.
- Don’t use text within images.
- Don’t rely too heavily on footer links for navigation.
- Don’t use empty hyperlinks with deferred hyperlink behavior.
- Don’t use Silverlight.
- Don’t use Frames or iFrames.
- Don’t spam.