Sunday 20 November 2011

Panda DNA: Algorithm Tests on the Google Panda Update

The tests that were still running when I wrote that article have now concluded, and today I'd like to present some remarkable conclusions.
Test Setup
We needed full control over our Google Panda update tests, which meant getting some of our own sites affected. Sites with no real history would give the cleanest results, but we also wanted to test how various historical aspects affect Panda. We therefore used new content for every site, but made sure the domains differed in their inbound link history.
We ended up with 50 separate domains. None of them could be linked to each other through signals like ownership, hosting, Google accounts, or link partners. We focused on the English language, because Panda was only active on Google.com when we started.
By gradually increasing the amount of textual spam (combining all the types we could think of), we wanted to find out when certain domains would be affected and in what way. Because we knew Panda was a machine learning algorithm, we expected many separate patterns and thresholds for the various types of content spam. That makes individual spam tactics much harder to reverse engineer, so reverse engineering wasn't the goal of this test.
Test Goals
Before running our tests, we collected as many examples of Panda-affected sites as possible. What the affected sites seemed to have in common had everything to do with thin content.
Strangely enough, various sites with this same problem weren't affected at all, while sites with great content were affected because of technical issues that caused additional duplicate content. The effects of Panda seemed equally strict for all affected websites, but was this really the case?
We wanted to know:
Can you be affected by Panda in various degrees?
What affects the amount of thin content that is tolerated?
What is the best way to recover from Panda?
For the degree of Panda, we looked at how far rankings dropped from the starting situation, in which all content was unique. By increasing the amount of spam, we wanted to see whether the ranking effect set in gradually or at an absolute threshold.
We also measured the degree by the number of pages and search terms affected, and by the effect on the remaining unique, high-quality pages.
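To make that concrete, here is a minimal sketch, with made-up ranks and a made-up cutoff, of how such a per-keyword drop measurement could look:

```python
# A minimal sketch, with fabricated data, of measuring the degree of a
# Panda hit: compare each tracked (page, keyword) rank against the
# baseline taken while all content was still unique. The page names,
# ranks, and the 10-position cutoff are assumptions for illustration.
baseline = {("/page-a", "blue widgets"): 3,
            ("/page-a", "cheap widgets"): 8,
            ("/page-b", "widget reviews"): 5}
current = {("/page-a", "blue widgets"): 41,
           ("/page-a", "cheap widgets"): 52,
           ("/page-b", "widget reviews"): 6}

DROP_THRESHOLD = 10  # positions; treat anything beyond this as "affected"

affected = {key: current.get(key, 100) - rank  # 100 = fell out of tracking
            for key, rank in baseline.items()
            if current.get(key, 100) - rank >= DROP_THRESHOLD}

print(f"{len(affected)} of {len(baseline)} tracked keywords affected")
for (page, keyword), drop in affected.items():
    print(f"{page}: dropped {drop} positions for '{keyword}'")
```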
To see why the threshold differed between websites, we looked at the percentage of quality content versus unnatural content. We also looked at the characteristics of each domain's incoming links. How were they treated differently?
For the recovery aspect, we first had to get our sites affected; then we would roll them back to earlier, higher-quality states in which they had not yet been affected.
Test Outcome
Outcome 1: Can you be affected by Panda in various degrees?
We can conclude that we were unable to affect a single page to multiple degrees. Pages were either affected or not. The scale on which multiple pages are affected does, however, grow when sections of your website start to misbehave as well.
The threshold of the Panda effect isn't based on a single page: multiple pages need to misbehave, and then all similar pages drop in ranking.
"Similar" is, however, a strange concept. Similarity is definitely domain-based; the exact same spam on another domain could still rank. It also seems partly based on internal navigation structure and on similarities in page templates.
Even quality pages in sections with low-quality content are affected in the same way. Home pages were only affected when the entire site was of extremely low quality.
Outcome 2: What affects the amount of thin content that is tolerated?
The scale on which pages are of low quality is definitely a factor, but we were unable to pinpoint a fixed percentage of high-quality versus low-quality pages. On some domains, just 20 low-quality pages in a section with 100 medium-quality pages caused all pages to drop in ranking, but in most cases the number of low-quality pages needed to exceed the quality ones tenfold.
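A toy calculation with the numbers above shows how far apart those tipping points were (the section names are hypothetical):

```python
# A toy calculation using the figures observed above: the ratio of
# low-quality to medium-quality pages in a section. The point is that no
# single tipping ratio held across domains; both of these sections tipped.
sections = {
    "sensitive domain": {"low": 20, "medium": 100},    # tipped at 0.2
    "typical domain": {"low": 1000, "medium": 100},    # needed roughly 10x
}

for name, pages in sections.items():
    ratio = pages["low"] / pages["medium"]
    print(f"{name}: low/medium ratio = {ratio:.1f}")
```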
We tried to apply the same degree of spam to all domains at the same time. The text was unique, but the tactics and percentages were quite similar.
Strangely enough, we saw different thresholds being applied to different domains. The domains that were affected early on had one thing in common: their inbound links (from a previous site on the same domain) could be seen as low quality.
Sites with no previous links were affected first; then came the ones with only directory links pointing to them; and finally the ones with various government links, and the ones with a diversified link profile, dropped in ranking. This was the commonality we found, but we could be mistaken.
Outcome 3: What is the best way to recover from Panda?
We had only 50 domains, and they were no longer that similar after the previous tests we ran to get them affected. To draw conclusions about a "best way" to recover, we needed a larger reference group of hundreds of sites.
After contacting affected sites we ended up with 250 tests:
150 tried to gradually reduce the percentage of low-quality pages by removing them or replacing them with quality pages.
100 instantly removed (or disallowed) all low-quality pages and started writing additional quality content (see the sketch after this list).
50 ran various tests, mainly implementing additional tricks to pass off their low-quality content as quality text.
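To make the second approach concrete, here is a minimal sketch, with hypothetical paths, of generating robots.txt Disallow rules for a batch of low-quality pages:

```python
# A minimal sketch of the "instant" approach: emit robots.txt Disallow
# rules so crawlers stop fetching the pages flagged as low quality. Note
# that Disallow only blocks crawling; the per-page noindex tag (see the
# conclusions below) is the more precise tool for removal from the index.
# The paths are hypothetical.
low_quality_paths = [
    "/tag/cheap-widgets/",
    "/scraped/article-17.html",
    "/doorway/city-pages/",
]

lines = ["User-agent: *"]
lines += [f"Disallow: {path}" for path in low_quality_paths]

with open("robots.txt", "w") as f:
    f.write("\n".join(lines) + "\n")

print("\n".join(lines))
```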
The sites that gradually increased quality eventually came back. Their remaining low-quality pages started ranking again, so they could get away with some of it. The instant-removal group returned much more quickly, but at a greater cost: all previous rankings for low-quality pages were lost.
On our own sites, we found, remarkably, that the threshold that got us affected was different from the one required to get us back. Much more quality was needed to really prove we had mended our ways.
One trick we tried was moving to another domain, redirects included, while improving quality to just above the threshold that had caused the ranking drop. Surprisingly enough, it worked, but after we increased the spam again we were dropped a second time.
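For illustration, here is a minimal sketch of that domain move, not our actual setup: every path on the old domain sends a permanent 301 redirect to the same path on a new domain, so link signals can follow. The domain name and the choice of Flask are assumptions.

```python
# A minimal sketch of a domain move: every URL on the old domain issues a
# permanent (301) redirect to the same path on the new domain. The new
# domain name below is hypothetical.
from flask import Flask, redirect

app = Flask(__name__)

NEW_DOMAIN = "https://www.example-new.com"  # hypothetical new domain

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def moved(path):
    # 301 tells crawlers the move is permanent.
    return redirect(f"{NEW_DOMAIN}/{path}", code=301)

if __name__ == "__main__":
    app.run()
```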
Possible Conclusions
Testing is fun, but we aren't sure any of it was representative enough to draw firm conclusions, and what was true then may already have been changed by Google.
Here are some potential conclusions. Learn what you can, but don't copy them blindly.
Panda applies a hard threshold. Pages are either affected or not.
The threshold isn’t the same for each domain. Links seem to cause some difference.
Almost all keywords on an affected page drop in ranking.
Entire sections of pages are affected. Drops for a single page are unlikely to be caused by Panda.
You can recover from Panda by removing low-quality pages from the index (canonical, noindex, etc.).
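To make that last point concrete, here is a minimal sketch of the two de-indexing tags the conclusion refers to; the URLs are hypothetical.

```python
# A minimal sketch of the two tags mentioned above: a robots noindex for
# low-quality pages that should leave the index, and a rel=canonical for
# near-duplicates whose signals should consolidate. URLs are hypothetical.
def noindex_tag() -> str:
    # Goes in the <head> of a page Google should drop from its index.
    return '<meta name="robots" content="noindex, follow">'

def canonical_tag(preferred_url: str) -> str:
    # Goes in the <head> of a duplicate, pointing at the preferred version.
    return f'<link rel="canonical" href="{preferred_url}">'

print(noindex_tag())
print(canonical_tag("https://www.example.com/widgets/"))
```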
