Diagrams for Solving Crawl Priority & Indexation Issues
December 28, 2009
Google’s Indexation Cap
December 28, 2009
Google (very likely) has a limit it places on the number of URLs it will keep in its main index and potentially return in the search results for domains.
Let’s examine some of the potential metrics Google looks at to determine indexation:
- Importance on the Web’s Link Graph
We’ve talked previously about metrics like a domain-level calculation of PageRank (Domain mozRank is an example of this). It’s likely that Google would make this a backbone of the indexation cap estimate, as sites that tend to be more important and well-linked-to by other important sites tend to also have content worthy of being in the index.
- Backlink Profile of the Domain
The profile of a site’s links can look at metrics like where those links come from, the diversity of the different domains sending links (more is better) and why those links might exist (methods that violate guidelines are often getting caught and filtered so as not to provide value).
- Trustworthiness of the Domain
Calculations like TrustRank (or Domain mozTrust in Linkscape) may make their way into the determination. You may not have as many links, but if they come from sites and pages that Google trusts heavily, your chances for raising the indexation cap likely go up.
- Rate of Growth in Pages vs. Backlinks
If your site’s content is growing dramatically, but you’re not earning many new links, this can be a signal to the engine that your content isn’t “worthy” of ongoing attention and inclusion.
- Depth & Frequency of Linking to Pages on the Domain
If your home page and a few pieces of link-targeted content are earning external links while the rest of the site flounders in link poverty, that may be a signal to Google that although users like your site, they’re not particularly keen on the deep content – which is why the index may toss it out.
- Content Uniqueness
Uniqueness is a constantly moving target and hard to nail down, but basically, if you don’t have a solid chunk of words and images that are uniquely found on one URL (ignoring scrapers and spam publishers), you’re at risk. Google likely runs a number of sophisticated calculations to help determine uniqueness, and in my experience, this analysis is also much tougher on pages and sites that don’t earn high quantities of external links to their deep content.
- Visitor, CTR and Usage Data Metrics
If Google sees that clicks to your site frequently result in a click of a back button, a return to the SERPs and the selection of another result (or another query) in a very short time frame, that can be a negative signal. Likewise, metrics they gather from the Google toolbar, from ISP data and other web surfing analyses could enter into this mix. While CTR and usage metrics are noisy signals (one spammer with a Mechanical Turk account can swing the usage graph pretty significantly), they may be useful to decide which sites need higher levels of scrutiny.
- Search Quality Rater Analysis + Manual Spam Reports
If your content is consistently reported as being low value or spam by users and/or quality raters, expect a visit from the low indexation cap fairy. This may even be done on a folder-by-folder basis if certain portions of your site are particularly egregious while other material is index-worthy (and that phenomenon probably holds true for all of the criteria above as well).
Now let’s talk about some leading indicators that can help to show if you’re at risk:
- Deep pages rarely receive external links – if you’re producing hundreds or thousands of pages of new content and fewer than “dozens” earn any external link at all, you’re in a sticky situation. Sites like Wikipedia, the NYTimes, About.com, Facebook, Twitter and Yahoo! have millions of pages, but they also have dozens to hundreds of millions of links, and relatively few pages that have no external links. Compare that against your 10 million page site with 400K pages in the index (which is more pages than what Google reports indexing on Adobe.com, one of the best linked-to domains on the web).
- Deep pages don’t appear in Google Alerts – if Google Alerts is consistently passing you by (not reporting your deep pages), this can be (but isn’t universally) an indication that they’re not perceiving your pages as being unique or worthy enough of the main index in the long run.
- Rate of crawling is slow – if you’re updating content, links and launching new pages multiple times per day, and Google’s coming by every week, you’re likely in trouble. XML Sitemaps might help, but it’s likely you’re going to need to improve some of those factors described above to get in good graces for the long term.
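The first indicator boils down to two simple ratios, and it can help to put numbers on them. The sketch below is only an illustration: the `share` helper is a made-up name, the 10 million / 400K figures come from the example above, and the deep-page link counts are hypothetical values you would replace with numbers from your own link and indexation reports.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a fraction between 0.0 and 1.0."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    if not 0 <= part <= whole:
        raise ValueError("part must be between 0 and whole")
    return part / whole


# Figures from the example in the text: 400K indexed pages on a 10M page site.
indexed_share = share(400_000, 10_000_000)

# Hypothetical figures: 5,000 new deep pages, 24 of which earned an external link.
linked_share = share(24, 5_000)

print(f"Pages in the index: {indexed_share:.1%}")  # 4.0%
print(f"New deep pages with external links: {linked_share:.2%}")
```

Neither ratio has a known threshold at which Google acts; they are only useful for tracking whether your own site is trending in the right direction over time.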
Google Crawls RSS/Atom Feeds to Discover New URLs Faster!
October 30, 2009
Google has launched a new feature that uses RSS and Atom feeds to discover new web pages more quickly. The feature will allow Google to process and index web pages much faster than traditional methods and allow new content to be displayed in search results as soon as it goes live.
Google may use several potential sources to access feed updates, including direct crawls of feeds, notification services, or Reader.
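To see why a feed is such a convenient discovery source, here is a minimal sketch of pulling page URLs out of an RSS 2.0 or Atom feed with Python's standard library. This is not how Google does it (that isn't public); the `urls_from_feed` function and the sample feed with its example.com URLs are purely illustrative.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def urls_from_feed(feed_xml: str) -> list[str]:
    """Collect the page URLs listed in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(feed_xml)
    urls: list[str] = []

    # RSS 2.0: each <item> carries its page URL as the text of <link>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())

    # Atom: each <entry> carries its page URL in <link href="...">;
    # a missing rel attribute means rel="alternate".
    for entry in root.iter(f"{ATOM_NS}entry"):
        for link in entry.findall(f"{ATOM_NS}link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                urls.append(href)

    return urls


# Hypothetical sample feed for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item><title>Post 1</title><link>http://example.com/post-1</link></item>
    <item><title>Post 2</title><link>http://example.com/post-2</link></item>
  </channel>
</rss>"""

print(urls_from_feed(SAMPLE_RSS))
```

Every new post shows up in the feed the moment it is published, so a crawler reading the feed does not have to wait until it rediscovers the page through links.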
For Google to use your RSS/Atom feeds to discover and index your content, your robots.txt file must allow the feeds to be crawled. You can check whether Googlebot can crawl your feeds by testing your feed URLs with the robots.txt tester in Google Webmaster Tools.
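If you want a quick local check before heading to Webmaster Tools, Python's standard `urllib.robotparser` module can evaluate a robots.txt file against a feed URL. Treat this as a rough sketch: the `feed_is_crawlable` helper, the sample rules and the example.com URL are all made up for illustration, and the standard-library parser does not replicate every detail of Googlebot's own matching (wildcards, for example), so the Webmaster Tools tester remains the authority.

```python
from urllib.robotparser import RobotFileParser


def feed_is_crawlable(robots_txt: str, feed_url: str,
                      user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch feed_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, feed_url)


# Hypothetical robots.txt that blocks the feed directory for every crawler.
BLOCKING_RULES = """User-agent: *
Disallow: /feed/
"""

# Hypothetical robots.txt that blocks only an unrelated directory.
ALLOWING_RULES = """User-agent: *
Disallow: /private/
"""

FEED_URL = "http://example.com/feed/rss.xml"  # placeholder feed URL

print(feed_is_crawlable(BLOCKING_RULES, FEED_URL))  # False
print(feed_is_crawlable(ALLOWING_RULES, FEED_URL))  # True
```

To test a live site, you would pass in the text of your actual robots.txt file and your real feed URLs instead of the sample values.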
Source: http://blog.searchenginewatch.com/091030-021320




