Google Caffeine Search Index Fully Launched
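A month ago, on June 9, Google software engineer Carrie Grimes announced on the official Google blog that Google had finished developing Caffeine, its new web indexing system. She said that, compared with the older indexing technology, Caffeine indexes content faster and returns more up-to-date results.
Google first disclosed some of Caffeine's technical details last August. Its main reason for developing Caffeine was to answer the challenge from competitors such as Microsoft's Bing and the "knowledge engine" Wolfram Alpha, and so keep its lead in search technology.
Grimes said development of Caffeine is complete. Compared with Google's previous index, she said, Caffeine provides "50 percent fresher" results, and indexing is much faster. The old index was built from several layers, she explained, whereas Caffeine analyzes the web in small portions and updates the index continuously and globally, so that new content becomes searchable sooner.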
Original post: http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html (requires bypassing the firewall)
(The original text is kept below for readers who cannot bypass the firewall)
Our new search index: Caffeine
Today, we’re announcing the completion of a new web indexing system called Caffeine. Caffeine provides 50 percent fresher results for web searches than our last index, and it’s the largest collection of web content we’ve offered. Whether it’s a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published than was possible ever before.
Some background for those of you who don’t build search engines for a living like us: when you search Google, you’re not searching the live web. Instead you’re searching Google’s index of the web which, like the list in the back of a book, helps you pinpoint exactly the information you need. (Here’s a good explanation of how it all works.)
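(A rough illustration of the book-index analogy above. This is a minimal sketch of an inverted index, not Google's implementation; the function names `build_index` and `search` and the sample pages are hypothetical and invented for this example.)

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


def search(index, word):
    """Look the word up in the index instead of scanning every page."""
    return index.get(word.lower(), [])


# Hypothetical pages, invented for illustration only.
pages = {
    "example.com/a": "caffeine is a new search index",
    "example.com/b": "coffee contains caffeine",
}
index = build_index(pages)
print(search(index, "caffeine"))  # ['example.com/a', 'example.com/b']
print(search(index, "coffee"))    # ['example.com/b']
```

The point of the sketch is only that a query consults the prebuilt word-to-page map, much as a reader consults the list at the back of a book, rather than reading the pages themselves.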
So why did we build a new search indexing system? Content on the web is blossoming. It’s growing not just in size and numbers but with the advent of video, images, news and real-time updates, the average webpage is richer and more complex. In addition, people’s expectations for search are higher than they used to be. Searchers want to find the latest relevant content and publishers expect to be found the instant they publish.
To keep up with the evolution of the web and to meet rising user expectations, we’ve built Caffeine. The image below illustrates how our old indexing system worked compared to Caffeine:

(Image: Google Caffeine web content indexing technology)
Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.
With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before—no matter when or where it was published.
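(To make the contrast between the two approaches concrete, here is a minimal sketch assuming a toy word-to-URL index. It is not how Google's systems are built; `rebuild_index` and `update_index` are hypothetical names, and the pages are invented. The first function re-analyzes everything, as a layer refresh in the old index did; the second touches only the entries affected by one new or changed page.)

```python
from collections import defaultdict


def rebuild_index(all_pages):
    """Batch style: re-analyze every known page to produce a fresh index."""
    index = defaultdict(set)
    for url, text in all_pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def update_index(index, url, old_text, new_text):
    """Incremental style: change only the entries affected by one page."""
    # Remove the page from the words it used to contain.
    for word in set(old_text.lower().split()):
        index[word].discard(url)
        if not index[word]:
            del index[word]
    # Add the page under the words it contains now.
    for word in new_text.lower().split():
        index[word].add(url)


# Hypothetical pages, invented for illustration only.
pages = {"example.com/a": "old news"}
index = rebuild_index(pages)

# A new page is found: add it straight to the index.
pages["example.com/b"] = "fresh news"
update_index(index, "example.com/b", "", pages["example.com/b"])

# An existing page changes: update only that page's entries.
update_index(index, "example.com/a", pages["example.com/a"], "updated news")
pages["example.com/a"] = "updated news"

print(sorted(index["news"]))           # ['example.com/a', 'example.com/b']
print(index == rebuild_index(pages))   # True
```

Both routes end with the same index, but the incremental one does work proportional to the changed page rather than to the whole collection, which is why new content can appear sooner.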
Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles.
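(A quick check of the iPod comparison, using only the figures quoted in the paragraph above plus the standard conversion of 63,360 inches per mile. The variable names are made up for this sketch.)

```python
TOTAL_GB = 100_000_000     # "nearly 100 million gigabytes" (from the post)
IPOD_COUNT = 625_000       # number of iPods (from the post)
STACK_MILES = 40           # "more than 40 miles" (from the post)
INCHES_PER_MILE = 63_360   # 5,280 feet x 12 inches

gb_per_ipod = TOTAL_GB / IPOD_COUNT
inches_per_ipod = STACK_MILES * INCHES_PER_MILE / IPOD_COUNT

print(gb_per_ipod)                # 160.0
print(round(inches_per_ipod, 2))  # 4.06
```

So the comparison implies 160 GB per device and roughly four inches of length each, which is consistent with the post's "largest iPods" stacked end-to-end.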
We’ve built Caffeine with the future in mind. Not only is it fresher, it’s a robust foundation that makes it possible for us to build an even faster and comprehensive search engine that scales with the growth of information online, and delivers even more relevant search results to you. So stay tuned, and look for more improvements in the months to come.
Posted by Carrie Grimes, Software Engineer
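(A translation follows, kept for readers whose English is not strong)
Today we are announcing that development of Caffeine, Google's new web indexing system, is complete. Compared with our previous index, Caffeine provides 50 percent fresher results for web searches, and it is the largest collection of web content we have offered. Whether it is a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published.
Some background for readers who do not build search engines for a living: when you search Google, you are not searching the live web. You are searching Google's index of the web, which works like the index at the back of a book and helps you pinpoint the information you need.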
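So why did we build a new indexing system? Content on the web is growing every day, and not only in volume: with the arrival of video, images, news and real-time updates, the average web page is richer and more complex than before. People's expectations for search are also higher than they used to be. Searchers want to find the latest relevant content, and publishers expect to be found the moment they publish.
To keep up with the evolution of the web and to meet rising user expectations, we built Caffeine. Our old index had several layers, some of which were refreshed faster than others; the main layer was updated every couple of weeks. To refresh a layer of the old index we had to analyze the entire web, which meant a significant delay between when we found a page and when it became available in search results.
With Caffeine, we analyze the web in small portions and update the search index continuously, around the globe. When we find new pages, or new information on existing pages, we add them straight to the index. As a result, the gap between what Google returns and what is actually on the web is now very small.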
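Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If those pages were sheets of paper, the pile would grow three miles taller every second. Caffeine takes up nearly 100 million GB of storage in a single database and adds hundreds of thousands of GB of new information per day. You would need 625,000 of the largest iPods to store that much information; stacked end to end, they would stretch more than 40 miles.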
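We built Caffeine with the future in mind. It is not only fresher; it is also a robust foundation that lets us build an even faster and more comprehensive search engine that scales with the growth of information online and delivers more relevant results. Stay tuned, and look for further improvements in the months to come.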
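Author: 瑾瑜书生
Permalink: http://www.stuseo.com/google-caffeine
Copyright © All posts on this blog are original unless a source is credited. When reposting, please credit the author and link to the original.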