BigTable is a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs; it is currently not distributed or used outside of Google. Development began in 2004[1], and it is now used by a number of Google applications, such as MapReduce, which is often used for generating and modifying data stored in BigTable[2], Google Reader,[3] Google Maps,[4] Google Print, "My Search History", Google Earth, Blogger.com, Google Code hosting, Orkut[4], and YouTube[5]. Google's reasons for developing its own database include licensing costs, scalability, and better control of performance characteristics.[6]
BigTable is a fast and extremely large-scale column-oriented database system, with a focus on quick reads from columns, not rows. It's designed to scale into the petabyte range across "hundreds or thousands of machines, and to make it easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration".[7] Each table has multiple dimensions (one of which is a field for time, allowing versioning). Tables are optimized for GFS by being split into multiple tablets: segments of the table, split along a row chosen such that each tablet will be ~200 megabytes in size. When sizes threaten to grow beyond a specified limit, the tablets are compressed using the algorithms BMDiff and Zippy, which are described as less space-optimal than LZW but more efficient in terms of computing time. The locations in the GFS of tablets are recorded as database entries in multiple special tablets, which are called "META1" tablets. META1 tablets are found by querying the single "META0" tablet, which typically has a machine to itself, since clients often query it for the location of the META1 tablet that in turn records where the actual data is located. Like GFS' master server, the META0 machine is not generally a bottleneck, since the processor time and bandwidth necessary to discover and transmit META1 locations are minimal and clients aggressively cache locations to minimize queries.
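The two-level lookup described above (META0, then META1, then the data tablet, with client-side caching) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the row-range layout, and the location strings such as "tablet-1" are hypothetical and are not taken from Google's implementation.

```python
from bisect import bisect_left


class LocationTablet:
    """Hypothetical location tablet: maps row ranges (start, end] to a location."""

    def __init__(self, entries):
        # entries: list of (start_key, end_key, location), sorted by end_key,
        # assumed to cover a contiguous range of row keys.
        self.entries = entries
        self.end_keys = [end for _, end, _ in entries]
        self.lookups = 0  # counts how often this tablet is queried

    def locate(self, row_key):
        self.lookups += 1
        i = bisect_left(self.end_keys, row_key)  # first range ending at or after row_key
        if i == len(self.end_keys):
            raise KeyError(f"no tablet covers row {row_key!r}")
        return self.entries[i]


class Client:
    """Hypothetical client that caches tablet locations to avoid repeat queries."""

    def __init__(self, meta0, meta1_tablets):
        self.meta0 = meta0                  # the single META0 tablet
        self.meta1_tablets = meta1_tablets  # name -> META1 LocationTablet
        self.cache = []                     # cached (start, end, location) ranges

    def find_tablet(self, row_key):
        for start, end, location in self.cache:
            if start < row_key <= end:
                return location  # cache hit: no META0 or META1 query needed
        _, _, meta1_name = self.meta0.locate(row_key)
        entry = self.meta1_tablets[meta1_name].locate(row_key)
        self.cache.append(entry)
        return entry[2]


meta0 = LocationTablet([("", "m", "meta1-a"), ("m", "~", "meta1-b")])
meta1_tablets = {
    "meta1-a": LocationTablet([("", "g", "tablet-1"), ("g", "m", "tablet-2")]),
    "meta1-b": LocationTablet([("m", "t", "tablet-3"), ("t", "~", "tablet-4")]),
}
client = Client(meta0, meta1_tablets)
print(client.find_tablet("com.example"))  # resolved through META0 and META1
print(client.find_tablet("com.google"))   # served from the client cache
```

Because the second lookup falls in an already cached row range, META0 is not contacted again, which is the reason the text gives for META0 not becoming a bottleneck.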
Other Implementations
The Hadoop project has made some progress toward a working implementation of BigTable. They call this project Hbase.
"Just as Bigtable leverages the distributed data storage provided by the Google File System, Hbase will provide Bigtable-like capabilities on top of Hadoop."
References
- ^ "First an overview. BigTable has been in development since early 2004 and has been in active use for about eight months (about February 2005)." Google's BigTable
- ^ "Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce job". pg 3 of "Bigtable: A Distributed Storage System for Structured Data", 2006
- ^ "Reader is using Google's BigTable in order to create a haven for what is likely to be a massive trove of items." Official Google Reader blog.
- ^ a b "There are currently around 100 cells for services such as Print, Search History, Maps, and Orkut." Google's BigTable
- ^ "Their new solution for thumbnails is to use Google’s BigTable, which provides high performance for a large number of rows, fault tolerance, caching, etc. This is a nice (and rare?) example of actual synergy in an acquisition." YouTube Scalability Talk
- ^ "We have described Bigtable, a distributed system for storing structured data at Google....Our users like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time...Finally, we have found that there are significant advantages to building our own storage solution at Google. We have gotten a substantial amount of flexibility from designing our own data model for Bigtable." from the Conclusion of "Bigtable: A Distributed Storage System for Structured Data", 2006
- ^ "Database War Stories #7: Google File System and BigTable"
External links
- Bigtable: A Distributed Storage System for Structured Data -(official paper; PDF)
- BigTable: A Distributed Structured Storage System (video)
- more video
- Google's BigTable -(notes on the official presentation)
- "How Google Works"
- "Google's BigTable" -(from the blog "Geeking with Greg")
- C-Store and Google BigTable
- Mondrian uses BigTable - by Guido van Rossum
- Bigtable-like structured storage for Hadoop HDFS - (from the Lucene-hadoop wiki)
- MapReduce
- Column-oriented DBMS
Wednesday, December 26, 2007
The Big Table
Google Code for Educators - Google: Cluster Computing and MapReduce
Below are some cool Google videos on cluster computing.
Google: Cluster Computing and MapReduce
This submission contains video lectures and related course materials from a series of lectures that was taught to Google software engineering interns during the Summer of 2007.
Lectures
Distributed systems overview, review of synchronization and networking.
Slides - Introduction to Distributed System Design
Overview of the MapReduce programming model.
Lecture 3 - Distributed File Systems
Overview of distributed file systems with attention to the Google File System.
Slides - The Google File System
Lecture 4 - Clustering Algorithms
Types of clustering algorithms, MapReduce implementations of K-Means and Canopy Clustering
Graph representations, distributed Pagerank, distributed Dijkstra.
Tuesday, December 25, 2007
UWTV Program: The Google Linux Cluster
The University of Washington Department of Computer Science and Engineering has a couple of interesting broadcasts related to high performance computing.
The Google Linux Cluster
Google's Linux cluster currently processes over 150 million queries a day, searching a multi-terabyte web index for every query with an average response time of less than a quarter of a second, with near-100% uptime. In this discussion, Google Fellow Urs Hölzle will describe the software and hardware infrastructure that makes this performance possible, as well as provide an overview of the main problems facing a web search, software architecture, servers and compact rack hardware designs.

CSE Colloquia - 2002
The University of Washington Department of Computer Science and Engineering presents broadcasts of research colloquia by members of the department and the greater computer science community. The colloquia present cutting-edge research in all areas of computer science.
Included in this series are the following programs:
- Amazon.com: Differentiating with Technology
- Assisted Cognition
- Automatic Tools for Building Secure Systems
- Automating the Design of Visualizations
- Autonomous Computing
- Computer Graphics: Communications Media
- Computer Science Programming Languages
- Data Mining
- Data Structures & Algorithms
- Designing User Interfaces
- Dynamic Invariant Detection
- Embedded Networked Sensing Systems
- Error-Tolerant Networking Protocols
- Fluid Interaction for High Resolution Wall-Size Displays
- Genome: Transcriptional Regulatory Modules
- Herald: Global Event Notification
- Improving Information Interactions
- Information Fusion: Multidocument Summarization
- Interactive Visual Media
- Internet Congestion Control, Bandwidth-Delay Product
- Linear Time Encodable/Decodable Codes
- Logic in Computer Science
- Model Checking Software Artifacts
- Online Science: The World-Wide Telescope
- Parallelizing Programs using Approximate Code
- Proactive Computing: A Progress Report
- PUMA 2: Bridging the Processor/Memory Gap
- Rendering Translucent Materials
- Security Protocols for Broadcast Communications
- Semiconductor Industry, Integrated Solutions
- Sharing and Abstraction in Hierarchical Reinforcement Learning
- Signal-Processing Framework for Forward and Inverse Rendering
- SUDS: Thread Level Speculation with Minimal Hardware Support
- Text Editing: Outlier Finding
- Text Mining with Information Extraction
- The Google Linux Cluster
- Trends in Adaptive Computing
- Visualmotor Tasks and Human Learning
How does the Google platform work?
Google requires large computational resources in order to provide their service. This article describes the technological infrastructure behind Google's websites, as presented in the company's public announcements.
Google's first production server rack, circa 1999
Network topology
Though the numbers are not publicly known, some people estimate that Google maintains over 450,000 servers, arranged in racks located in clusters in cities around the world, with major centers in Mountain View, California; Virginia; Atlanta, Georgia; Dublin, Ireland; and new facilities constructed in The Dalles, Oregon[1] and Saint-Ghislain, Belgium.[2] Google plans to open one of its first sites in the upper Midwest in 2009 in Council Bluffs, Iowa, close to abundant wind power resources for fulfilling green energy objectives and near fiber optic communications links.[3]
When an attempt to connect to Google is made, Google's DNS servers perform load balancing to allow the user to access Google's content most rapidly. This is done by sending the user the IP address of a cluster that is not under heavy load, and is geographically proximate to them. Each cluster has thousands of servers, and upon connection to a cluster further load balancing is performed by hardware in the cluster, in order to send the queries to the least loaded Web Server. This makes Google one of the biggest and most complex known content delivery networks.
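A minimal sketch of the two-stage selection described above: first pick a cluster that is nearby and not heavily loaded, then pick the least loaded web server inside it. The load threshold, the distance field, and all names are assumptions made for illustration; the source does not say how Google weighs load against proximity.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    distance_km: float  # assumed: distance between the user and the cluster
    load: float         # assumed: fraction of capacity in use, 0.0 to 1.0


def pick_cluster(clusters, max_load=0.8):
    """Return the nearest cluster under max_load, else the least loaded one."""
    candidates = [c for c in clusters if c.load < max_load]
    if candidates:
        return min(candidates, key=lambda c: c.distance_km)
    return min(clusters, key=lambda c: c.load)


def pick_web_server(server_loads):
    """Second stage inside the cluster: choose the least loaded web server."""
    return min(server_loads, key=server_loads.get)


clusters = [
    Cluster("dublin", 500, 0.95),     # closest, but under heavy load
    Cluster("virginia", 6000, 0.40),
    Cluster("oregon", 8000, 0.20),
]
print(pick_cluster(clusters).name)
print(pick_web_server({"ws1": 0.7, "ws2": 0.3}))
```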
Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side), while new servers are 2U rackmount systems.[4] Each rack has a switch. Servers are connected via a 100 Mbit/s Ethernet link to the local switch. Switches are connected to a core gigabit switch using one or two gigabit uplinks.[citation needed]
The Main index
Since queries are composed of words, an inverted index of documents is required. Such an index allows obtaining a list of documents by a query word. The index is very large due to the number of documents stored in the servers.
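As a rough illustration of the idea, an inverted index maps each word to the set of documents that contain it, so a query word leads directly to its documents. The toy documents and function names below are hypothetical; a production index would also be sharded, compressed, and ranked.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for docid, text in documents.items():
        for word in text.lower().split():
            index[word].add(docid)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "the google linux cluster",
    2: "linux kernel notes",
    3: "cluster computing with linux",
}
index = build_inverted_index(documents)
print(sorted(search(index, "linux cluster")))
```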
Types of servers that Google uses
Google's server infrastructure is divided into several types, each assigned to a different purpose:[4]
- Google DNS Servers answer the DNS requests and serve as intelligent, worldwide load-balancers. They guess the data center nearest to the user to speed up all HTTP requests.
- Google Web Servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.
- Data-gathering servers are permanently dedicated to spidering the Web. They update the index and document databases and apply Google's algorithms to assign ranks to pages.
- Index servers each contain a set of index shards. They return a list of document IDs ("docid"), such that documents corresponding to a certain docid contain the query word. These servers need less disk space, but suffer the greatest CPU workload.
- Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on query words. They can also fetch the complete document when asked. These servers need more disk space.
- Spelling servers make suggestions about the spelling of queries.
Server hardware and software
Original hardware
The original hardware (ca. 1998) that was used by Google when it was located at Stanford University, included:[5]
- Sun Ultra II with dual 200 MHz processors, and 256MB of RAM. This was the main machine for the original Backrub system.
- 2 x 300 MHz Dual Pentium II Servers donated by Intel, which included 512MB of RAM and 9 x 9GB hard drives between the two. It was on these that the main search ran.
- F50 IBM RS/6000 donated by IBM, included 4 processors, 512MB of memory and 8 x 9GB hard drives.
- Two additional boxes included 3 x 9GB hard drives and 6 x 4GB hard drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
- IBM disk expansion box with another 8 x 9GB hard drives donated by IBM.
- Homemade disk box which contained 10 x 9GB SCSI hard drives.
Current hardware
Servers are commodity-class x86 PCs running customized versions of Linux. Indeed, the goal is to purchase CPU generations that offer the best performance per unit of power, not absolute performance. Estimates of the power required for over 450,000 servers range upwards of 20 megawatts, which could cost on the order of US$2 million per month in electricity charges.
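The electricity figures above can be sanity-checked with a short calculation. It assumes a constant 20 megawatt draw (the low end of the quoted range) and a 30-day month, neither of which is stated in the source; the electricity rate it prints is only what those two quoted numbers imply.

```python
def monthly_energy_kwh(power_mw, hours_per_day=24, days=30):
    """Energy used in one month at a constant power draw, in kilowatt-hours."""
    return power_mw * 1000 * hours_per_day * days


energy_kwh = monthly_energy_kwh(20)         # 20 MW running around the clock
implied_rate = 2_000_000 / energy_kwh       # dollars per kWh implied by US$2 million
watts_per_server = 20_000_000 / 450_000     # average draw per server at 20 MW

print(f"{energy_kwh:,} kWh per month")
print(f"${implied_rate:.3f} per kWh")
print(f"{watts_per_server:.1f} W per server")
```

Under these assumptions the two figures are consistent with an electricity price of roughly 14 cents per kilowatt-hour.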
Specifications:
- Over 450,000 servers[1] ranging from a 533 MHz Intel Celeron to a dual 1.4 GHz Intel Pentium III (as of 2005)
- One or more 80GB hard disks per server (2003)
- 2–4 GiB of memory per machine (2004)
The exact size and whereabouts of the data centers Google uses are unknown, and official figures remain intentionally vague. In a 2000 estimate, Google's server farm consisted of 6000 processors, 12,000 common IDE disks (2 per machine, and one processor per machine), at four sites: two in Silicon Valley, California and two in Virginia.[6] Each site had an OC-48 (2488 Mbit/s) internet connection and an OC-12 (622 Mbit/s) connection to other Google sites. The connections are eventually routed down to 4 x 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two ethernet switches.
Project 02
Google is currently developing a supercomputer at a data center located in the town of The Dalles, Oregon, on the Columbia River, approximately 80 miles from Portland. The project, codenamed "Project 02",[7] is expected to substantially add to their current global network capable of processing billions of search queries per day and a growing repertoire of other services.[7] The new complex is approximately the size of two football fields with cooling towers four stories high.
Server operation
Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.[4]
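The sub-query fan-out can be sketched as a scatter-gather step: the same query word is sent to every index shard in parallel and the partial results are merged. The shard layout, the scores, and the function names here are invented for illustration, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def query_shard(shard, word):
    """Sub-query against one shard: return its (docid, score) hits for the word."""
    return shard.get(word, [])


def search_shards(shards, word, top_k=3):
    """Send the sub-query to all shards in parallel, then merge and rank the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: query_shard(shard, word), shards)
        hits = [hit for partial in partials for hit in partial]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_k]


# Hypothetical shards: each maps a word to (docid, score) pairs.
shards = [
    {"linux": [(1, 0.9), (2, 0.1)]},
    {"linux": [(7, 0.5)]},
    {"cluster": [(9, 0.8)]},
]
print(search_shards(shards, "linux", top_k=2))
```

Since every shard answers only for its own slice of the index, the total latency is close to that of the slowest shard rather than the sum of all of them.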
To lessen the effects of unavoidable hardware failure, data stored in the servers may be mirrored using hardware RAID. Software is also designed to be fault tolerant. Thus when a system goes down, data is still available on other servers, which increases the reliability.
References
- ^ a b Carr, David F. "How Google Works." Baseline Magazine. July 6, 2006. Retrieved on July 10, 2006.
- ^ "[1]." Invest Wallonia. April 27, 2007. Retrieved on May 10, 2007
- ^ "[2]." Council Bluffs. July 9, 2007. Retrieved on August 21, 2007
- ^ a b c Web Search for a Planet: The Google Cluster Architecture (Luiz André Barroso, Jeffrey Dean, Urs Hölzle)
- ^ "Google Stanford Hardware." Stanford University (provided by Internet Archive). Retrieved on July 10, 2006.
- ^ Hennessy, John; Patterson, David. (2002). Computer Architecture: A Quantitative Approach. Third Edition. Morgan Kaufmann. ISBN 1-55860-596-7.
- ^ a b Markoff, John; Hansell, Saul. "Google's quasi-secret power play." San Diego Union Tribune. June 14, 2006. Retrieved on July 10, 2006.
Some Related links
- Google Research Publications
- The Google Linux Cluster — Video about Google's Linux cluster
- Web Search for a Planet: The Google Cluster Architecture (Luiz André Barroso, Jeffrey Dean, Urs Hölzle)
- How Google Works
- Original Google Hardware Pictures
How many servers does Google have?
Tristan Louis has a pretty good estimation of how many servers Google has.
An interesting tidbit coming out of the Google S-1 filing is that they have spent about $250 million on hardware equipment. From there, we can get a few guesses at the magnitude of the Google system. Based on quick back of the envelope calculations, it looks like Google is managing between 45,000 and 80,000 servers. Here’s how I arrived at this conclusion:
According to calculations by the IEEE, in a paper about the Google cluster, a rack with 88 dual-CPU machines used to cost about $278,000. If you divide the $250 million figure from the S-1 filing by $278,000, you end up with a bit over 899 racks. Assuming that each rack holds 88 machines, you end up with 79,000 machines.
However, one must recognize that equipment is not all CPUs. As a result, you must discount the figure of $250 million to account for routers, firewalls, machines for employees, etc… So let’s assume for a minute that only about $200 million is going to the CPUs. That still leaves us with 719 racks or a bit over 63,000 machines.
Even if we discount the figure all the way down to $100 million for the servers, we end up with about 31,592 machines on 359 racks.
So how much processing power is that? Well, once again, the Google cluster document provides some interesting tidbits. Per the document, the racks that were used were
88 dual-CPU 2 Ghz Intel Xeon servers with 2 Gbytes of RAM and an 80-Gbytes hard disk.
That means that, on the low end, the Google cluster has the following stats:
- 359 racks
- 31,592 machines
- 63,184 CPUs
- 126,368 GHz of processing power
- 63,184 GB of RAM
- 2,527 TB of Hard Drive space
In the middle range of my estimates, the cluster would have:
- 719 racks
- 63,272 machines
- 126,544 CPUs
- 253,088 GHz of processing power
- 126,544 GB of RAM
- 5,062 TB of Hard Drive space
And on the high end of my estimates:
- 899 racks
- 79,112 machines
- 158,224 CPUs
- 316,448 GHz of processing power
- 158,224 GB of RAM
- 6,329 TB of Hard Drive space
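The three scenarios can be recomputed from the stated inputs ($278,000 per rack, 88 dual-CPU 2 GHz machines per rack, 2 GB of RAM and an 80 GB disk per machine). This sketch assumes the three budgets are $100, $200 and $250 million spent on servers, that partial racks are dropped, and that 1 TB is counted as 1,000 GB; the function and constant names are made up for the example.

```python
RACK_COST = 278_000         # dollars per rack, from the cited cluster paper
MACHINES_PER_RACK = 88
CPUS_PER_MACHINE = 2
GHZ_PER_CPU = 2
RAM_GB_PER_MACHINE = 2
DISK_GB_PER_MACHINE = 80


def estimate(budget_dollars):
    """Estimate cluster size from a hardware budget, counting whole racks only."""
    racks = budget_dollars // RACK_COST
    machines = racks * MACHINES_PER_RACK
    cpus = machines * CPUS_PER_MACHINE
    return {
        "racks": racks,
        "machines": machines,
        "cpus": cpus,
        "ghz": cpus * GHZ_PER_CPU,
        "ram_gb": machines * RAM_GB_PER_MACHINE,
        "disk_tb": round(machines * DISK_GB_PER_MACHINE / 1000),
    }


for label, budget in [("low", 100_000_000), ("middle", 200_000_000), ("high", 250_000_000)]:
    print(label, estimate(budget))
```

With these inputs the low-end case comes to 359 racks and 359 x 88 = 31,592 machines, and the high-end disk total comes to about 6,329 TB.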
Assuming that a 1 GHz chip runs at about a third of the gigaflops of a 2 GHz processor (3.3 Gflops), we can then guess at the size of the Google supercomputer. Just for the sake of argument, let's go with 1 gigaflop per GHz of clock speed. This means that the Google supercomputer has about 126 teraflops of power on the low end of my estimates, 253 teraflops in the middle, and 316 teraflops on the high end. This would easily put it on top of the list of fastest computers in the world.
Any way you slice it, that’s a lot of power.
Google and the Wisdom of Clouds
With Google, Yahoo, IBM, Microsoft and Amazon all working on cloud computing, it's pretty obvious that it's not a fad.
Google's Next Big Dream
Related Items
Google and the Cloud
The Businessweek Cover Story of December 13, 2007 contains the article that inspired this blog. The article is included below:
Google and the Wisdom of Clouds
A lofty new strategy aims to put incredible computing power in the hands of many
One simple question. That's all it took for Christophe Bisciglia to bewilder confident job applicants at Google (GOOG). Bisciglia, an angular 27-year-old senior software engineer with long wavy hair, wanted to see if these undergrads were ready to think like Googlers. "Tell me," he'd say, "what would you do if you had 1,000 times more data?"
What a strange idea. If they returned to their school projects and were foolish enough to cram formulas with a thousand times more details about shopping or maps or—heaven forbid—with video files, they'd slow their college servers to a crawl.
At that point in the interview, Bisciglia would explain his question. To thrive at Google, he told them, they would have to learn to work, and to dream, on a vastly larger scale. He described Google's globe-spanning network of computers. Yes, they answered search queries instantly. But together they also blitzed through mountains of data, looking for answers or intelligence faster than any machine on earth. Most of this hardware wasn't on the Google campus. It was just out there, somewhere on earth, whirring away in big refrigerated data centers. Folks at Google called it "the cloud." And one challenge of programming at Google was to leverage that cloud, to push it to do things that would overwhelm lesser machines. New hires at Google, Bisciglia says, usually take a few months to get used to this scale. "Then one day, you see someone suggest a wild job that needs a few thousand machines, and you say: 'Hey, he gets it.'"
What recruits needed, Bisciglia eventually decided, was advance training. So one autumn day a year ago, when he ran into Google CEO Eric E. Schmidt between meetings, he floated an idea. He would use his 20% time, the allotment Googlers have for independent projects, to launch a course. It would introduce students at his alma mater, the University of Washington, to programming at the scale of a cloud. Call it Google 101. Schmidt liked the plan. Over the following months, Bisciglia's Google 101 would evolve and grow. It would eventually lead to an ambitious partnership with IBM (IBM), announced in October, to plug universities around the world into Google-like computing clouds.
As this concept spreads, it promises to expand Google's footprint in industry far beyond search, media, and advertising, leading the giant into scientific research and perhaps into new businesses. In the process Google could become, in a sense, the world's primary computer.
"I had originally thought [Bisciglia] was going to work on education, which was fine," Schmidt says late one recent afternoon at Google headquarters. "Nine months later, he comes out with this new [cloud] strategy, which was completely unexpected." The idea, as it developed, was to deliver to students, researchers, and entrepreneurs the immense power of Google-style computing, either via Google's machines or others offering the same service.
What is Google's cloud? It's a network made of hundreds of thousands, or by some estimates 1 million, cheap servers, each not much more powerful than the PCs we have in our homes. It stores staggering amounts of data, including numerous copies of the World Wide Web. This makes search faster, helping ferret out answers to billions of queries in a fraction of a second. Unlike many traditional supercomputers, Google's system never ages. When its individual pieces die, usually after about three years, engineers pluck them out and replace them with new, faster boxes. This means the cloud regenerates as it grows, almost like a living thing.
A move towards clouds signals a fundamental shift in how we handle information. At the most basic level, it's the computing equivalent of the evolution in electricity a century ago when farms and businesses shut down their own generators and bought power instead from efficient industrial utilities. Google executives had long envisioned and prepared for this change. Cloud computing, with Google's machinery at the very center, fit neatly into the company's grand vision, established a decade ago by founders Sergey Brin and Larry Page: "to organize the world's information and make it universally accessible." Bisciglia's idea opened a pathway toward this future. "Maybe he had it in his brain and didn't tell me," Schmidt says. "I didn't realize he was going to try to change the way computer scientists thought about computing. That's a much more ambitious goal."
ONE-WAY STREET
For small companies and entrepreneurs, clouds mean opportunity—a leveling of the playing field in the most data-intensive forms of computing. To date, only a select group of cloud-wielding Internet giants has had the resources to scoop up huge masses of information and build businesses upon it. Our words, pictures, clicks, and searches are the raw material for this industry. But it has been largely a one-way street. Humanity emits the data, and a handful of companies—the likes of Google, Yahoo! (YHOO), or Amazon.com (AMZN)—transform the info into insights, services, and, ultimately, revenue.
This status quo is already starting to change. In the past year, Amazon has opened up its own networks of computers to paying customers, initiating new players, large and small, to cloud computing. Some users simply park their massive databases with Amazon. Others use Amazon's computers to mine data or create Web services. In November, Yahoo opened up a cluster of computers—a small cloud—for researchers at Carnegie Mellon University. And Microsoft (MSFT) has deepened its ties to communities of scientific researchers by providing them access to its own server farms. As these clouds grow, says Frank Gens, senior analyst at market research firm IDC, "A whole new community of Web startups will have access to these machines. It's like they're planting Google seeds." Many such startups will emerge in science and medicine, as data-crunching laboratories searching for new materials and drugs set up shop in the clouds.
For clouds to reach their potential, they should be nearly as easy to program and navigate as the Web. This, say analysts, should open up growing markets for cloud search and software tools—a natural business for Google and its competitors. Schmidt won't say how much of its own capacity Google will offer to outsiders, or under what conditions or at what prices. "Typically, we like to start with free," he says, adding that power users "should probably bear some of the costs." And how big will these clouds grow? "There's no limit," Schmidt says. As this strategy unfolds, more people are starting to see that Google is poised to become a dominant force in the next stage of computing. "Google aspires to be a large portion of the cloud, or a cloud that you would interact with every day," the CEO says. The business plan? For now, Google remains rooted in its core business, which gushes with advertising revenue. The cloud initiative is barely a blip in terms of investment. It hovers in the distance, large and hazy and still hard to piece together, but bristling with possibilities.
Changing the nature of computing and scientific research wasn't at the top of Bisciglia's agenda the day he collared Schmidt. What he really wanted, he says, was to go back to school. Unlike many of his colleagues at Google, a place teeming with PhDs, Bisciglia was snatched up by the company as soon as he graduated from the University of Washington, or U-Dub, as nearly everyone calls it. He'd never been a grad student. He ached for a break from his daily routines at Google—the 10-hour workdays building search algorithms in his cube in Building 44, the long commutes on Google buses from the apartment he shared with three roomies in San Francisco's Duboce Triangle. He wanted to return to Seattle, if only for one day a week, and work with his professor and mentor, Ed Lazowska. "I had an itch to teach," he says.
He didn't think twice before vaulting over the org chart and batting around his idea directly with the CEO. Bisciglia and Schmidt had known each other for years. Shortly after landing at Google five years ago as a 22-year-old programmer, Bisciglia worked in a cube across from the CEO's office. He'd wander in, he says, drawn in part by the model airplanes that reminded him of his mother's work as a United Airlines (UAUA) hostess. Naturally he talked with the soft-spoken, professorial CEO about computing. It was almost like college. And even after Bisciglia moved to other buildings, the two stayed in touch. ("He's never too hard to track down, and he's incredible about returning e-mails," Bisciglia says.)
On the day they first discussed Google 101, Schmidt offered one nugget of advice: Narrow down the project to something Bisciglia could have up and running in two months. "I actually didn't care what he did," Schmidt recalls. But he wanted the young engineer to get feedback in a hurry. Even if Bisciglia failed, he says, "he's smart, and he'd learn from it."
To launch Google 101, Bisciglia had to replicate the dynamics and a bit of the magic of Google's cloud—but without tapping into the cloud itself or revealing its deepest secrets. These secrets fuel endless speculation among computer scientists. But Google keeps much under cover. This immense computer, after all, runs the company. It automatically handles search, places ads, churns through e-mails. The computer does the work, and thousands of Google engineers, including Bisciglia, merely service the machine. They teach the system new tricks or find new markets for it to invade. And they add on new clusters—four new data centers this year alone, at an average cost of $600 million apiece.
In building this machine, Google, so famous for search, is poised to take on a new role in the computer industry. Not so many years ago scientists and researchers looked to national laboratories for the cutting-edge research on computing. Now, says Daniel Frye, vice-president of open systems development at IBM, "Google is doing the work that 10 years ago would have gone on in a national lab."
How was Bisciglia going to give students access to this machine? The easiest option would have been to plug his class directly into the Google computer. But the company wasn't about to let students loose in a machine loaded with proprietary software, brimming with personal data, and running a $10.6 billion business. So Bisciglia shopped for an affordable cluster of 40 computers. He placed the order, then set about figuring out how to pay for the servers. While the vendor was wiring the computers together, Bisciglia alerted a couple of Google managers that a bill was coming. Then he "kind of sent the expense report up the chain, and no one said no." He adds one of his favorite sayings: "It's far easier to beg for forgiveness than to ask for permission." ("If you're interested in someone who strictly follows the rules, Christophe's not your guy," says Lazowska, who refers to the cluster as "a gift from heaven.")
A FRENETIC LEARNER
On Nov. 10, 2006, the rack of computers appeared at U-Dub's Computer Science building. Bisciglia and a couple of tech administrators had to figure out how to hoist the 1-ton rack up four stories into the server room. They eventually made it, and then prepared for the start of classes, in January.
Bisciglia's mother, Brenda, says her son seemed marked for an unusual path from the start. He didn't speak until age 2, and then started with sentences. One of his first came as they were driving near their home in Gig Harbor, Wash. A bug flew in the open window, and a voice came from the car seat in back: "Mommy, there's something artificial in my mouth."
At school, the boy's endless questions and frenetic learning pace exasperated teachers. His parents, seeing him sad and frustrated, pulled him out and home-schooled him for three years. Bisciglia says he missed the company of kids during that time but developed as an entrepreneur. He had a passion for Icelandic horses and as an adolescent went into business raising them. Once, says his father, Jim, they drove far north into Manitoba and bought horses, without much idea about how to transport the animals back home. "The whole trip was like a scene from one of Chevy Chase's movies," he says. Christophe learned about computers developing Web pages for his horse sales and his father's luxury-cruise business. And after concluding that computers promised a brighter future than animal husbandry, he went off to U-Dub and signed up for as many math, physics, and computer courses as he could.
In late 2006, as he shuttled between the Googleplex and Seattle preparing for Google 101, Bisciglia used his entrepreneurial skills to piece together a sprawling team of volunteers. He worked with college interns to develop the curriculum, and he dragooned a couple of Google colleagues from the nearby Kirkland (Wash.) facility to use some of their 20% time to help him teach it. Following Schmidt's advice, Bisciglia worked to focus Google 101 on something students could learn quickly. "I was like, what's the one thing I could teach them in two months that would be useful and really important?" he recalls. His answer was "MapReduce."
Bisciglia adores MapReduce, the software at the heart of Google computing. While the company's famous search algorithms provide the intelligence for each search, MapReduce delivers the speed and industrial heft. It divides each task into hundreds, or even thousands, of tasks, and distributes them to legions of computers. In a fraction of a second, as each one comes back with its nugget of information, MapReduce quickly assembles the responses into an answer. Other programs do the same job. But MapReduce is faster and appears able to handle near limitless work. When the subject comes up, Bisciglia rhapsodizes. "I remember graduating, coming to Google, learning about MapReduce, and really just changing the way I thought about computer science and everything," he says. He calls it "a very simple, elegant model." It was developed by another Washington alumnus, Jeffrey Dean. By returning to U-Dub and teaching MapReduce, Bisciglia would be returning this software "and this way of thinking" back to its roots.
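The divide, distribute, and assemble pattern described above can be illustrated with a toy word count. This is only a minimal single-machine sketch, not Google's MapReduce or the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a thread pool stands in for the many computers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into one result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Assumption: threads stand in for separate machines; each maps one chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    # Assemble the per-key results into the final answer.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    chunks = ["the cloud is big", "the web is tiny", "the cloud grows"]
    print(word_count(chunks))
```

Each chunk is mapped independently, which is what lets a real system spread the work across many machines; only the grouping step needs to see every result.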
There was only one obstacle. MapReduce was anchored securely inside Google's machine—and it was not for outside consumption, even if the subject was Google 101. The company did share some information about it, though, to feed an open-source version of MapReduce called Hadoop. The idea was that, without divulging its crown jewel, Google could push for its standard to become the architecture of cloud computing.
The team that developed Hadoop belonged to a company, Nutch, that got acquired. Oddly, they were now working within the walls of Yahoo, which was counting on the MapReduce offspring to give its own computers a touch of Google magic. Hadoop remained open source, though, which meant the Google team could adapt it and install it for free on the U-Dub cluster.
Students rushed to sign up for Google 101 as soon as it appeared in the winter-semester syllabus. In the beginning, Bisciglia and his Google colleagues tried teaching. But in time they handed over the job to professional educators at U-Dub. "Their delivery is a lot clearer," Bisciglia says. Within weeks the students were learning how to configure their work for Google machines and designing ambitious Web-scale projects, from cataloguing the edits on Wikipedia to crawling the Internet to identify spam. Through the spring of 2007, as word about the course spread to other universities, departments elsewhere started asking for Google 101.
Many were dying for cloud knowhow and computing power—especially for scientific research. In practically every field, scientists were grappling with vast piles of new data issuing from a host of sensors, analytic equipment, and ever-finer measuring tools. Patterns in these troves could point to new medicines and therapies, new forms of clean energy. They could help predict earthquakes. But most scientists lacked the machinery to store and sift through these digital El Dorados. "We're drowning in data," said Jeannette Wing, assistant director of the National Science Foundation.
BIG BLUE LARGESSE
The hunger for Google computing put Bisciglia in a predicament. He had been fortunate to push through the order for the first cluster of computers. Could he do that again and again, eventually installing mini-Google clusters in each computer science department? Surely not. To extend Google 101 to universities around the world, the participants needed to plug into a shared resource. Bisciglia needed a bigger cloud.
That's when luck descended on the Googleplex in the person of IBM Chairman Samuel J. Palmisano. This was "Sam's day at Google," says an IBM researcher. The winter day was a bit chilly for beach volleyball in the center of campus, but Palmisano lunched on some of the fabled free cuisine in a cafeteria. Then he and his team sat down with Schmidt and a handful of Googlers, including Bisciglia. They drew on whiteboards and discussed cloud computing. It was no secret that IBM wanted to deploy clouds to provide data and services to business customers. At the same time, under Palmisano, IBM had been a leading promoter of open-source software, including Linux. This was a key in Big Blue's software battles, especially against Microsoft. If Google and IBM teamed up on a cloud venture, they could construct the future of this type of computing on Google-based standards, including Hadoop.
Google, of course, had a running start on such a project: Bisciglia's Google 101. In the course of that one day, Bisciglia's small venture morphed into a major initiative backed at the CEO level by two tech titans. By the time Palmisano departed that afternoon, it was established that Bisciglia and his IBM counterpart, Dennis Quan, would build a prototype of a joint Google-IBM university cloud.
Over the next three months they worked together at Google headquarters. (It was around this time, Bisciglia says, that the cloud project evolved from 20% into his full-time job.) The work involved integrating IBM's business applications and Google servers, and equipping them with a host of open-source programs, including Hadoop. In February they unveiled the prototype for top brass in Mountain View, Calif., and for others on video from IBM headquarters in Armonk, N.Y. Quan wowed them by downloading data from the cloud to his cell phone. (It wasn't relevant to the core project, Bisciglia says, but a nice piece of theater.)
The Google 101 cloud got the green light. The plan was to spread cloud computing first to a handful of U.S. universities within a year and later to deploy it globally. The universities would develop the clouds, creating tools and applications while producing legions of computer scientists to continue building and managing them.
Those developers should be able to find jobs at a host of Web companies, including Google. Schmidt likes to compare the data centers to the prohibitively expensive particle accelerators known as cyclotrons. "There are only a few cyclotrons in physics," he says. "And every one of them is important, because if you're a top-flight physicist you need to be at the lab where that cyclotron is being run. That's where history's going to be made; that's where the inventions are going to come. So my idea is that if you think of these as supercomputers that happen to be assembled from smaller computers, we have the most attractive supercomputers, from a science perspective, for people to come work on."
As the sea of business and scientific data rises, computing power turns into a strategic resource, a form of capital. "In a sense," says Yahoo Research Chief Prabhakar Raghavan, "there are only five computers on earth." He lists Google, Yahoo, Microsoft, IBM, and Amazon. Few others, he says, can turn electricity into computing power with comparable efficiency.
All sorts of business models are sure to evolve. Google and its rivals could team up with customers, perhaps exchanging computing power for access to their data. They could recruit partners into their clouds for pet projects, such as the company's clean energy initiative, announced in November. With the electric bills at jumbo data centers running upwards of $20 million a year, according to industry analysts, it's only natural for Google to commit both brains and server capacity to the search for game-changing energy breakthroughs.
What will research clouds look like? Tony Hey, vice-president for external research at Microsoft, says they'll function as huge virtual laboratories, with a new generation of librarians—some of them human—"curating" troves of data, opening them to researchers with the right credentials. Authorized users, he says, will build new tools, haul in data, and share it with far-flung colleagues. In these new labs, he predicts, "you may win the Nobel prize by analyzing data assembled by someone else." Mark Dean, head of IBM's research operation in Almaden, Calif., says that the mixture of business and science will lead, in a few short years, to networks of clouds that will tax our imagination. "Compared to this," he says, "the Web is tiny. We'll be laughing at how small the Web is." And yet, if this "tiny" Web was big enough to spawn Google and its empire, there's no telling what opportunities could open up in the giant clouds.
It's a mid-November day at the Googleplex. A jetlagged Christophe Bisciglia is just back from China, where he has been talking to universities about Google 101. He's had a busy time, not only setting up the cloud with IBM but also working out deals with six universities—U-Dub, Berkeley, Stanford, MIT, Carnegie Mellon, and the University of Maryland—to launch it. Now he's got a camera crew in a conference room, with wires and lights spilling over a table. This is for a promotional video about cloud education that they'll release, at some point, on YouTube (GOOG).
Eric Schmidt comes in. At 52, he is nearly twice Bisciglia's age, and his body looks a bit padded next to his protégé's willowy frame. Bisciglia guides him to a chair across from the camera and explains the plan. They'll tape the audio from the interview and then set up Schmidt for some stand-alone face shots. "B-footage," Bisciglia calls it. Schmidt nods and sits down. Then he thinks better of it. He tells the cameramen to film the whole thing and skip stand-alone shots. He and Bisciglia are far too busy to stand around for B footage.
Baker is a senior writer for BusinessWeek in New York.
About this blog
This blog is all about Cloud computing.
Cloud computing is a paradigm shift in which computing moves away from personal computers or individual servers to a “cloud” of computers. Users of the cloud need only be concerned with the computing service they ask for, because the underlying details of how it is delivered are hidden. This form of distributed computing works by pooling computer resources and managing them with software rather than by hand.
The services requested of a cloud are not limited to web applications; they can also be IT management tasks, such as requesting systems, a software stack, or a specific web appliance.
This simplifies IT management and increases the efficiency of system resources. IT administrators no longer need to install software and set up every system manually; management software does it for them. Resources are used more efficiently because computers can be consolidated to handle more tasks, so underutilized systems do not sit idle.
Cloud computing architecture
The architecture behind cloud computing is a massive network of "cloud servers" interconnected as if in a grid running in parallel, sometimes using the technique of virtualization to maximize computing power per server.
It is made up of a front-end interface that lets a user select a service from a catalog. The request is passed to system management, which finds the right resources and then calls the provisioning service, which carves out resources in the cloud. The provisioning service may also deploy the requested stack or web application.
User Interaction Interface: This is how users of the cloud interface with the cloud to request services.
Services Catalog: This is the list of services which a user could request.
System Management: This is the piece which manages the computer resources available.
Provisioning Tool: This tool carves out the systems from the cloud to deliver on the requested service. It may also deploy the required images.
Monitoring & Metering: This optional piece tracks the usage of the cloud so the resources used can be attributed to a certain user.
Servers: The servers get managed by the system management tool. They can be either virtual or real.
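To make the flow between these components concrete, here is a minimal sketch of one request passing through them. It assumes a much simpler model than any real cloud product: the `Cloud` and `Server` classes, their method names, and the idea that a catalog entry is just a server count are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    virtual: bool = False   # servers can be either virtual or real
    in_use: bool = False


@dataclass
class Cloud:
    catalog: dict                               # service name -> servers required
    servers: list                               # pool handled by system management
    usage: dict = field(default_factory=dict)   # user -> servers allocated so far

    def request(self, user, service):
        """User interaction interface: accept a request for a catalog service."""
        if service not in self.catalog:
            raise KeyError(f"unknown service: {service}")
        free = self._find_free(self.catalog[service])
        allocated = self._provision(free, service)
        self._meter(user, len(allocated))
        return allocated

    def _find_free(self, count):
        """System management: locate enough idle servers for the request."""
        free = [s for s in self.servers if not s.in_use][:count]
        if len(free) < count:
            raise RuntimeError("not enough free servers")
        return free

    def _provision(self, servers, service):
        """Provisioning tool: carve the chosen servers out of the pool."""
        for server in servers:
            server.in_use = True
        # A real tool would also deploy the image or stack for `service` here.
        return [server.name for server in servers]

    def _meter(self, user, count):
        """Monitoring and metering: attribute the resources used to the user."""
        self.usage[user] = self.usage.get(user, 0) + count
```

For example, with a catalog of `{"web-app": 2}` and three idle servers, `cloud.request("alice", "web-app")` returns the names of the first two servers and records two servers against "alice" in `cloud.usage`.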
Cloud storage is a model of networked data storage where data is stored on multiple virtual servers, generally hosted by third parties, rather than on dedicated servers. Hosting companies operate large data centers, and people who need their data hosted buy or lease storage capacity from them. The data center operators, in the background, virtualize the resources according to the customer's requirements and expose them as virtual servers, which the customers can manage themselves. Physically, the resource may span multiple servers.
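The idea that one logical resource may physically sit on several servers can be sketched as follows. This is an illustrative toy, not how any particular hosting company stores data: the `VirtualVolume` class is hypothetical, each "server" is just an in-memory dictionary, and the tiny chunk size is chosen only to make the striping visible.

```python
class VirtualVolume:
    """One logical volume whose bytes are striped across several backends."""

    def __init__(self, backend_count, chunk_size=4):
        # Assumption: each dict stands in for one physical server.
        self.backends = [{} for _ in range(backend_count)]
        self.chunk_size = chunk_size
        self.index = {}   # object name -> number of chunks stored

    def put(self, name, data: bytes):
        """Split the data into chunks and spread them round-robin."""
        chunks = [data[i:i + self.chunk_size]
                  for i in range(0, len(data), self.chunk_size)]
        for n, chunk in enumerate(chunks):
            self.backends[n % len(self.backends)][(name, n)] = chunk
        self.index[name] = len(chunks)

    def get(self, name) -> bytes:
        """Fetch every chunk from its backend and reassemble them in order."""
        count = self.index[name]
        return b"".join(self.backends[n % len(self.backends)][(name, n)]
                        for n in range(count))
```

The customer only ever calls `put` and `get` on the volume; which backend holds which chunk stays hidden, which mirrors the way the operator virtualizes the underlying hardware.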
Cloud services are all Web services offered via Cloud computing.










