My notes from the Google Scalability Conference

No, I didn’t go to it.  I had planned to, but then a more important family (drinking) event came up.  But I spent the past few days watching all of the session videos that are on Google Video.  Most of these are very informative and worth watching.  The only one I couldn’t find was Werner Vogels’ talk.  If anyone knows where I can find that, please let me know.

Abstractions for Handling Large Datasets by Jeff Dean (Google)
– Google always needs more computers than it has available
– Computing platform (currently dual-CPU dual-core machines, cheap disks; still cheaper to talk within a rack than across racks)
– System Infrastructure (GFS, MapReduce, BigTable)
– GFS: 200+ clusters of up to 5000 machines each, some with 5+ PB file systems and 40 GB/s read/write load
– MapReduce/BigTable: server failures handled just like GFS node failures.  Remaining nodes each take a small percentage of the failed worker’s MapReduce tasks or BigTable tablets. (toy sketch of the map/reduce model at the end of this section)
– BigTable: 500 BigTable instances running; 70 in production; largest stores 3000 TB of data; working on a unified multi data center BigTable implementation
– The right hardware and software infrastructure allows small product teams to accomplish large things.
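
Just to pin the model down in my notes, here is a toy single-machine sketch of the map/reduce idea in Python (my own illustration, obviously nothing like Google’s actual C++ implementation):

    # Toy, single-process illustration of the MapReduce model (not Google's API).
    from collections import defaultdict

    def map_fn(doc_id, text):
        # map: emit (word, 1) for every word in the document
        for word in text.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        # reduce: sum all the counts emitted for this word
        return word, sum(counts)

    def run_mapreduce(inputs, map_fn, reduce_fn):
        intermediate = defaultdict(list)
        for doc_id, text in inputs.items():              # map phase
            for key, value in map_fn(doc_id, text):
                intermediate[key].append(value)          # shuffle: group by key
        return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

    docs = {"a.txt": "the cat sat", "b.txt": "the dog sat"}
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}

In the real system each map and reduce task runs on a different machine, which is why the failure handling above (remaining nodes picking up a dead worker’s tasks) falls out so naturally.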

Scaling Google for Every User by Marissa Mayer (Google)
– biggest thing they learned from early A/B testing was that speed absolutely matters
– Google Experimental
– the need to understand what people mean, not what they say
– international searches can search English content by auto-translating the query from the user’s native language to English, running the search, then translating the results back into the native language (rough sketch at the end of this section)
– scaling Universal Search is hard (she defined the problems they had to solve, but not the solution)
– Universal Search was a major infrastructure and core engine change.
– personalization (location-specific and user-specific page-rank)
– some of the best product feedback comes from Google employees
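
The cross-language search flow she described is basically a three-step pipeline; a rough sketch, where the translate and search functions are placeholders I made up rather than any real Google API:

    # Rough sketch of the cross-language search flow (placeholder functions,
    # not a real Google API): translate the query to English, search English
    # content, then translate the results back to the user's language.
    def cross_language_search(query, user_lang, translate, search):
        english_query = translate(query, source=user_lang, target="en")
        english_results = search(english_query)
        return [translate(r, source="en", target=user_lang) for r in english_results]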

MapReduce Used on Large Data Sets by Barry Brumitt (Google)
– dumb jokes
– interesting problems that can be solved by MapReduce which the authors never expected
– mostly geographic data examples
– mentioned that Hadoop is very similar and appears to be just as useful
– multiple MapReduce jobs can run on the same machine at the same time; users request a "memory budget" and processes that exceed it get killed (see the illustration after this list)
– only use MapReduce for data processing; keep data serving as simple and cached as possible
– awesome slide on Google’s dev processes:
  single source code tree
  anyone can check out Gmail source and compile it!
  mandatory code reviews before commit
  40 page coding style guidelines that must be followed
  developers carry pagers when their code is released, until it proves to be stable
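
The "memory budget" bit is worth remembering.  I have no idea how Google actually enforces it, but on Linux you can get a similar effect for a single worker process with an rlimit, something like:

    # Illustration only: cap a worker's address space so it dies (MemoryError
    # or outright kill) if it blows past its "memory budget".  Unix-only.
    import resource

    def set_memory_budget(max_bytes):
        # RLIMIT_AS limits the total virtual memory this process may use.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    set_memory_budget(512 * 1024 * 1024)   # e.g. a 512 MB budget for this worker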

Building a Scalable Resource Management System by Khalid Ahmed (Platform Computing)
– very enterprisish grid computing
– lots of acronyms and product names
– got bored and stopped watching

SCTP’s Reliability and Fault Tolerance by Brad Penoff, Mike Tsai, and Alan Wagner
– Stream Control Transmission Protocol = another transport protocol like TCP and UDP
– can use multiple NICs with different IPs and routes for failover, and for concurrent throughput if using CMT (Concurrent Multipath Transfer)
– message based, like UDP
– multiple concurrent streams allow messages to arrive out of order (ordered only within each stream), rather than the single ordered stream TCP gives you
– selective acknowledgment ("SACK")
– stronger checksum: CRC32 (32-bit) vs TCP’s 16-bit checksum
– supports the standard sockets API, which makes porting existing apps simple: TCP-like 1-to-1 and UDP-like 1-to-many styles (quick client sketch at the end of this section)
– extended sockets API supports the new features above
– same congestion control as TCP, so still has "slow-start" problem
– SCTP can simplify buffering and prefetching requirements needed to optimize task flow in large MapReduce-style worker systems
– SCTP’s SACK and CRC32 make it more reliable than TCP over lossy connections
– Already available in Linux (experimental), FreeBSD/OSX, Solaris 10, Cisco Routers; not Windows
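
Since they stressed that the standard sockets API carries over, here is what a minimal TCP-style (one-to-one) SCTP client looks like in Python; note this only works on a platform whose kernel ships SCTP support, otherwise socket.IPPROTO_SCTP doesn’t even exist (the address and message are just examples of mine):

    # Minimal one-to-one (TCP-style) SCTP client.  Needs OS/kernel SCTP
    # support; socket.IPPROTO_SCTP is not defined on e.g. Windows.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    sock.connect(("127.0.0.1", 9999))     # example address, not from the talk
    sock.send(b"hello over SCTP")
    print(sock.recv(1024))
    sock.close()

Porting a TCP app really is mostly a matter of swapping in that third socket() argument; the UDP-like one-to-many style and the multi-streaming features need the extended API.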

Lessons In Building Scalable Systems by Reza Behforooz (Google)
– Google Talk architecture
– presence is the big load contributor, not the number of messages (# presence packets = ConnectedUsers * BuddyListSize * StateChanges; rough numbers plugged in at the end of this section)
– how they handled sudden large increases in traffic such as when they integrated with GMail and Orkut
– live load tests by using hidden GTalk logins from Gmail to test load
– users tied to specific backend "shards" (machines?)
– Gmail and Orkut don’t know which data center the GTalk servers are in, or really anything about its infrastructure; the communication complexities are abstracted into a set of gateways.  This pattern is used in a lot of Google apps.  Friggin cool.
– more mention of epoll greatness for managing lots of open TCP connections
– work hard to smooth out load spikes and prevent cascading problems (back off from busy servers, don’t accept work when sick), plus isolation from other apps
– measure everything in live system
– a single userbot can spike the load on a server?? But it sounds like they smooth out the spikes by backing off other traffic from that server
– experimental framework to enable beta features for a small # of users
– all of their GTalk team has access to the production machines, and that is good.  And this is true all over Google. wtf?
– bad things do happen from everyone having access to the production machines, particularly from "experiments" done by engineers, but they try to minimize it.  The experimental framework probably minimizes the effect on the masses.
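
That presence formula is worth plugging numbers into.  With completely made-up numbers (mine, not from the talk) you can see why presence dwarfs the actual chat messages:

    # Back-of-the-envelope presence fan-out, using made-up numbers:
    # presence packets ~= connected_users * buddy_list_size * state_changes
    connected_users = 1_000_000
    buddy_list_size = 20
    state_changes   = 10      # per user per day (login, logout, idle, ...)

    presence_packets = connected_users * buddy_list_size * state_changes
    print(f"{presence_packets:,} presence packets per day")   # 200,000,000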

YouTube Scalability by Cuong Do Cuong
– Pre-Google acquisition (11/2006): 2 sysadmins, 2 scalability architects, 2 feature developers, 2 network engineers, 1 DBA
– most time spent just trying to stay alive: hack a solution to fix a bottleneck, sleep, find another bottleneck, repeat
– still have scaling issues while at Google; new bottlenecks discovered almost daily
– SUSE Linux, apache, python, psyco, C extensions, lots of caching
– each video is tied to a specific "mini-cluster"
– switched from Apache httpd to lighttpd; a lot faster; again somebody mentioned epoll is dope
– requests for most popular content go to CDN, lightly viewed content go to a YouTube data center
– serving thumbnails is difficult: large # of requests, random seeks for small objects
– thumbnail servers: originally used ext3, then reiserfs
– thumbnail servers: first apache, then added squid front-ends, then lighttpd, now BigTable
– squid performance degraded over time, needed frequent restarts
– patched lighttpd (aio): main thread does the fast stuff (epoll and reading from memory) and delegates disk access to worker threads
– databases: stores just meta data, started with MySQL (one master, one slave), optimizations: memcached, batched jobs, added read slaves
– MySQL: replica lag solved by "cache priming": selecting the rows right before the replicated updates touch them, eliminating most cache misses (sketch at the end of this section)
– MySQL: to stay alive during bad times, put popular video meta data in reliable cluster and less popular in less reliable cluster
– five hardware RAID 1 volumes with Linux software RAID 0 on top is ~20% faster than a hardware RAID 10 on identical hardware, because Linux gets more opportunities for better IO scheduling
– MySQL: finally partitioned into several databases by user
– videos partitioned between data centers, not replicated, arbitrary placement (not geographic)
– Google BigTable is replicated between data centers (this is the first confirmation I have heard of this)
– in general they solved lots of interesting scaling problems that are not all that different from the problems we try to solve
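
The "cache priming" trick is the one I want to remember: the replica runs a SELECT for the rows an incoming replicated UPDATE is about to touch, so the pages are already in memory when the single replication thread applies the write.  Roughly (made-up schema, obviously not YouTube’s code):

    # Sketch of the "cache priming" idea (made-up table/columns, not YouTube's
    # code): before the replication thread applies an UPDATE, run an
    # equivalent SELECT so the rows are already in the buffer pool.
    def prime_cache_for_update(cursor, video_ids):
        placeholders = ",".join(["%s"] * len(video_ids))
        cursor.execute(
            "SELECT * FROM video_metadata WHERE video_id IN (%s)" % placeholders,
            video_ids,
        )
        cursor.fetchall()   # discard the rows; the point is just warming the cache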

Lustre File System by Peter Braam
– 100 people, virtual company, no VC $
– runs on 15% of top500 fastest computers, 6 of top 10
– features: POSIX, HA, security, modifiable, multi data center (future)
– huge variety of hardware in use with Lustre
– stripes files across many servers running the "ldiskfs" file system (toy striping sketch at the end of this section)
– ldiskfs is just a Linux ext4 partition with special use (Solaris ZFS version coming soon)
– they do a lot of development work with ext4 community, and soon possibly porting ZFS to Linux
– 50% of Lustre code is networking related; can saturate 90%+ of a GigE link (118 MB/s)
– future: multiple geographically dispersed Lustre clusters that talk amongst each other
– concurrently writing multiple files in older ext3 yields poor IO, newer ext3 (2.6.x kernel) and now ext4 fix this
– new block allocator in ext4 (almost in kernel) focused on dealing with writing small files in 1-4MB chunks; similar optimizations in ReiserFS but without the unnecessarily complicated b-tree madness
– large memory and caching on meta data servers is good
– clients get immediate "success" for writes, but in the background wait for commit callback else replay the transaction
– meta data stored on modified file system, hashes and attributes in file path make it as fast as normal file access
– most customers are in the US Dept. of Defense; a large mix of random IO and sequential IO
– we can buy a large Lustre cluster for only $100 million
– future throughput and scaling problems that the US Govt wants Lustre to solve is amazing
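
To make the striping point concrete for myself, a toy mapping from a file offset to a stripe object (my own simplification, not Lustre’s actual layout code):

    # Toy model of striping a file across object storage servers
    # (my simplification, not Lustre's real layout logic).
    def stripe_location(offset, stripe_size=1 * 1024 * 1024, stripe_count=4):
        """Return (server index, offset within that server's object)."""
        chunk = offset // stripe_size                 # which 1 MB chunk of the file
        server = chunk % stripe_count                 # round-robin across servers
        object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
        return server, object_offset

    print(stripe_location(0))                         # (0, 0)
    print(stripe_location(5 * 1024 * 1024 + 7))       # chunk 5 lands on server 1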

VeriSign’s Global DNS Infrastructure by Patrick Quaid
– overview of how recursive DNS queries work
– custom DNS app called ATLAS (Advanced Transaction Look-up and Signaling)
– .com/.net registry is stored in traditional replicated Oracle database
– DNS system not tied to registry database
– load balancing: BGP IP anycast (a single IP routed to 30+ locations), plus local load balancing
– fanatical about simplicity: everything is modular, small, isolated
– no spare servers or spare sites; every server takes load
– plan for 100 physical sites, mostly small; can quickly remove sites from the anycast and frequently do
– have way more servers than are needed
– claims to have second DNS server code running live next to ATLAS in case ATLAS dies. I don’t believe him… probably just spinning the fact that their legacy code hasn’t been shut off yet, but they’d never fail back to it.
– 4 million updates per day; updates pushed to all servers every 15 seconds (quick math at the end of this section)
– updates happen via a master node that pulls updates, crunches them, then ships a file to one node at each data center; each data center’s main node then updates the other local nodes
– monitoring GUI: 265K queries per second (on this Saturday); noon eastern time has peak traffic; shows top query domains; lag time for updates
– ATLAS also runs managed DNS, cert revocation, WHOIS for .com
– they have "VIP Domains" that get a lot of traffic and have extra security precautions
– resolver architecture has front-end server that receives/filters/bundles queries and sends them to backend ATLAS servers
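
Quick sanity check on those update numbers (the inputs are from the talk, the arithmetic is mine):

    # 4 million updates/day, pushed to every server every 15 seconds.
    updates_per_day = 4_000_000
    push_interval_s = 15

    per_second = updates_per_day / 86_400        # ~46 updates/sec on average
    per_push   = per_second * push_interval_s    # ~695 updates per 15 s batch
    print(f"~{per_second:.0f}/sec average, ~{per_push:.0f} per push")

So on average each 15-second push only carries on the order of 700 changes, which makes the "ship a file to one node per data center" fan-out sound a lot less scary.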

Scalable Test Selection Using Source Code Deltas by Ryan Gerard (Symantec)
– the audio was off by 30 sec on this video and it seemed less interesting to me than the others, so I didn’t watch it.
