Tag Archives: tech

Ubuntu pre-release updates are good.

Since upgrading my laptop (a Dell Latitude 420) last month to Gutsy (code name for Ubuntu 7.10), my wireless network has been crapping out a lot.  I try to never shut down; I use suspend instead.  But about one out of five times when I wake the laptop back up from suspend, the wireless connection breaks beyond repair.  Restarting the network services doesn’t fix it, and reloading the driver doesn’t fix it; only a reboot fixes it.

So after much frustration, I configured Update Manager to install pre-release updates in addition to the standard updates to see if there was a fix that hasn’t officially been released yet.  Within minutes it said there was a pre-release update to Network Manager.  I installed it… problem solved.  It has been 3 days and the wireless network has been perfect.  No weirdness from the 100 or so other pre-release updates it’s pulled down either.

Conclusion: pre-release updates are good.

System > Administration > Synaptic Package Manager > Settings > Repositories > Updates > Pre-release updates (gutsy-proposed)

Amazon’s larger EC2 servers

Amazon now has bigger, bad-ass EC2 virtual servers.  I am curious how powerful these new options really are, though.  We use Amazon’s single-core $0.10/hr EC2 instances within our data backup system to do localized processing on the backup data store.  They work fine for this purpose.  However, we have tried in the past to run a CPU-intensive app on EC2, and that brought our EC2 instances to their knees compared to "equivalent" real servers.

For example, we toyed with the idea of building a spam filtering overflow system on EC2 to dynamically add capacity to our spam filtering cluster here at Rackspace, as a precautionary measure in case our customers’ inbound mail volume were to ever spike beyond what our real servers could handle.  We actually got as far as coding the complete system for scaling out the appropriate number of EC2 virtual spam filtering servers, rerouting subsets of our mail stream, and scaling back in after the traffic spike ended.  However, we never could get this system to perform the way we needed it to.  The EC2 instances could not process anywhere near the amount of mail needed to make this cost-efficient.  With the CPU pegged, it took four EC2 virtual servers to equal the spam-processing power of one real single-CPU Athlon 3200+ server with 1 GB of RAM.

So we shelved this project and just added a ton of dual-Opteron boxes in order to be able to sustain large traffic spikes.

Amazon says that their $0.10/hr EC2 instances are equivalent to an "early-2006 1.7 GHz Xeon processor", but for our app one of these EC2 instances was equivalent to 1/4 of a 2005 Athlon processor.  So their new 4-core instance should just about equal a single-CPU Athlon box from our perspective.  But at $0.40/hr, the cost works out the same as before.
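
For what it’s worth, here is the back-of-envelope math behind that claim as a tiny Python sketch; the rates and the 4-to-1 ratio come from the paragraphs above, nothing else is measured:

    small_rate = 0.10          # $/hr for a small EC2 instance
    four_core_rate = 0.40      # $/hr for the new 4-core instance
    smalls_per_athlon = 4      # what we measured for our spam filtering app

    cost_old = small_rate * smalls_per_athlon   # four smalls = one Athlon box
    cost_new = four_core_rate * 1               # one 4-core instance = roughly one Athlon box
    print(cost_old, cost_new)                   # 0.4 0.4, same price either way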

Disclaimer: We did not spend a lot of time fine-tuning our EC2 image, so perhaps we could have seen better performance if we had.  However, we built our image by starting with their standard Red Hat base image, which is what the majority of EC2 customers run.

Calendar trick for Webmail + Blackberry users

I use webmail and Blackberry as my primary email clients.  And I am a heavy user of the webmail shared calendaring system.  The webmail calendar does not yet sync to Blackberry devices, so here is how I "sync"…

A feature of the Blackberry is that it automatically adds any meeting invitation received via IMAP to your Blackberry calendar.  So all invitations that I receive from other people already appear on my Blackberry calendar automatically.  To add my own events to both calendars, all I do is add the event via webmail and invite myself to it… it then automatically shows up on my Blackberry calendar too.
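
If you are curious what actually travels over IMAP when you do this, it is typically just an email carrying a text/calendar (iCalendar) part that lists you as the attendee.  Here is a rough Python sketch of such a self-invite; the address, event details, and SMTP server are all made up:

    import smtplib
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    ME = "me@example.com"  # hypothetical address

    ics = "\r\n".join([
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//calendar-trick//EN",
        "METHOD:REQUEST",
        "BEGIN:VEVENT",
        "UID:20071001T120000-0001@example.com",
        "DTSTART:20071001T150000Z",
        "DTEND:20071001T160000Z",
        "SUMMARY:Team meeting",
        "ORGANIZER:mailto:" + ME,
        "ATTENDEE;RSVP=TRUE:mailto:" + ME,   # inviting myself
        "END:VEVENT",
        "END:VCALENDAR",
    ])

    msg = MIMEMultipart("mixed")
    msg["From"] = ME
    msg["To"] = ME
    msg["Subject"] = "Team meeting"
    msg.attach(MIMEText("Meeting invite", "plain"))
    invite = MIMEText(ics, "calendar")        # Content-Type: text/calendar
    invite.set_param("method", "REQUEST")
    msg.attach(invite)

    smtplib.SMTP("localhost").sendmail(ME, [ME], msg.as_string())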

Prediction: meeting invite spam

My notes from the Google Scalability Conference

No, I didn’t go to it.  I had planned to, but then a more important family (drinking) event came up.  But I spent the past few days watching all of the session videos that are on Google Video.  Most of these are very informative and worth watching.  The only one I couldn’t find was Werner Vogels’ talk.  If anyone knows where I can find that, please let me know.

Abstractions for Handling Large Datasets by Jeff Dean (Google)
– Google always needs more computers than it has available
– Computing platform (currently uses dual-CPU, dual-core machines and cheap disks; it is still cheaper to talk within a rack than across racks)
– System Infrastructure (GFS, MapReduce, BigTable)
– GFS: 200+ clusters of up to 5000 machines each, some with 5+ PB file systems and 40 GB/s read/write load (quick per-machine math follows these notes)
– MapReduce/BigTable: server failures handled just like GFS node failures.  Remaining nodes each take a small percentage of the failed worker’s MapReduce tasks or BigTable tablets.
– BigTable: 500 BigTable instances running; 70 in production; largest stores 3000 TB of data; working on a unified multi data center BigTable implementation
– The right hardware and software infrastructure allows small product teams to accomplish large things.
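
Quick per-machine math on those GFS numbers (only the 5000-machine and 40 GB/s figures come from the talk):

    machines = 5000
    aggregate_gb_per_s = 40
    per_machine_mb_per_s = aggregate_gb_per_s * 1024.0 / machines
    print(round(per_machine_mb_per_s, 1))   # ~8.2 MB/s of load per machine, on average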

Scaling Google for Every User by Marissa Mayer (Google)
– biggest thing they learned from early A/B testing was that speed absolutely matters
– Google Experimental
– the need to understand what people mean, not what they say
– international searches can search English content by auto-translating the query from the user’s native language to English, performing the search, then translating the results back to the native language
– scaling Universal Search is hard (she defined the problems they had to solve, but not the solution)
– Universal Search was a major infrastructure and core engine change.
– personalization (location-specific and user-specific page-rank)
– some of the best product feedback comes from Google employees

MapReduce Used on Large Data Sets by Barry Brumitt (Google)
– dumb jokes
– interesting problems that can be solved by MapReduce which the authors never expected (a toy word-count sketch of the model follows these notes)
– mostly geographic data examples
– mentioned that Hadoop is very similar and appears to be just as useful
– multiple MapReduces can run on same machine at same time; users request a "memory budget" and processes get killed if exceeded
– only use MapReduce for data processes; keep data serving as simple and cached as possible
– awesome slide on Google’s dev processes:
  single source code tree
  anyone can check out Gmail source and compile it!
  mandatory code reviews before commit
  40-page coding style guidelines that must be followed
  developers carry pagers when their code is released, until it proves to be stable
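
Here is a toy, single-machine sketch of the MapReduce model this talk is about, using the canonical word-count example.  It is nothing like Google’s (or Hadoop’s) real implementation; it only shows the map/reduce contract:

    from collections import defaultdict

    def map_fn(_, line):
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        yield word, sum(counts)

    def map_reduce(records, map_fn, reduce_fn):
        # "Map" phase: emit intermediate key/value pairs.
        intermediate = defaultdict(list)
        for key, value in records:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
        # "Reduce" phase: combine all values emitted for each key.
        results = {}
        for key, values in intermediate.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    lines = enumerate(["the quick brown fox", "the lazy dog"])
    print(map_reduce(lines, map_fn, reduce_fn))   # {'the': 2, 'quick': 1, ...}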

Building a Scalable Resource Management System by Khalid Ahmed (Platform)
– very enterprise-ish grid computing
– lots of acronyms and product names
– got bored and stopped watching

SCTP’s Reliability and Fault Tolerance by Brad Penoff, Mike Tsai, and Alan Wagner
– Stream Control Transmission Protocol = another transport protocol like TCP and UDP
– can use multiple NICs with different IPs and routes for failover, and for concurrent throughput if using CMT (Concurrent Multipath Transfer)
– message based, like UDP
– multiple concurrent streams allow messages to arrive out of order (ordered only within each stream), rather than one ordered stream like TCP
– selective acknowledgment ("SACK")
– stronger checksum: TCP uses a 16-bit checksum, SCTP uses a 32-bit CRC32
– supports the standard sockets API, which makes porting existing apps simple: TCP-like 1-to-1 and UDP-like 1-to-many styles (a minimal example follows these notes)
– extended sockets API supports the new features above
– same congestion control as TCP, so still has "slow-start" problem
– SCTP can simplify buffering and prefetching requirements needed to optimize task flow in large MapReduce-style worker systems
– SCTP’s SACK and CRC32 make it more reliable than TCP over lossy connections
– already available in Linux (experimental), FreeBSD/OS X, Solaris 10, and Cisco routers; not in Windows
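
To make the "TCP-like 1-to-1" point concrete, here is a minimal sketch using Python’s standard sockets API.  It assumes a Linux box with kernel SCTP support loaded and a Python build that exposes socket.IPPROTO_SCTP; otherwise it will fail at socket creation:

    import socket

    # One-to-one style SCTP socket: same calls as TCP, different protocol number.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    srv.bind(("127.0.0.1", 9999))
    srv.listen(1)

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    cli.connect(("127.0.0.1", 9999))
    conn, _ = srv.accept()

    cli.send(b"hello over SCTP")   # message-oriented underneath, but usable like a stream
    print(conn.recv(1024))

    for s in (cli, conn, srv):
        s.close()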

Lessons In Building Scalable Systems by Reza Behforooz (Google)
– Google Talk architecture
– presence is the big load contributor, not the number of messages (# presence packets = ConnectedUsers * BuddyListSize * StateChanges; a quick back-of-envelope follows these notes)
– how they handled sudden large increases in traffic such as when they integrated with GMail and Orkut
– live load tests by using hidden GTalk logins from Gmail to test load
– users tied to specific backend "shards" (machines?)
– Gmail and Orkut don’t know which data center GTalk servers are in, or really have any knowledge about its infrastructure. Communication complexities are abstracted into a set of gateways.  This is used in a lot of Google apps.  Friggin cool.
– more mention of epoll greatness for managing lots of open TCP connections
– work hard to smooth out load spikes and prevent cascading problems (back off from busy servers, don’t accept work when sick), plus isolation from other apps
– measure everything in live system
– single userbot can spike the load on a server?? .. but it sounds like they smooth out the spikes by backing off other traffic from that server
– experimental framework to enable beta features for a small # of users
– all of their GTalk team has access to the production machines, and that is good.  And this is true all over Google. wtf?
– bad things do happen from everyone having access to the production machines, particularly from "experiments" done by engineers, but they try to minimize it.  The experimental framework probably minimizes the effect on the masses.
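
Plugging made-up numbers into that presence formula shows why it dominates (only the formula itself comes from the talk):

    connected_users = 1000000
    buddy_list_size = 20
    state_changes = 4     # sign-ins, idle transitions, status message edits, ...

    presence_packets = connected_users * buddy_list_size * state_changes
    print(presence_packets)   # 80,000,000 packets for one round of state changes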

YouTube Scalability by Cuong Do
– Pre-Google acquisition (11/2006): 2 sysadmins, 2 scalability architects, 2 feature developers, 2 network engineers, 1 DBA
– most time spent just trying to stay alive: hack a solution to fix a bottleneck, sleep, find another bottleneck, repeat
– still have scaling issues while at Google; new bottlenecks discovered almost daily
– SUSE Linux, apache, python, psyco, C extensions, lots of caching
– each video is tied to a specific "mini-cluster"
– switched from Apache httpd to lighttpd; a lot faster; again somebody mentioned epoll is dope
– requests for the most popular content go to a CDN; lightly viewed content goes to a YouTube data center
– serving thumbnails is difficult: large # of requests, random seeks for small objects
– thumbnail servers: originally used ext3, then reiserfs
– thumbnail servers: first apache, then added squid front-ends, then lighttpd, now BigTable
– squid performance degraded over time, needed frequent restarts
– patched lighttpd (aio) so the main thread handles the fast stuff (epoll, reading from memory) and delegates disk access to worker threads
– databases: stores just meta data, started with MySQL (one master, one slave), optimizations: memcached, batched jobs, added read slaves
– MySQL: replica lag solved by "cache priming": selecting the affected rows right before the updates hit, eliminating most cache misses (a rough sketch of the idea follows these notes)
– MySQL: to stay alive during bad times, put popular video meta data in reliable cluster and less popular in less reliable cluster
– five hardware RAID 1 volumes with Linux software RAID 0 on top is 20% faster than hardware RAID 10 on identical hardware, because of opportunities for better IO scheduling in Linux
– MySQL: finally partitioned into several databases by user
– videos partitioned between data centers, not replicated; arbitrary placement (not geographic)
– Google BigTable is replicated between data centers (this is the first confirmation I have heard of this)
– in general they solved lots of interesting scaling problems that are not all that different from the problems we try to solve
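
The "cache priming" trick is the item I found most clever.  Here is a rough sketch of the general idea, not YouTube’s actual code; the table, column name, and DB-API connection are hypothetical:

    def prime_and_apply(replica, table, key_column, key, update_sql, params):
        """Warm the replica's cache for the rows an update touches, then apply it."""
        cur = replica.cursor()
        # Touch the rows the replicated UPDATE is about to modify, so the pages
        # are already in the buffer pool when the write lands.
        cur.execute("SELECT * FROM " + table + " WHERE " + key_column + " = %s", (key,))
        cur.fetchall()
        # Now apply the replicated statement; most of the pages it needs are hot.
        cur.execute(update_sql, params)
        replica.commit()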

Lustre File System by Peter Braam
– 100 people, virtual company, no VC $
– runs on 15% of top500 fastest computers, 6 of top 10
– features: POSIX, HA, security, modifiable, multi data center (future)
– huge variety of hardware in use with Lustre
– stripes files across many servers running "ldiskfs" file system
– ldiskfs is just a Linux ext4 partition with special use (Solaris ZFS version coming soon)
– they do a lot of development work with ext4 community, and soon possibly porting ZFS to Linux
– 50% of Lustre code is networking related; can 90% saturate a GigE link (118 MB/s)
– future: multiple geographically dispersed Lustre clusters that talk amongst each other
– concurrently writing multiple files on older ext3 yields poor IO; newer ext3 (2.6.x kernels) and now ext4 fix this
– new block allocator in ext4 (almost in kernel) focused on dealing with writing small files in 1-4MB chunks; similar optimizations in ReiserFS but without the unnecessarily complicated b-tree madness
– large memory and caching on meta data servers is good
– clients get immediate "success" for writes, but in the background wait for commit callback else replay the transaction
– meta data stored on modified file system, hashes and attributes in file path make it as fast as normal file access
– most customers are the US Dept. of Defense, with a large mix of random and sequential IO
– we can buy a large Lustre cluster for only $100 million
– the future throughput and scaling problems that the US Govt wants Lustre to solve are amazing

VeriSign’s Global DNS Infrastructure by Patrick Quaid
– overview of how recursive DNS queries work
– custom DNS app called ATLAS (Advanced Transaction Look-up and Signaling)
– .com/.net registry is stored in traditional replicated Oracle database
– DNS system not tied to registry database
– load balancing: BGP IP anycast (a single IP routed to 30+ locations), plus local load balancing
– fanatical about simplicity: everything is modular, small, isolated
– no spare servers or spare sites; every server takes load
– plan for 100 physical sites, mostly small; they can quickly remove sites from the anycast and frequently do
– have way more servers than are needed
– claims to have second DNS server code running live next to ATLAS in case ATLAS dies. I don’t believe him… probably just spinning the fact that their legacy code hasn’t been shut off yet, but they’d never fail back to it.
– 4 million updates per day; updates pushed to all servers every 15 seconds
– updates happen via a master node that pulls updates, crunches them, then ships a file to one node at each data center; each data center’s main node then updates the other local nodes
– monitoring GUI: 265K queries per second (on this Saturday); noon eastern time has peak traffic; shows top query domains; lag time for updates
– ATLAS also runs managed DNS, cert revocation, WHOIS for .com
– they have "VIP Domains" that get a lot of traffic and have extra security precautions
– resolver architecture has front-end server that receives/filters/bundles queries and sends them to backend ATLAS servers

Scalable Test Selection Using Source Code Deltas by Ryan Gerard (Symantec)
– the audio was off by 30 sec on this video and it seemed less interesting to me than the others, so I didn’t watch it.

My Amazon S3 slides from ISPCON

On Wednesday I gave a presentation at ISPCON in Orlando, covering how we built an email data backup system on top of Amazon S3.

Thanks Mark, for coming along with me as my technical backup…  and for leading this project.

Long SPF record

We recently added a new IP range for our San Antonio data center to our SPF record, which many of our customers’ domains reference.  This, however, increased the string length to 134 characters, and tinydns splits the string when it is larger than 127 characters.  So our SPF record looked like this…

emailsrvr.com txt "v=spf1 ip4:207.97.227.208/28 ip4:207.97.245.0/24 ip4:204.119.252.0/24 ip4:206.158.104.0/22 ip4:64.49.219.0/28 ip4:66.216.121.0/" "24 ~all"

Technically this is okay, because receivers are supposed to concatenate these multiple strings together.  However, apparently not all SPF implementations are this smart, including the popular SPF checker at DNSStuff.com.  So we are changing the record to make these systems happy.

There are two IP ranges that we no longer use, which we are taking out of the record… putting the new string length at 92.  But as we grow I am sure we will approach 127 chars again.  At that point we will probably break our SPF record into a nested set of includes, like hotmail does…

Top: hotmail.com txt "v=spf1 include:spf-a.hotmail.com include:spf-b.hotmail.com include:spf-c.hotmail.com include:spf-d.hotmail.com ~all"

Example include: spf-a.hotmail.com txt "v=spf1 ip4:209.240.192.0/19 ip4:65.52.0.0/14 ip4:131.107.0.0/16 ip4:157.54.0.0/15 ip4:157.56.0.0/14 ip4:157.60.0.0/16 ip4:167.220.0.0/16 ip4:204.79.135.0/24 ip4:204.79.188.0/24 ip4:204.79.252.0/24 ip4:207.46.0.0/16 ip4:199.2.137.0/24 ~all"
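
If/when we do go the nested-include route, the split is easy to automate.  Here is a rough Python sketch, not our production tooling; the spf-a/spf-b naming and the exact length budget are just for illustration:

    MAX_STRING = 127   # keep each TXT string under the length that trips up tinydns

    def split_spf(domain, mechanisms, max_len=MAX_STRING):
        """Return {record_name: record_text} for a top record plus spf-a, spf-b, ..."""
        chunks, current = [], []
        for mech in mechanisms:
            candidate = " ".join(["v=spf1"] + current + [mech, "~all"])
            if len(candidate) > max_len and current:
                chunks.append(current)
                current = [mech]
            else:
                current.append(mech)
        chunks.append(current)

        records, includes = {}, []
        for i, chunk in enumerate(chunks):
            name = "spf-%s.%s" % (chr(ord("a") + i), domain)
            includes.append("include:" + name)
            records[name] = "v=spf1 " + " ".join(chunk) + " ~all"
        records[domain] = "v=spf1 " + " ".join(includes) + " ~all"
        return records

    for name, txt in split_spf("emailsrvr.com", [
        "ip4:207.97.227.208/28", "ip4:207.97.245.0/24",
        "ip4:204.119.252.0/24", "ip4:206.158.104.0/22",
    ]).items():
        print(name, 'txt "%s"' % txt)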