From BIND to dnscache

After several years of running BIND9 on our DNS caching servers, we
have finally ditched it and switched to D. J. Bernstein’s dnscache.  On an average day
our SMTP and spam filtering servers send 1400 queries per second to
each of our DNS servers during peak hours.  We made the switch because, as we’ve grown, we have seen more reliability, performance and general weirdness
issues with BIND.

Most notably, when the BIND cache would reach about 250 MB, its
performance deteriorated noticeably.  It would respond slowly and even
drop queries.  I have heard this is caused by BIND’s internal data
structures not efficiently getting rid of old cache records.  Instead
BIND tries to cache every record until it expires and when it does
reach some internally calculated limit, BIND starts to discard new
cache records instead of old records.  This causes the server’s
performance to take a nose dive, and causes our pagers to go off…  Time to
run "service named stop; sleep 2; service named start" again!

Also, BIND didn’t efficiently cache records from our rbldnsd servers
behind it.  We could never really figure out why so many requests were
reaching rbldnsd and not hitting the BIND cache.  Now with dnscache, we
have a good view of exactly what it is doing and have fine tuned the
SOAs in rbldnsd so that dnscache caches our spam DB lookups exactly how
we want it.  No more weirdness going on behind the scenes.

Mr. Bernstein has a lot of nasty things to say about BIND.  Don’t believe all of his hype, but do trust the fact that DJB’s code is
much simpler, more reliable and possibly more powerful than BIND.  BIND
is overkill for almost every use.  It tries to be all things for all
systems, whereas DJB keeps things simple and provides a different server for
each purpose.  I
like simplicity.  The install process was a little bit awkward on a Linux system though,
with daemontools and the stupid errno patch.

FYI our dnscache servers are AMD Athlon 3200s with 1GB RAM, and they
each are handling 1400 queries per second using only 15% CPU.  Currently
we have a 100 MB cache size, but we are still tuning that.

Thanks Sober.Y

The largest virus outbreak ever occurred again this week.  Our cluster of ClamAV servers handled it just fine, but that is not what this post is about.  Virus outbreaks like this cause a swarm of backscatter mail.  Backscatter is when innocent mailboxes are flooded with undeliverable mail notifications because viruses forge random sender addresses.  To combat this, we discard virus notifications from other servers, because ALL of them are bogus.  Viruses forge the sender address…  So all you mail administrators out there – stop bouncing this crap back to the sender!

Blocking virus bounces based on Subject and other headers has worked reasonably well, but it does not block backscatter from simple SMTP rejects, because those bounces don’t contain pretty "you sent a virus" subject lines.  It also doesn’t block backscatter caused by bounced spam, because spammers love to forge sender addresses as well.

The high volume of backscatter seen this week caused us to look deeper for a way to block this stuff.  And we found it…  We now tag as spam any bounce message where the original email was not sent from our email system.

The rule is: (1) if the email is from a null sender, and (2) if the email has a bounce-style Subject or From header (such as "Subject: Undeliverable Mail" or "From: Mail Delivery Subsystem"), and (3) if the body of the message does not contain our servers’ Received headers – tag it as spam.
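
The three conditions above can be sketched in Python roughly as follows.  This is only an illustration of the rule as described, not our production filter: the pattern lists and the host name mx1.example.net are hypothetical placeholders, and the check assumes the bounce quotes the original headers unindented in its body.

```python
import re
from email import message_from_string

# Hypothetical values for illustration; the real patterns and host names
# are not listed in this post.
BOUNCE_SUBJECTS = ("undeliverable mail",)
BOUNCE_FROMS = ("mail delivery subsystem",)
OUR_HOSTS = ("mx1.example.net",)

# A Received header plus any folded continuation lines that follow it.
RECEIVED_RE = re.compile(r"^Received:.*(?:\n[ \t].*)*", re.MULTILINE | re.IGNORECASE)


def is_backscatter(envelope_sender: str, raw_message: str) -> bool:
    """Return True if the message should be tagged as spam under the rule."""
    # (1) The envelope sender must be the null sender.
    if envelope_sender.strip() not in ("", "<>"):
        return False

    text = raw_message.replace("\r\n", "\n")
    msg = message_from_string(text)

    # (2) The Subject or From header must look like a bounce.
    subject = (msg.get("Subject") or "").lower()
    sender = (msg.get("From") or "").lower()
    looks_like_bounce = any(s in subject for s in BOUNCE_SUBJECTS) or any(
        f in sender for f in BOUNCE_FROMS
    )
    if not looks_like_bounce:
        return False

    # (3) The body must NOT quote a Received header added by our servers.
    body = text.partition("\n\n")[2]
    for received in RECEIVED_RE.findall(body):
        if any(host in received.lower() for host in OUR_HOSTS):
            return False  # the original really passed through our system
    return True
```

A bounce that quotes a Received line from one of our hosts is left alone, while one that quotes only foreign Received lines (or none at all) is tagged.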

Now our email hosting customers have even cleaner inboxes, and you can thank Sober.Y.

Logic Behind Our Server Names

At Webmail.us we name all of our servers using hostnames that fall under a very generic domain name.  I am not going to list that domain name here so that Google doesn’t index it in this blog post, but I’m sure you could figure it out if you wanted to.

The reason we use such a generic domain is that a large percentage of our business comes through resellers.  And most of our resellers want their customers to think that the email service is entirely owned and operated by them – in fact we encourage this.  Our resellers own the relationship with their customer, and we have no desire to interfere with that.

So we help them hide the fact that Webmail.us is powering their email hosting service by using generic server names everywhere, even in our reverse DNS.  And if you were to check the WHOIS database for this domain you would see that all of the contact info references a third-party holding company named "Whois Privacy Protection Service, Inc."

Pretty slick, huh?

Postfix and Domain Aliases

If you run Postfix, you probably know that it is the best SMTP server on the planet.  I bet it does everything you want plus more, and you know that it has very few shortcomings.  If so, I agree with you.

However, there is one Postfix shortcoming that has been on our radar to solve for quite some time.  Postfix cannot reject unknown users within domains that are an alias for another domain (i.e. domain aliases).  Postfix stubbornly accepts mail for any random email address within a domain alias, and only after it has incorrectly accepted the bogus email does it bounce it back to the sender.  This is because Postfix’s smtpd address validation occurs prior to address rewrites, and thus it is only an approximation.  This is a problem because a directory harvest attack on a domain alias can fill up your mail queues with bounces.

Here is how Korey solved this problem last week:

Our Postfix MySQL database structure is very simple.  It contains valid addresses, aliases, domains and domain aliases, mapped to their real address.  Here is a simplified example:

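+--------------------+--------------------+
| email              | virtual            |
+--------------------+--------------------+
| support@webmail.us | support@webmail.us |
+--------------------+--------------------+
| @webmailus.com     | @webmail.us        |
+--------------------+--------------------+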

In the past, this would cause our system to accept mail for junk addresses at our domain alias such as "asdf@webmailus.com".  By modifying the virtual_alias_maps query as follows, our system now rejects these invalid addresses as it should.  Maybe one day Postfix will do address rewrites prior to address validation, which would allow us to remove this complex SQL query.  Until then, this hack works great for us.  It may help you too…

/etc/postfix/main.cf:
  virtual_alias_maps =
    proxy:mysql:/etc/postfix/mysql_virtual_gate.cf,

/etc/postfix/mysql_virtual_gate.cf:
  query = select virtual from users where email = '%s'
      and left(virtual,1) <> '@'
    union
    select virtual from users where email = (
      select concat(left('%s',locate('@','%s')-1), virtual) as rewritten
      from users where email = '@%d' and left(virtual,1) = '@'
    ) and left(virtual,1) <> '@'
    limit 1
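
To make the logic of that query easier to follow, here is a rough Python model of what it returns for a given lookup key.  It is only a sketch: the dictionary stands in for the users table from the simplified example above (assuming one row per email value), and the function name lookup is made up for illustration.

```python
# Rows from the simplified example table: email -> virtual.
USERS = {
    "support@webmail.us": "support@webmail.us",
    "@webmailus.com": "@webmail.us",
}


def lookup(address, users=USERS):
    """Model of the query: return the real address, or None if unknown."""
    local, _, domain = address.partition("@")

    # First SELECT: an exact match whose target is a real address,
    # not a domain-alias row (those have a target starting with '@').
    target = users.get(address)
    if target is not None and not target.startswith("@"):
        return target

    # Second SELECT: if the domain is an alias, rewrite the address onto
    # the real domain and require that the rewritten address exists.
    alias_target = users.get("@" + domain)
    if alias_target is not None and alias_target.startswith("@"):
        rewritten = local + alias_target
        target = users.get(rewritten)
        if target is not None and not target.startswith("@"):
            return target

    # No row: the lookup yields no result for this key.
    return None
```

With the example rows, lookup("support@webmailus.com") resolves to the real mailbox, while lookup("asdf@webmailus.com") and the bare catch-all key "@webmailus.com" return nothing, so the domain alias no longer matches every possible recipient.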

I couldn’t ask for a better email hosting system. Well almost :)

Today we had our quarterly board meeting.  One of the items discussed was our accomplishments with the recent upgrades to our email hosting system, as well as our plans for what’s to come.  Here are some details…

We started the design for these upgrades towards the end of last year.  The entire process took about 12 months, and it was completed on November 11.  The major objectives of the upgrades were:

  • to provide complete redundancy for every front-end and back-end application;
  • to split all applications into independent server clusters;
  • to reduce mail delivery times to the sub-ten-second range, during all hours of the day;
  • to simplify how we scale each application;
  • to build everything using commodity hardware;
  • and most importantly, to migrate customers to the new architecture with minimal impact.

We hit all of these objectives dead on, and we are very proud of it.  We rolled out the upgrades much more slowly than originally planned; however, that gave us time to measure system performance and make adjustments prior to bringing a large number of customers onto the new system.

And today [insert this post’s title here].

We have already begun our next two infrastructure initiatives:

  1. Measure Everything.  This project will enable us to track historical metrics for over 170 data points within our system.  We have codenamed this project "Rockefeller".  Have you ever heard this quote?
     "Measure everything of significance.  I swear this is true.  Anything that is measured and watched, improves."

  2. Extremely fast, virtually unlimited storage.  We were one of the first business email companies to offer gigabyte mailboxes, but we know you want more.  The demands placed on the inbox are growing exponentially, and email wasn’t designed to be used in the manner in which people use it today.  A lot of companies try to offer bigger and bigger mailboxes, but the performance of those mailboxes suffers.  This is great news for companies who can solve this problem.  We feel we have an awesome approach, and it is codenamed "Mercury".