[Postfixbuch-users] Hochverfügbares Mailsystem
stepken
stepken at web.de
Sun Feb 8 05:25:52 CET 2009
Very interesting is the "slots and stores" system used by Fastmail.fm,
a very, very large provider with excellent performance and several
hundred million mailboxes. It shows how to do it right: a system that
really scales, is highly available, and does not collapse under load
the way a RAID5, RAID10 or similar setup can... The approach is also
suitable for other highly available server systems. GlusterFS is also
very interesting.
Have fun, Guido Stepken
Rob Müller of Fastmail.fm on this:
We don't use a murder setup, for two main reasons: 1) Murder wasn't
very mature when we started. 2) The main advantage murder gives you is
a set of proxies (imap/pop/lmtp) to connect users to the appropriate
backends, which we ended up using other software for, and a unified
mailbox namespace if you want to do mailbox sharing, something we
didn't really need either. Also, the unified mailbox namespace needs a
global mailboxes.db somewhere. As it was, because the skiplist backend
mmaps the entire mailboxes.db file into memory, and we had multiple
machines with 100M+ mailboxes.db files, I didn't really like the idea
of dealing with a 500M+ mailboxes.db file.
We don't use shared SAN storage. When we started out we didn't have
that much money, so purchasing an expensive SAN unit wasn't an option.
What we have has evolved over time into our current setup. Basically we
now have a hardware set that is quite nicely balanced with regard to
spool IO vs metadata IO vs CPU, and a storage configuration that gives
us replication with good failure handling, but without having to waste
lots of hardware on machines that are only replicas.
IMAP/POP frontend - We used to use perdition, but have now changed to
nginx (http://blog.fastmail.fm/?p=592). As you can read from the linked
blog post, nginx is great.
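To illustrate how such a frontend routes users (a sketch only, not
Fastmail's configuration): nginx's mail proxy asks an auth_http service
which backend a given login lives on. A minimal Python responder for
that protocol could look like the following, where the listening port
and the user-to-backend map are made up for illustration:

# Hypothetical sketch (not Fastmail's code): a minimal auth_http backend
# for nginx's mail proxy. nginx sends the login as Auth-User/Auth-Pass
# headers and expects Auth-Status/Auth-Server/Auth-Port in the reply,
# which tell it which IMAP/POP backend to connect the user to.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed mapping of users to backend servers; a real setup would
# query a user database instead.
USER_BACKEND = {"alice@example.com": ("10.0.0.11", 143)}

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user = self.headers.get("Auth-User", "")
        backend = USER_BACKEND.get(user)
        self.send_response(200)
        if backend is None:
            self.send_header("Auth-Status", "Invalid login or password")
        else:
            host, port = backend
            self.send_header("Auth-Status", "OK")
            self.send_header("Auth-Server", host)
            self.send_header("Auth-Port", str(port))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9143), AuthHandler).serve_forever()

nginx then opens the IMAP/POP connection to whatever Auth-Server and
Auth-Port the service returns, so the per-user routing logic stays in
that one small service.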
LMTP delivery - We use a custom-written Perl daemon that forwards LMTP
deliveries from Postfix to the appropriate backend server. It also does
the spam scanning, virus checking and a bunch of other in-house stuff.
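The routing idea can be sketched in Python (the real daemon is in-house
Perl and also does the scanning mentioned above); the hostname, port
and lookup table below are assumptions for illustration only:

# Rough sketch only, not the in-house daemon: route a message to the
# backend that hosts the recipient's mailbox and hand it over via LMTP.
import smtplib

# Assumed recipient-to-backend map; the real daemon would query a
# user database and run spam/virus checks before delivery.
RCPT_BACKEND = {"alice@example.com": ("backend1.internal", 2003)}

def deliver(sender, rcpt, message_bytes):
    host, port = RCPT_BACKEND[rcpt]          # which store holds this user
    with smtplib.LMTP(host, port) as lmtp:   # LMTP, not SMTP, to the backend
        lmtp.sendmail(sender, [rcpt], message_bytes)

deliver("bob@example.org", "alice@example.com",
        b"From: bob@example.org\r\nSubject: hi\r\n\r\nhello\r\n")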
Servers - We use servers with attached SATA-to-SCSI RAID units with
battery-backed caches. We have a mix of large drives for the email
spool, and smaller, faster drives for metadata. That's the reason we
sponsored the metapartition config options
(http://cyrusimap.web.cmu.edu/imapd/changes.html).
Replication - We initially started with pairs of machines, with half
of each machine holding masters and half holding replicas, the two
replicating to each other. But that meant that on a failure, one
machine became fully loaded with masters, and masters take a much
bigger IO hit than replicas. Instead we went with a system we call
"slots" and "stores". Each machine is divided into a set of "slots".
"Slots" from different machines are then paired as a replicated
"store" with a master and a replica. So say you have 20 slots per
machine (half master, half replica) and 10 machines; then if one
machine fails, on average you only have to distribute one more master
slot to each of the other machines. Much better on IO. Some more
details in this blog post on our replication trials...
http://blog.fastmail.fm/?p=576
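A small toy model of that arithmetic (my assumptions, not Fastmail's
tooling): with 10 machines, 10 master slots each, and every replica
slot placed on a different machine, losing one machine promotes roughly
one replica per surviving machine:

# Toy model of "slots and stores": pair master/replica slots across
# machines, then count how many masters each surviving machine picks
# up when one machine fails.
import itertools
from collections import Counter

MACHINES = [f"m{i}" for i in range(10)]   # 10 machines
MASTERS_PER_MACHINE = 10                  # 20 slots each: 10 master, 10 replica

# Build stores: each master slot gets its replica on some *other*
# machine, spread round-robin over the remaining machines.
stores = []
for m in MACHINES:
    others = itertools.cycle(x for x in MACHINES if x != m)
    for _ in range(MASTERS_PER_MACHINE):
        stores.append({"master": m, "replica": next(others)})

failed = "m0"
# On failure, every store whose master was on the failed machine
# promotes its replica to master.
promoted = Counter(s["replica"] for s in stores if s["master"] == failed)
print(promoted)   # roughly one extra master per surviving machine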
Yep, this means we need quite a bit more software to manage the setup,
but now that it's done, it's quite nice and works well. For maintenance,
we can safely fail all masters off a server in a few minutes, about
10-30 seconds a store. Then we can take the machine down, do whatever we
want, bring it back up, wait for replication to catch up again, then
fail any masters we want back onto the server.
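As a purely hypothetical sketch of that drain procedure (the real
tooling is in-house and unreleased), the per-store failover can be
thought of as: wait for the replica to catch up, then swap roles:

# Hypothetical sketch of the maintenance drain described above; these
# Store objects are simulated stand-ins for the unreleased tooling.
import time

class Store:
    def __init__(self, name, master_host, replica_host):
        self.name = name
        self.master_host = master_host
        self.replica_host = replica_host
        self.replica_lag = 0   # pretend replication is already caught up

    def fail_over(self):
        # Wait for the replica to catch up, then swap roles (roughly
        # 10-30 s per store in the real system; instant here).
        while self.replica_lag > 0:
            time.sleep(1)
        self.master_host, self.replica_host = self.replica_host, self.master_host

def drain_machine(machine, stores):
    """Move every master off `machine` so it can be taken down safely."""
    for store in stores:
        if store.master_host == machine:
            store.fail_over()

stores = [Store("s1", "imap1", "imap2"), Store("s2", "imap2", "imap1")]
drain_machine("imap1", stores)
print([(s.name, s.master_host) for s in stores])   # no masters left on imap1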
Unfortunately most of this software is in-house and quite specific to
our setup. It's not very "generic" (e.g. it assumes particular disk
layouts and sizes, machines, database tables, hostnames, etc.) in how
it manages and tracks everything, so it's not something we're going to
release.
Rob
More information about the Postfixbuch-users mailing list