DreamHost is launching several public cloud offerings, and along the way
we have learned a lot. The goal of this talk is to share some of the
lessons, tips, and tricks we have picked up while designing public cloud
architectures.
Talk Outline and Notes:
The problems: Scale, Speed, Monitoring, Uptime, Security, Cost.
The domains: Networking, Storage, Hypervisors.
Scale: The pervasive problem. There are obvious issues (data center
size, network switching architecture, etc.), but there are also the
not-so-obvious problems: DNS zone sizes and rebuild times, ARP/ND table
sizes, growing beyond Ethernet VLANs, and the way small delays multiply
at scale. IPv6 is a requirement, not a nice-to-have, because we're out
of IPv4 addresses.
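
The point about small delays can be made concrete with a rough
back-of-the-envelope sketch. The host count, per-host delay, and the
helper name `total_delay_hours` are all made up for illustration; they
are not DreamHost measurements or tooling.

```python
def total_delay_hours(hosts: int, delay_seconds: float, parallelism: int = 1) -> float:
    """Wall-clock hours spent when every host pays the same small delay.

    `parallelism` is how many hosts are handled at once (a simplifying
    assumption: work is split into equal-sized batches).
    """
    if hosts < 0 or delay_seconds < 0:
        raise ValueError("hosts and delay_seconds must be non-negative")
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    batches = -(-hosts // parallelism)  # ceiling division
    return batches * delay_seconds / 3600.0


# Hypothetical numbers: a 2-second delay on each of 10,000 hosts.
serial = total_delay_hours(10_000, 2.0)
parallel = total_delay_hours(10_000, 2.0, parallelism=50)
print(f"serial: {serial:.2f} h, 50-way parallel: {parallel:.2f} h")
```

Under these assumed numbers a delay nobody would notice on one machine
costs more than five hours when run serially across the fleet.
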
Speed: Disk I/O, Memory I/O, Network I/O, and CPU time all matter.
Performance cannot be cloud-washed any longer. Beyond the user-focused
problems there are pressing concerns inside the provider as well: how
fast can you expand? Automation is a requirement here; it may cause
some initial delays, but it will pay off long term.
Monitoring: Start with simple service monitoring via Nagios, then go
deeper. Agents on everything. "Graph all the things" (we use
Graphite).
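
As a minimal sketch of what feeding Graphite can look like, the snippet
below formats a data point in Graphite's plaintext protocol (metric
path, value, Unix timestamp, newline) and can send it to a Carbon
listener over TCP. The metric name and the host
`graphite.example.com` are placeholders, port 2003 is Carbon's default
plaintext port, and none of this is taken from DreamHost's actual
agents.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one newline-terminated line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line: str, host: str = "graphite.example.com", port: int = 2003) -> None:
    """Send one formatted line to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric name; printed rather than sent so the sketch runs anywhere.
print(format_metric("cloud.hv01.disk.read_latency_ms", 4.2, 1700000000), end="")
```

Calling `send_metric(format_metric(...))` from a periodic agent is one
simple way to get a value onto a graph; batching several lines per
connection would be the obvious next step.
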
Uptime: Decouple everything. Have multiple paths. Maintenance windows
are a thing of the past. HA is no longer optional; it's a requirement.
DevOps is the start, but planning and testing are critical.
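
To illustrate why multiple paths matter, here is a small sketch of the
standard availability formula for redundant components,
1 - (1 - a)^n. It assumes path failures are independent, which
real networks only approximate, and the 99% figure is a made-up input.

```python
def combined_availability(path_availability: float, paths: int) -> float:
    """Availability of `paths` redundant paths, assuming independent failures."""
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("path_availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("paths must be at least 1")
    # The service is down only when every path is down at the same time.
    return 1.0 - (1.0 - path_availability) ** paths


MINUTES_PER_YEAR = 365 * 24 * 60

# Hypothetical figure: each individual path is up 99% of the time.
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    downtime = (1.0 - a) * MINUTES_PER_YEAR
    print(f"{n} path(s): {a:.6f} available, about {downtime:.1f} min/year down")
```

Correlated failures (shared power, shared switches, shared software
bugs) break the independence assumption, which is one reason decoupling
everything goes hand in hand with having multiple paths.
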
Security: IPv6 is not IPv4 with more addresses; it solves some common
problems (ND replaces ARP) and introduces new ones (RA).
Standard shared hosting/colo security models no longer work. The barrier
to entry in the cloud is a few cents, not a contract and an account
manager. Providers have to be proactive about security: spam, traffic
patterns, and bad money.
Cost: Forget everything you know about traditional storage. NFS, iSCSI,
SAN: these all do too little for too much money. Things like Ceph and
other open source technologies are game changers. Don't jump to the
other end of the pool either: consumer SATA brings a whole different
world of pain (time-outs and retries cause massive and hidden
performance degradation). We're going for the middle of the road:
Enterprise SATA and Enterprise SAS.
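
A rough sketch of why time-outs and retries are so damaging: even a
tiny fraction of stalled requests dominates the average. The latency
figures and stall rate below are invented for illustration, not
measurements from any particular drive.

```python
def mean_latency_ms(normal_ms: float, stall_ms: float, stall_fraction: float) -> float:
    """Average latency when a fraction of requests stall on time-outs or retries."""
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    # Weighted average of the fast path and the stalled path.
    return (1.0 - stall_fraction) * normal_ms + stall_fraction * stall_ms


# Hypothetical numbers: 10 ms normal I/O, 7-second stalls, 1 request in 1,000.
healthy = mean_latency_ms(10.0, 7000.0, 0.0)
degraded = mean_latency_ms(10.0, 7000.0, 0.001)
print(f"healthy: {healthy:.2f} ms, degraded: {degraded:.2f} ms "
      f"({degraded / healthy - 1:.0%} slower on average)")
```

With these assumed inputs, one stalled request in a thousand lifts the
mean from 10 ms to about 17 ms, while most individual requests still
look healthy. That is what makes the degradation hidden.
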
We are trying to open a new world in cloud computing: open tech, open
standards, and talking in the open. We think cloud should no longer be a
black box, and we're willing to talk about it on the record.