This operation is typically split between two different
groups. One group uses data for presentation layers and to
feed live requests; the other does data analytics on
traffic, etc. In some cases a third group will also exist
to do work for "near time" responses. That data is used to
handle DoS attacks and other security-related concerns.
- Unstructured (Images, Sound, etc) Serving
- Graph (Social Network information)
- System Image Backup System (this serves backups and possibly deployment for Jumpstarts)
- Image/Video Converters/Transcoding
- XML Builders (RSS/etc)
- Graph Rebuilds
- Stats building
The trick to scaling is to make actions asynchronous. Typically
queues are used to set up jobs such as sending email, transforming
images, and harvesting text. These fall into jobs that "must be
done" and jobs where "if we lose it, it does not matter". Queueing
is used for incoming data and serves as a governor for most
systems (i.e. it prevents self-inflicted DoS).
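The bounded-queue pattern above can be sketched as follows. This is a minimal in-process illustration, not any particular queue product; the job classes, handler, and size limit are all assumptions chosen to show the "must be done" vs. "lose it, it does not matter" split and the governor effect of a full queue.

```python
import queue

# Hypothetical job classes matching the two durability tiers above.
CRITICAL = "critical"        # e.g. sending email: must be done
BEST_EFFORT = "best_effort"  # e.g. harvesting text: droppable

# A bounded queue acts as the governor: when it is full, producers
# are refused instead of the system drowning in its own work.
jobs = queue.Queue(maxsize=100)

def process(payload):
    """Stand-in for real work: image transform, mail send, etc."""
    pass

def enqueue(job_class, payload):
    """Producer side: returns False rather than blocking when full."""
    try:
        jobs.put_nowait((job_class, payload))
        return True
    except queue.Full:
        # A critical job would be persisted and retried elsewhere;
        # a best-effort job is simply dropped.
        return False

def worker():
    """Consumer loop: retry critical jobs, discard best-effort failures."""
    while True:
        job_class, payload = jobs.get()
        try:
            process(payload)
        except Exception:
            if job_class == CRITICAL:
                jobs.put((job_class, payload))  # naive retry
        finally:
            jobs.task_done()
```

In practice the queue would be a separate service (and critical jobs would be persisted to disk before acknowledgement), but the producer-side refusal is the part that prevents self-inflicted DoS.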
- Email (typically bulk sending/requeueing)
- Sharding Infrastructure
- HA Solution routing.
Traffic routing to the correct software nodes (or static content
nodes). This layer will also typically handle shuttling SSL
traffic to different backends (see the Pound server as an
example), via either Cisco- or Linux-style routing. The big key with this is
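The routing decision described above can be sketched in a few lines. This is an illustrative hash-based router, not Pound's or any vendor's actual logic; the pool names and the `/static/` prefix are assumptions. Static requests go to the static pool, and everything else is pinned to an app node by hashing the session so repeat requests land on the same backend.

```python
import hashlib

# Hypothetical backend pools; the node names are illustrative only.
STATIC_NODES = ["static1:80", "static2:80"]
APP_NODES = ["app1:8080", "app2:8080", "app3:8080"]

def pick_backend(path, session_id):
    """Route static content to the static pool, everything else to an
    app node chosen by a stable hash of the session id."""
    pool = STATIC_NODES if path.startswith("/static/") else APP_NODES
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return pool[int(digest, 16) % len(pool)]
```

A real deployment would use consistent hashing so that adding or removing a node remaps only a fraction of sessions, but the modulo version shows the shape of the decision.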
- Page Caches
- Object Caches
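The distinction between the two cache types can be shown with a tiny in-process sketch. This stands in for something like memcached and is not a real implementation: a page cache would key whole rendered pages by URL, while an object cache keys individual objects or query results by id; the TTL handling below is the assumed expiry policy.

```python
import time

class ObjectCache:
    """Minimal TTL cache. Used as a page cache the keys are URLs and
    the values rendered pages; used as an object cache the keys are
    object ids and the values individual objects/query results."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl=60):
        # Store the value along with its absolute expiry time.
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.time() > expires:
            # Lazily evict expired entries on read.
            del self._store[key]
            return None
        return value
```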
Asset software for deployment (Debian/RHEL package repositories).
Puppet/CFEngine deployment systems.
XMPP is the favored solution at this point.
Typically this is the reason MySQL has come into use (and
why systems like MogileFS, which handle replication, are also
commonly used). This should not be confused with local
replication, which is more of an HA/scale-out issue (see
Facebook's example using MySQL + Memcached as a geographical