The final day of QCon London 2014 is upon us, and as usual the morning after the night before is never easy. In my case, having both missed my planned train home (and got home an hour later as a result) and mislaid my hat, I decided to take a punt on this morning’s keynote on world 2.0. Apparently cheese was involved.
Lessons learned in scaling Twitter
Brian Degenhardt talked about some of the challenges of scaling Twitter. It was initially developed as a monolithic application, with a MySQL database holding everything and a Rails app fronting the code. Breaking the system down into separate services meant that the components could be scaled independently of each other, at both the network layer and the development layer.
The second-generation Twitter split the monolith into different back-end data storage layers, using Gizzard as a sharding mechanism for MySQL databases and a Redis instance for storing timeline information. Subsequent iterations replaced the front-end components with services as well, which allowed more heavily used services (such as the timeline and user tweet details) to be scaled horizontally when needed.
Fortunately, when a request comes in to Twitter it is composable into different services, and by using non-blocking futures the results of a series of lookups can be composed into one chain. Many lookups execute in parallel, but because that is hidden by the underlying futures framework the code remains easy to understand whilst still taking advantage of multiple threads.
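As a hedged sketch of that idea: Twitter's own stack composes non-blocking futures on the JVM (Finagle), but the same shape can be illustrated with Java's CompletableFuture. The service calls, types and names below are hypothetical placeholders, not Twitter's actual code.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Illustration only: hypothetical asynchronous lookups composed into one chain.
class TimelineComposer {

    // Pretend RPC: fetch the ids of the tweets making up a user's timeline.
    CompletableFuture<List<Long>> timelineIds(long userId) {
        return CompletableFuture.completedFuture(Arrays.asList(1L, 2L, 3L));
    }

    // Pretend RPC: fetch the text of a single tweet.
    CompletableFuture<String> tweetText(long tweetId) {
        return CompletableFuture.completedFuture("tweet " + tweetId);
    }

    // The composition reads top to bottom, but the per-tweet lookups run in
    // parallel and the futures framework hides the threading details.
    CompletableFuture<List<String>> timeline(long userId) {
        return timelineIds(userId).thenCompose(ids -> {
            List<CompletableFuture<String>> texts = ids.stream()
                    .map(this::tweetText)
                    .collect(Collectors.toList());
            return CompletableFuture.allOf(texts.toArray(new CompletableFuture[0]))
                    .thenApply(done -> texts.stream()
                            .map(CompletableFuture::join)
                            .collect(Collectors.toList()));
        });
    }
}

The calling code never blocks on an individual lookup; it only describes how the results combine once they arrive.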
When a write to Twitter happens, it is split into a lookup of all the subscribers followed by a write of a copy of the tweet into each of their timelines. This involves a fan-out (a lookup of the affected destinations) and then a sharded write across the database. For providers subscribed to the firehose, the feed goes to them as well – around 30Mb/s in total. In the case of the Lady Gaga tweet replying to the Mars rover, this resulted in a set conjunction of around 1.5 million and 40 million followers, which took some time in the fan-out processing.
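A rough sketch of that write path is below; the interfaces and names are invented for illustration rather than taken from Twitter's code.

import java.util.List;

// Hypothetical fan-out on write: one follower lookup, then one sharded
// timeline append per follower.
class FanOutWriter {

    interface SocialGraph { List<Long> followersOf(long userId); }
    interface TimelineStore { void append(long followerId, long tweetId); }

    private final SocialGraph graph;
    private final TimelineStore timelines;

    FanOutWriter(SocialGraph graph, TimelineStore timelines) {
        this.graph = graph;
        this.timelines = timelines;
    }

    // A single incoming tweet becomes many writes, one per subscriber.
    void write(long tweetId, long authorId) {
        for (long followerId : graph.followersOf(authorId)) {
            timelines.append(followerId, tweetId);
        }
    }
}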
In terms of measuring the performance of Twitter, once the data stream moves into the tens of errors per second (a 99.999% success rate at 300,000 requests per second), it’s impossible to focus on individual events in the underlying analysis. By focussing on a statistical view of the events, it’s possible to get a much better picture of overall system health than by decoding any single event.
To that end, Twitter uses a distributed tracing system called Zipkin (demo), which performs tracing throughout the lifetime of a request. It is based on Google’s Dapper paper, which was published but never open-sourced; as a result, Twitter re-implemented the system from the contents of the paper and open-sourced it instead.
The underlying service calls use HTTP for external communication; once inside the boundary, Thrift is used for passing RPC messages, along with a standardised library that provides load balancing, service discovery, metrics generation and logging. These metrics are collected centrally and can be visualised as histograms, or as a view of the data points with percentiles that shows when the system is suffering under higher load.
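To make the percentile view concrete (a generic illustration, not Twitter's metrics library), a percentile is just a rank within the sorted latency samples collected over a window:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Generic illustration of computing a latency percentile from collected samples.
class Percentiles {

    // Returns the latency below which `pct` percent of the samples fall,
    // e.g. percentile(samples, 99.0) for the p99 over a window of requests.
    static long percentile(List<Long> samplesMillis, double pct) {
        if (samplesMillis.isEmpty()) {
            throw new IllegalArgumentException("no samples in window");
        }
        List<Long> sorted = new ArrayList<>(samplesMillis);
        Collections.sort(sorted);
        int index = (int) Math.ceil((pct / 100.0) * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}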
Twitter open-sources most of their code for high performance services via their GitHub repositories under an Apache license.
Scaling at Netflix
Ruslan Meshenberg discussed how Netflix leverages multiple regions to increase availability. In fact, he said that the talk was “really about failure” and how it’s handled, rather than anything else.
He showed a pyramid of failures: at the top, incidents that caused negative PR (mitigated by active/active pairs and game day practicing); below that, customer service incidents (mitigated by better tools and practices); then metrics impact (mitigated by data tagging and feature enabling/disabling); and at the base, failures handled by automated failover and recovery processes.
The key is how you respond to failure: with enough systems, eventually at least one will have a hardware-related failure with a disk or CPU dying, or a natural event such as a lightning strike will take out power to a data centre. It’s possible that human error will be involved (configuration mistakes, a bad code push etc.), but the point is to be able to recover quickly and painlessly from any events that may occur.
In the case of the Amazon ELB failure in which elastic load balancing configuration was lost, Netflix lost connectivity for a few hours on Christmas Eve 2012. To patch this immediately, Isthmus was created to front-end all services coming in to ELB and provide a backup level of routing should ELB fail again. This was a spike solution aimed solely at resilience for ELB-related traffic; it grew into Netflix Zuul, which provides load balancing and front-end re-routing for all of the (non-streaming) Netflix services. In conjunction with various DNS update layers fronted by denominator, which provides a cross-service DNS updating tool, any service can be taken off-line and recovered by updating either routing information or DNS entries to point to a different location (which typically have a 5-10 minute TTL).
The service information is stored in a replicated Apache Cassandra instance, which provides active/active regions (with each region being its own consistency group and eventual consistency between regions). This is fronted by a custom EVCache, which provides a Memcached API but with tools to permit remote eviction of cached information, and the soon-to-be-open-sourced Dynomite, which provides generic replication and cache invalidation processes.
To implement clients that provide this routing information as well as stats, the Ribbon client libraries and Karyon server libraries are used, which provide a generalised RPC mechanism. This is combined with the configuration information and routing provided by Asgard, and configuration updates from Archaius.
To fail fast in the event of a dependent service failure, a circuit breaker mechanism is used: the Hystrix library provides a means to measure the performance of individual services and to disconnect them if they go past a certain error threshold or delay. By failing fast, the system is able to detect that an error condition has occurred and re-route requests to a healthier part of the system.
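A minimal sketch of a Hystrix command wrapping a dependent service call might look like this; the command and group names are invented for the example.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a dependent service. Hystrix trips the circuit breaker and
// routes callers to getFallback() when the error rate or latency for this
// command group crosses the configured threshold.
public class UserProfileCommand extends HystrixCommand<String> {

    private final long userId;

    public UserProfileCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserProfileService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // The real remote call to the dependent service would go here.
        return "profile-for-" + userId;
    }

    @Override
    protected String getFallback() {
        // Returned immediately when the circuit is open or the call fails.
        return "default-profile";
    }
}

Callers run it with new UserProfileCommand(42).execute(); once the error threshold is crossed the circuit opens and the fallback is returned immediately rather than waiting on the failing dependency.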
To test whether this works, various monkeys are used to take out parts of the system, under the Simian Army umbrella. The original Chaos Monkey is used to take out individual nodes in the system, whilst the Chaos Gorilla can be used to take out sets of services or nodes. Should a whole zone need to be taken out, Chaos Kong will wipe out an entire zone to ensure that the system keeps working as expected – all of this running in production. After all, if failure is a natural event and the system has had practice recovering from it, then when an unplanned event occurs it is more likely to keep going in the same way.
Overview of Vert.x
Tim Fox gave an overview of Vert.x, describing it as a lightweight, reactive application platform superficially similar to Node.js and inspired by Erlang. However, unlike Node.js and Erlang, Vert.x is polyglot, with implementations and libraries in many different languages and a common communication bus for passing messages between them.
At its core lies an event bus that communicates with JSON messages and to which many individual single-threaded units called verticles subscribe. These can be written in any language supported by Vert.x; messages are passed between them as immutable data structures and so can be materialised into any supported format. In this sense it is similar to the Actor model, with no shared state between verticles.
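As a sketch using the Java API current at the time (Vert.x 2), a verticle that subscribes to an event bus address might look like the following; the address name is made up for the example.

import org.vertx.java.core.Handler;
import org.vertx.java.core.eventbus.Message;
import org.vertx.java.platform.Verticle;

// A minimal Vert.x 2-style verticle: it registers a handler on an event bus
// address and replies to each message it receives.
public class EchoVerticle extends Verticle {

    @Override
    public void start() {
        vertx.eventBus().registerHandler("example.echo", new Handler<Message<String>>() {
            @Override
            public void handle(Message<String> message) {
                // Messages arrive on a single event-loop thread; no locking needed.
                message.reply("echo: " + message.body());
            }
        });
    }
}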
Running a Vert.x program can be achieved with vertx run followed by a class or function that implements the verticle. For interpreted languages such as Groovy or JavaScript the VM starts and begins executing the code directly; for compiled languages like Java and Scala a compilation process is kicked off to compile the code ahead of time. In development, verticles can be reloaded dynamically to facilitate development and debugging.
The event bus can be distributed between different JVMs, enabled by running with the -cluster flag. This will discover other JVMs running on the same host (though cross-host clustering is also possible) and sync messages between them to allow for distributed processing.
It is also possible to sync the event bus to a web browser, using the JavaScript libraries to consume the event bus data over HTTP. Since it uses the same API but does not enforce any one particular style of user interface, it is possible to generate a UI using whatever framework or toolkit is desired.
Modules are also possible in Vert.x, using a mod.json descriptor file and zero or more resources or verticles. These can be run with vertx runmod, which will start up any entry points discovered. It is also possible to refer to modules stored in Maven Central, using a group~artifact~version syntax; for example, running vertx runmod io.vertx~hello-mod~1.0 will download the io.vertx/hello-mod/1.0/hello-mod-1.0.jar file and start executing its contents. It is also possible to generate a standalone application by running vertx fatjar, which will download all the dependencies and create a single jar containing the bootstrap library code and dependencies.
As well as clustered mechanisms, it is possible to run Vert.x in a high-availability mode with -ha. This implies -cluster and allows multiple JVMs to take on the load of individual Vert.x modules, so that if a high-availability group has been created providing two modules and one of those module processes dies, a new one will be instantiated on a different node in the HA cluster.
Using Docker in Cloud Networks
Chris Swan talked about Docker as a container mechanism for software. Under the hood lie Linux containers (using LXC) and some kind of union filesystem such as AUFS, or a snapshotting filesystem such as ZFS or Btrfs. By specifying a Dockerfile, with a set of build instructions (using the RUN command) and an execution step (with the CMD command), it is possible to automate the booting of a container inside an existing machine. Since containers share the host OS kernel rather than each booting a full virtual machine, it is possible to spin up many containers cheaply. For example, a Dockerfile might look like:
FROM ubuntu:12.04
MAINTAINER cpswan
RUN echo 'Hello World'
This would build a new image based on the ubuntu:12.04 image, running echo 'Hello World' as a build step; a CMD instruction would specify what a container started from the image should run before quitting.
Docker is invoked with the docker command, which needs root privileges to run. It is possible to run a pre-configured instance and expose port 1234 using sudo docker run -d -p 1234 cpswan/demoapp, or if the local port needs to be bound then sudo docker run -d -p 1234:1234 cpswan/demoapp could be used instead. Various Docker examples are available at Chris Swan’s github page, and a blog discussing how to implement multi-tier apps in Docker is also available.
Docker is restricted to running on Linux, but in the most recent release a version has been made available for OS X, using VirtualBox to host a Tiny Core Linux image.
Scaling continuous deployment at Etsy
Avleen talked about how Etsy scaled their deployment systems, mainly by identifying bottlenecks in their existing deployment process and being able to speed those up. Etsy is a marketplace for custom-built products.
The Etsy codebase is a collection of PHP files which are deployed and pushed to the Etsy servers, with an Apache front end serving them. The deploy time for this data set is around 15 minutes, so to enable more pushes per day they combine several unrelated changes into each push and coordinate between the owners of those changes to increase the velocity of changes going out. In addition, by separating configuration changes from code changes, two parallel streams of pushes could occur, which freed up more resources for pushing content out.
As a micro-optimisation, they used a tmpfs to serve the newly pushed site more quickly than before, and by hosting a custom Apache module on the server side they were able to atomically switch between old and new versions (named ‘yin’ and ‘yang’) by flipping a configuration bit to point between DocumentRoot1 and DocumentRoot2.
The key takeaway was that if a process is slow, identify through measurement where the bottlenecks are, and then design a system around those bottlenecks as a way of increasing throughput and decreasing latency.
Elasticsearch at The Guardian
Shay Banon and Graham Tackley discussed how The Guardian had used Elasticsearch to provide real-time updates on how readers were interacting with the news site.
The Guardian has 5-6m unique visitors per day to its website, and also has applications for Android, Kindle, iPhone and iPad. Thanks to The Scott Trust, The Guardian is a shareholder-free organisation, and as such took advantage of the early internet to make all its content available on-line for free. As a result it is now the third-largest English-language newspaper website in the world.
To track user interaction with the website, an initial 24-hour ‘hackday’ project was launched which grepped the Apache logs to show the top content over the prior few minutes. The success of that project led to the creation of an Elasticsearch-backed infrastructure for doing the same thing over the last few minutes, hours, days, or weeks.
The Guardian website uses a form of invisible pixel tracking: when a page is loaded and rendered in the browser, an image is requested from the remote server. This causes an event to be posted to Amazon SNS, which posts a message into Amazon SQS, from where it is loaded into an Elasticsearch database. By storing the time, referrer, and other geolocation data it is possible to build up a data set that richly describes who is using the site at any one time.
By using Elasticsearch to filter the content (e.g. all sport references in the last five minutes, or all referrals from the front page), it is possible to build real-time statistics of who is using the site and when. By capturing the geolocation data it is possible to produce a heatmap of where readers are located in the world, and a set of graphs (rendered with D3 in the dashboard) gives a drill-down view of the data, permitting editorial decisions to be made about what content to promote or follow up on.
Elasticsearch provides most of these search functions out of the box, including the date histogram, which can partition requests into time-based buckets. The graphs can be used to correlate the information seen, including being able to find out which particular tweet caused an upsurge in traffic, or when a Reddit post went viral and drove readers to a piece of content.
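As a sketch of the kind of query involved (the field names and interval here are hypothetical, not The Guardian's actual schema), a date histogram over recent page-view events can be expressed directly in Elasticsearch's JSON query DSL:

{
  "query": {
    "range": { "timestamp": { "gte": "now-15m" } }
  },
  "aggs": {
    "views_per_minute": {
      "date_histogram": { "field": "timestamp", "interval": "minute" }
    }
  }
}

Each bucket returned by the aggregation becomes one point on the dashboard graphs, so the same request shape works whether the window is minutes, days or weeks.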
Conference wrap-up
The good news is that my hat was found, so I can go home with a warm head. But it’s the end of a long week, and QCon, whilst very enjoyable, is also very exhausting.
The take-aways from this conference were:
- Micro-services are big. It’s not just SOA in a different guise (which was really about RPC with giant XML SOAP messages) but rather URIs and message passing with JSON data structures. Virtually every talk on distributed architecture or resilience talked about how to break a monolithic application into individual services which could be developed, deployed and monitored individually.
- Agile companies are deploying components many times per day. This isn’t limited to small organisations; this happens at internet-scale companies that are serving hundreds of thousands of requests per second, including financial companies. The key to this is removal of process and pontification and replacing it with automation so easy that a new employee can deploy to production before lunch on their first day.
- Failure happens. Failing to plan for failure is planning to fail. Any system that is based on a monolithic application or runs as a single instance is both non-scalable and non-fault-tolerant by design. The same set of benefits that comes from scaling also comes from failure planning. Testing this by destroying instances in production is an extreme way of demonstrating that the system will keep working in the case of failure, but it gives the monitoring and the processes practice for when unexpected events occur.
- The GPL is effectively dead for corporate-sponsored open-source projects. All of the firms that had created code for their own needs or exposed it for others to use did so under permissive licenses such as Apache, MIT, BSD or EPL. In fact, the only GPL-licensed software mentioned in the talks was OpenJDK, and even that didn’t get a mention by name but under the Java 8 moniker. The only place the GPL is being used by companies is when they actively want to be anti-corporate and are selling subscription support services to large organisations.
Of course QCon London has many tracks, and I only covered a few of them in my write-ups. The videos of the conference will be made available from www.infoq.com (whom I write for) over the next few months. Most of the presentations have been recorded, and the slides should be available from the qconlondon.com website.
Finally, QCon New York is coming up soon – this time, combined with OSGi DevCon. Hope to see you there!