The final day of QCon London 2014 is upon us, and as is usual the morning after the night before is never easy. In my case, having both missed my planned train home (and getting home an hour later as a result) as well as having mislaid my hat I decided to take a punt on this morning’s keynote on world 2.0. Apparently cheese was involved.
Lessons learned in scaling Twitter
Brian Degenhardt talked about some of the challenges when scaling Twitter. Initially developed as a monolithic application, by separating the system down into separate services meant that the components could be scaled independently from each other, both at the network layer but also the development layer. A MySQL database was used to hold everything with a rails app fronting the code.
The second generation twitter split up the monolith into different back-end data storage layers, using gizzard as a sharding mechanism for MySQL databases and a redis instance for storing timeline information. Subsequent iterations replaced the front-end components with services as well, which allowed more heavily used services (such as the timeline and user tweet details) to be scaled horizontally when needed.
Fortunately when a request comes in to Twitter the request is composable into different services, and by using non-blocking futures it is possible to scale these by compositing the results of a series of lookups into one chain. By executing many requests in parallel but having that hidden by the underlying futures framework means that the code is easy to understand whilst still taking advantage of multiple threads.
When a write to Twitter happens, it gets split into a lookup of all the subscribers and then writes a copy of each into all the timelines. This involves a fan-out (lookup of affected destinations) and then a sharded write across the database. For providers that are subscribed to the firehose, the feed goes to them as well – around 30Mb/s in total. In the case of the Lady Gaga tweet replying to the Mars rover, this resulted in a set conjunction of around 1.5 million and 40 million followers, which took some time in the fan-out processing.
In terms of measuring the performance of Twitter, once the data stream starts moving into the tens of errors per second (99.999% uptime of 300,000 requests per second), it’s impossible to focus on individual events in terms of underlying analysis. By focussing on a statistical view of the events, it’s possible to get a much better picture of overall system health than if a single event is decoded.
To that extent, Twitter uses a distributed tracing system called Zipkin (demo) which performs tracing throughout the lifetime of a request, based on Google’s Dapper paper which was written about but never open-sourced. As a result, Twitter re-implemented it based on the content of the paper and open-sourced it instead.
The underlying service calls use HTTP for external communication, and once inside the boundary use Thrift for passing RPC messages and using a standardised library that provides load balancing, service discovery and metrics generation and logging. These metrics are collected centrally and can be visualised using histograms or by using a view of the data points with percentiles that show when the system is suffering under a higher load.
Twitter open-sources most of their code for high performance services via their GitHub repositories under an Apache license.
Scaling at Netflix
Ruslan Meshenberg discussed how Netflix leverages mutli-regions to increase availability. In fact, he said that the talk was “Really about failure” and how it’s handled. rather than anything else.
He showed a pyramid of failures, with top incidents being those that caused negative PR (mitigated by active/active pairs and game day practicing), customer service incidents (mitigated by better tools and practices), metrics impact (mitigated by data tagging and feature enable/disabling), and automated failover/recovery processes.
The key should be about how to respond to failure; with enough systems, eventually at least one will have a hardware-related failure with a disk or CPU dying, or natural events such as a lightening strike taking out power to a data centre. It’s possible that human error will be involved (configuration, bad code push etc.) but the key is to be able to recover quickly and painlessly from any events that may occur.
In the case of an Amazon ELB failure, in which the elastic load balancing configuration was lost, Netflix lost connectivity for a few hours on Xmas eve in 2012. To patch this immediately Isthmus was created to front-end all services coming in to ELB to provide a backup level of routing should ELB fail again. This was a spike solution aimed at providing solely ELB related traffic resilience; it grew into Netflix Zuul which provided a load-balancing and front-end re-routing for all of the (non-streaming) Netflix services. In conjunction with various DNS update layers fronted by denominator, which provides a cross-service DNS updating tool, any service can be taken off-line and recovered by updating either routing information or by updating DNS entires to point to a different location (which typically has a 5-10 min TTL).
The service information is stored in a replicated Apache Cassandra instance, which provides active/active regions (with each region being in its own consistency group) and eventual consistency between regions). This is also fronted by a custom EVcache which provides a Memcached API but with tools to permit remote eviction of cached information, and the soon-to-be-open-sourced ribbon client libraries and karyon server libraries are used, which provide a generalised RPC mechanism. This is combined with the configuration information and routing provided by Asgard, and configuration updates by Archaius.
To fail fast in the event of a dependent service failure, a circuit breaker mechanism is used to disconnect the Hystrix library provides a means to measure the performance of individual services and to disconnect them if they go past a certain error threshold or delay. By failing fast, the system is able to detect that an error condition has occurred and re-route the requests to a more successful part.
To test whether this works, various monkeys are used to take out parts of the system, under the Simian Army umbrella. The original Chaos Monkey is used to take out individual nodes in the system, whilst the Chaos Gorilla can be used to take out sets of services or nodes. Should a whole zone need to be taken out, Chaos Kong will wipe out an entire zone to ensure that the system is working as expected, all the time working in production. After all, if failure is a natural event and the system has practice from recovering, then when an unplanned event occurs it will be more likely to keep going on in the same way.
Overview of Vert.X
Tim Fox gave an overview of Vert.X, describing it as a lightweight, reactive application superficially similar to Node.JS and inspired from Erlang. However, unlike Node.JS and Erlang, Vert.X is a polyglot language with implementations and libraries in many different languages and a common communication bus for passing messages between them.
At its core lies an enterprise bus that communicates with JSON messages and which subscribes many individual single-threaded processes called verticles. These can be written in any language supported by Vert.X; messages are passed between them as immutable data structures and so can be materialised into any supported format. In this sense, it is similar to the Actor model, except without any shared state.
Running a vert.x program can be achieved with
vertx run followed by a class
or function that implements the verticle. For interpreted languages such as
for compiled languages like Java and Scala a compilation process is kicked
off to compile the code ahead of time. In development, these can be reloaded
dynamically to facilitate development and debugging.
The event bus can be distributed between different JVMs, enabled by running
-cluster flag. This will discover other JVMs running on the same
host (though cross-host clustering is also possible) and sync up messages
between the two to allow for distributed processing.
Modules are also possible in Vert.X, using a
mod.json descriptor file
and zero or more resources or verticles. These can be run with
and will start up any entry points discovered. It is also possible to refer
to modules stored in Maven central, using a
for example, running
vertx runmod io.vertx~hello-mod~1.0 will download
and run the
file and start executing the contents. It is also possible to generate a
standalone application by running
vertx fatjar which will download all the
dependencies and create a single jar containing the bootstrap library code and
As well as clustered mechanisms it is possible to run vertx in a high
availability mode, with
-ha. This implies
-cluster and allows multiple JVMs
to take the load of individual vertx modules so if a high-availability group
has been created that provides two modules and one of those module processes
dies, a new one will be instantiated on a different node in the HA cluster.
Using Docker in Cloud Networks
Chris Swan talked about Docker as a container mechanism for software. Outside the box lies linux containers (using LXC) and some kind of union filesystem such as AUFS or a snapshot system such as ZFS or BTRFS.
By specifying a dockerfile, with a set of build instructions (with the RUN command) and an execution (with the CMD command) it is possible to automate the booting of a container inside an existing machine. Since many of the host OS libraries and functions are used it is possible to spin up many containers for invocations. For example, a dockerfile might look like:
1 2 3
This would spin up a new container based on the
ubuntu:12.04 image, and run
the program ‘Hello World’ and the quit.
Docker is invoked with the
docker command, which needs root privileges to
run. It is possible to run a pre-configured instance and expose port 1234 using
sudo docker run -d -p 1234 cpswan/demoapp or if the local port needs to be
sudo docker run -d -p 1234:1234 cpswan/demoapp could be used
instead. Various docker examples are availble at Chris Swan’s
github page, and a blog discussing
how to implement multi-tier apps in docker
is also available.
Docker is restricted to running on Linux, but in the most recent release a version has been made available for running on OSX using Virtual Box to host a Tiny Core Linux image.
Scaling continuous deployment at Etsy
Avleen talked about how Etsy scaled their deployment systems, mainly by identifying bottlenecks in their existing deployment process and being able to speed those up. Etsy is a marketplace for custom built products.
The Etsy codebase is a collection of PHP files which are deployed and pushed to the Etsy servers, with an Apache front end displaying them. The deploy time of this data set is around 15 minutes, so to enable more pushes per day they combine several unrelated changes in each push and synchronise between the members of those changes to increase the velocity of changes going out. In addition, by separating out configuration changes versus code changes, two parallel streams of pushes could occur, which freed up more resources for being able to push content out.
From a micro optimisation, they used a
tmpfs to be able to serve the newly
pushed site quicker than before, and by hosting a custom apache module at the
server side were bale to atomically switch between old and new versions (named
‘yin’ and ‘yang’) by flipping a configuration bit to point between
DocumentRoot1 and DocumentRoot2.
The key deliverable was that if a process is slow, identify through measurement where the bottlenecks are and then design a system around those bottlenecks as a way of increasing throughput and decreasing latency for the system.
Elastic Search at The Guardian
The Guardian has 5-6m unique visitors per day to its website, and also has applications for Android, Kindle, iPhone and iPad. Thanks to The Scott Trust The Guardian is a shareholder free organisation, and as such took advantage of the early internet to make all its content available on-line for free. As a result it is now the third largest English website in the world.
To track user interaction with the website, an initial 24h ‘hackday’ project which grepped the Apache logs to show the top content in the prior few minutes was launched. The success fo that project led to the creation of an Elasticsearch backed infrastructure for performing the same thing over the period of the last few minutes to the last few hours, days, or weeks.
The Guardian website uses a form of invisible pixel tracking; when a page is loaded and rendered via the browser, an image is requested from the remote server. This causes an event to be posted to Amazon SNS to post a message into Amazon SQS which then loads it into an Elasticsearch database. By storing the time, referrer, and other geolocation data it is possible to build up a data set which richly describes who is using the site at any one time.
By using Elastic search to search the content for specific filters (e.g. all sport references in the last five minutes, or all references from the front page) it is possible to build real-time statistics of who is using the software and at what time. By capturing the geolocation data it is possible to get a heatmap of where readers are located in the world, and by using a set of graphs (displayed by D3 in the dashboard) be able to give a drill-down view of the data and to permit editorial decisions to be made about what content to promote or to follow up on.
Elastic search provides most of these search functions out-of-the-box, including the date histogram which can partition requests into different time-based buckets. The graphs can be used to correlate the information seen, including being able to find out what particular tweet caused an upsurge in information, or when a reddit page went viral to drive content.
The good news is that my hat was found so I can go home with a warm head. But it’s the end of a long week and QCon whilst very enjoyable is also very exhausting.
The take-aways from this conference were:
- Micro-services are big. It’s not just SOA in a different guise (which was really about RPC with giant XML SOAP messages) but rather URIs and message passing or JSON data structures. Virtually every talk on distributed architecture or resilience talked about how to break a monolithic application into individual services which could be developed, deployed and monitored individually.
- Agile companies are deploying components many times per day. This isn’t limited to small organisations; this happens at internet-scale companies that are serving hundreds of thousands of requests per second, including financial companies. The key to this is removal of process and pontification and replacing it with automation so easy that a new employee can deploy to production before lunch on their first day.
- Failure happens. Failing to plan for failure happening is to plan for failure. Any system that is based on a monolithic application or runs in a single instance is both non-scalable and non-fault tolerant by design. The same set of benefits that come from scaling also comes from failure planning. Testing this by destroying instances in production is an extreme way of demonstrating that the system will work in the case of failure, but it gives monitoring and processes practice for when unexpected events can occur.
- The GPL is effectively dead for corporate sponsored open-source projects. All of the firms that had created code for their needs or exposed it for others to use were licensed under permissive licenses such as Apache, MIT, BSD or EPL. In fact, the only GPL licensed software at the talk was OpenJDK, and even that didn’t get a mention by name but under the Java 8 moniker. The only place the GPL is being used by companies is when they actively want to be anti-corporate and are selling subscription support services to large organisations.
Of course QCon London has many tracks, and I only covered a few of them in my write-ups. The videos of the conference will be made available from www.infoq.com (whom I write for) over the next few months. Most of the presentations have been recorded, and the slides should be available from the qconlondon.com website.