
AlBlue’s Blog

Macs, Modularity and More

Continuous distributed quality assurance

2006, test

Adam Porter (from the University of Maryland) presented a fascinating talk on Distributed Continuous Quality Assurance and, in the process, raised many valid points about the testing of existing systems. The key point was that in a complex system with many possible configurations, simple unit tests are not enough to ensure the quality of the final product. The talk was divided into three parts: distributed continuous quality assurance, performance-oriented regression testing, and reliable effects screening. There was a lot of information in a short period of time; no doubt the video will be available, which will make it much easier to digest (links to the presentation should also be available in the future, along with other papers from Adam's homepage).

Distributed continuous quality assurance is a process that attempts to exercise tests of the system with many different configurations. A complex piece of software such as ACE+TAO+CIAO (or ATC for short; a CORBA implementation) has a huge codebase (2 million lines of code) and over 500 configuration options (both run-time and compile-time). Not only that, but because the system is developed as open source, updates and patches arrive very regularly, which keeps the project moving; yet only a small fraction of the tests exercise different parts of the configuration space.

To solve this problem, a version of the software is compiled and run with a set of configuration options, tests are executed, and the results are fed back; the process then iterates over a variety of different configuration options. Obviously, this kind of process is very expensive in terms of both time and computing power, so the only practical way of achieving it is to run the tests over a grid of different computer systems. The scheduler determines which configuration options to try, and then divides the QA space into a number of test runs, which are executed by the grid. The scheduler can use different heuristics to suggest which combinations to try; for example, exercising the entire configuration space may not be feasible, but trying a sample of distinct configurations may prove useful. Another approach is to partition the tests using t-way covering arrays, which reduce the number of combinations to try. For example, with three options A, B and C that each take one of three values, a covering array can provide all pairwise combinations of (A,B), (B,C) and (A,C) using only 9 configurations (instead of the 3³ = 27 needed to enumerate every possible combination).
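As a rough illustration of the covering-array idea, here is a minimal Python sketch. It assumes three options A, B and C with three values each (labelled 0, 1, 2), and builds the rows with a simple Latin-square construction, C = (A + B) mod 3. That construction and the function names are my own choices for illustration; they are not necessarily how the scheduler described in the talk generates its arrays.

```python
from itertools import combinations, product

LEVELS = 3  # assumption: each of the options A, B, C takes one of three values


def pairwise_array(levels=LEVELS):
    """Build rows (a, b, c) with c = (a + b) mod levels (a Latin-square construction)."""
    return [(a, b, (a + b) % levels) for a, b in product(range(levels), repeat=2)]


def covers_all_pairs(rows, levels=LEVELS):
    """Check that every pair of columns shows every pair of values at least once."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        seen = {(row[i], row[j]) for row in rows}
        if len(seen) != levels * levels:
            return False
    return True


rows = pairwise_array()
for row in rows:
    print(dict(zip("ABC", row)))
print(len(rows), "configurations instead of", LEVELS ** 3)
print("all pairs covered:", covers_all_pairs(rows))
```

Each of the 9 rows is one configuration to build and test; dropping any single row leaves at least one pair of values untested.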


In fact, running this system against ATC found a number of bugs (some of which were new) related to the configuration options and the option processing code itself.

Performance-oriented regression testing can determine whether a change to the system causes undesirable side-effects on the performance of the system. The goal is to determine whether an upgrade degrades performance before the patch makes it into the system. (My own observation is that relatively few projects test for performance; most of the 'green bar' testing done with tools like JUnit and TestNG only tests the functional aspects and ignores anything performance-related. One project that does test performance is Eclipse, at least for the JDT and platform; but it's fairly unusual in that respect.)

Running performance tests automatically is non-trivial. Not only do you need to leave tests running for some time in a real system to gain accurate performance metrics, but the configuration options in use may affect the result in ways that are unrelated to the patch. The overall result may also be affected by how heavily loaded the machine was at the time the test was run, and so forth. However, if all that is needed is a flag for further investigation, some of these concerns can be waived; and in any case, the test is repeatable, so more accurate measurements can be obtained if needed.
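A flag of that kind can be quite crude. The sketch below is my own assumption rather than anything shown in the talk: it compares the median of several repeated timings before and after a patch, and flags the patch when the median gets worse by more than a chosen tolerance. The function name, the 10% tolerance and the timings are all made up for illustration.

```python
from statistics import median


def flag_regression(baseline, candidate, tolerance=0.10):
    """Return True if the candidate's median time exceeds the baseline's by more than tolerance."""
    if not baseline or not candidate:
        raise ValueError("need at least one timing for each build")
    return median(candidate) > median(baseline) * (1.0 + tolerance)


# Made-up timings in milliseconds, purely for illustration.
before = [101.0, 99.5, 100.2, 140.0, 100.8]  # includes one outlier from a busy machine
after_ok = [100.9, 101.4, 99.8, 100.1, 102.0]
after_slow = [118.0, 121.5, 119.2, 120.4, 117.9]

print(flag_regression(before, after_ok))    # False
print(flag_regression(before, after_slow))  # True
```

Using the median rather than the mean means that a single run on a heavily loaded machine does not trigger the flag on its own; anything that is flagged can then be re-run under more controlled conditions.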

Testing the performance of every aspect of the system is likely to be time-consuming, so a yardstick against which performance can be approximated is useful. In a full analysis of the performance of ATC, only a few parts of the system actually affected the overall performance (cf. the fact that not all code is on the critical path). Thus, once you have found the metrics that are indicative of overall performance, it is only necessary to test for performance against those metrics (which can be considerably quicker). In an analysis of ATC, a noticeable performance drop was highlighted during development after a set of patches was committed; running the analysis with the reduced set of performance indicators highlighted exactly the same problem, but took only 2 minutes to complete instead of 200 hours (thus making it practical to integrate into a build-and-test process).
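To give a feel for how the important factors might be picked out, here is a small Python sketch that estimates the main effect of each option: the average change in a metric when that option is switched on. Everything in it is hypothetical: the option names, the `measure` function (a stand-in latency model, not ATC data) and the threshold. It also enumerates every combination, which only works for a handful of options, so it should not be read as the screening method from the talk.

```python
from itertools import product

OPTIONS = ["opt_a", "opt_b", "opt_c", "opt_d"]  # hypothetical on/off options


def measure(config):
    """Stand-in for a real benchmark: a made-up latency model in milliseconds."""
    return 100.0 + 0.5 * config["opt_a"] + 30.0 * config["opt_b"] + 2.0 * config["opt_d"]


def main_effects(options, measure_fn):
    """Average change in the metric when each option goes from 0 to 1."""
    configs = [dict(zip(options, bits)) for bits in product((0, 1), repeat=len(options))]
    results = [(cfg, measure_fn(cfg)) for cfg in configs]
    effects = {}
    for opt in options:
        on = [y for cfg, y in results if cfg[opt] == 1]
        off = [y for cfg, y in results if cfg[opt] == 0]
        effects[opt] = sum(on) / len(on) - sum(off) / len(off)
    return effects


def important_options(effects, threshold):
    """Options whose effect is at least the threshold, largest effect first."""
    chosen = [opt for opt, eff in effects.items() if abs(eff) >= threshold]
    return sorted(chosen, key=lambda opt: -abs(effects[opt]))


effects = main_effects(OPTIONS, measure)
print(effects)
print(important_options(effects, threshold=5.0))
```

With this made-up model only `opt_b` clears the threshold, so later regression runs would only need to vary that one option instead of all sixteen combinations.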

Of course, more write-ups of this talk and others can be found at Adam's homepage at the University of Maryland. The video is now available.