(note, mbligh removed abstract - same as paper proposal) Andypants, this is becoming a bit of a brainstorm list as opposed to formal text just yet, I think that's OK for the first couple of days personally, hope you're OK with it.
Introduction
(note, I think I should shorten this substantially so it's not one long meaningless jibber ;-))
Changes in the 2.6 development process have made fully automated testing vital to the ongoing stability of Linux. The pace of development is constantly increasing, with a rate of change that dwarfs most projects. The lack of a separate 2.7 development kernel means that we are feeding change more quickly and directly into the main tree. Moreover, the breadth of hardware types that people are running Linux on is staggering. It is therefore vital that we catch at least a subset of introduced bugs earlier in the development cycle, and keep up the quality of the 2.6 kernel tree. Faster identification of problems means that the issue is still fresh in the developer's mind, and the offending patch is much more easily removed (not buried under thousands of other changes).
With fully automated testing, it is also possible to use techniques that would be impractical with manual testing to make debugging and problem identification easier, such as an automated binary chop search amongst thousands of patches to weed out offending changes. We can compile hundreds of different configuration files on every release, cross-compiling for multiple architectures. We can identify performance regressions and trends, adding statistical analysis. The tests needed run the gamut from compile testing to boot testing to regression, function, performance, and stress testing, with loads ranging from disk intensive to compute intensive to network intensive.
We need an open-source test harness to enable sharing of tests, and the ability to "pass" the reproduction of issues from developer to developer. This paper will cover the benefits of automated testing, the problems of running the tests, and how to communicate the issues back to the development community in an effective fashion. Much of the work above has been started, and exists today. I will present the open-source test harness used for such testing, and the publication methods used on http://test.kernel.org
The Problem
- (What are we trying to do, why bother)
- It is critical for any project to keep a high level of software quality, and consistent interfaces to other software that it has to interact with. There are several methods for increasing quality, but none of these works in isolation; we need a combination of:
- Skilled developers, carefully developing code
- Code review
- Static code analysis
- Regression testing
- Functional testing of new functions
- Performance testing
- Stress testing
Whilst testing will never catch all bugs, it will raise the overall level of quality. Improved code quality results in a better experience not only for users, but also for developers.
- Why it's important to do regular testing.
- It is important to catch bugs as soon as possible after they are created. This results in:
- Less replication of the bad code into other codebases
- Fewer people will encounter the bug if it is fixed faster
- The code is still fresh in the mind of the developer who wrote it
- Less likelihood of other subsequent changes interacting with it
- Particular challenges for Linux
- Linux has a constantly high rate of change
- Linux runs on a staggeringly diverse array of hardware.
- Challenges for an open-source development model.
- There is no mandate to do testing on your own code
- There is no easy funding model for doing regular testing, as in a large corporation's System Test group.
- Machine-power over manpower - scaling the load.
- Increasingly, machines are cheap, but manpower is expensive.
- Combining multiple factors (rate of change, diversity of useful tests, diversity of hardware to run it on) results in a staggering number of useful testing combinations.
- We have thousands of potential contributors and testers around the planet, but need to corral this effort into something the developers can use and understand.
- Linux's evolutionary approach to software development fits well with high-frequency regression testing.
- Small incremental changes make it easier to isolate the offending change
- Frequent releases (-git snapshots are now twice daily) give easy sets of packages to test.
- The workflow (mbligh to draw flowchart, hourglass shape)
- Developer patches
- Staging trees (-mm, acpi tree, libata tree, input tree, etc)
- Mainline
- Distros
- Users
- More flow (maybe shoved into above diagram)
- Identify problem
- (possibly reject change based on this)
- Debug
- Patch
How automated testing works in general
- who is doing testing
- Distros are doing the most testing
- ISVs and h/w vendors are doing testing too, but normally based on distros, which is TOO LATE.
- what tests are used
- lists of tests, what they contribute in terms of the PROBLEM
- We are often trying to verify that the machine and OS stack is fit for a particular workload. The real workload is often difficult to set up, may require proprietary software, is overly complex, and does not give sufficiently consistent, reproducible results, so we use a simplified simulation of that workload encapsulated within a test.
- Functional / Unit tests
- Verify the function of one particular part / function of the system.
- LTP
- Crashme
- Performance tests
- Verify the performance of a workload (check for regressions, provide profiling information for tuning, etc).
- Kernbench
- AIM7 / reaim
- bonnie, tbench, iobench
- netperf
- Stress tests
- See if we can make the system fall over in an over-tuned situation, up to a DoS attack.
- Kernbench with make -j
- Most of the perf tests
- Cerberus ?
- how are various 'results' made
- build/boot is easy from any test
- performance is over many tests
- repeats within one test (eg run kernbench 10 times) give std dev as well as averages.
- Automated search for individual offending change is practical (log2 n)
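The last two points can be sketched in a few lines of Python. This is only an illustration, not the harness's own code: the names summarize, first_bad_patch and is_bad are hypothetical, the timings are made up, and the search assumes the patches apply in a fixed order and that the tree stays bad once the offending patch is in.

```python
import statistics


def summarize(timings):
    """Return (mean, sample standard deviation) for repeated runs of one test."""
    return statistics.mean(timings), statistics.stdev(timings)


def first_bad_patch(patches, is_bad):
    """Binary chop for the first patch that makes the tree bad.

    is_bad(n) is a hypothetical callback that builds and tests the tree with
    the first n patches applied. Assumes is_bad(0) is False,
    is_bad(len(patches)) is True, and the tree stays bad once it goes bad.
    Returns (offending patch, number of trees tested).
    """
    good, bad = 0, len(patches)  # patch counts known to be good / bad
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return patches[bad - 1], steps


if __name__ == "__main__":
    # Made-up kernbench elapsed times (seconds) from repeated runs.
    mean, stdev = summarize([101.2, 100.8, 101.5, 100.9, 101.1])
    print(f"kernbench: {mean:.2f}s +/- {stdev:.2f}s")

    # Pretend patch-0612 introduced the problem.
    patches = [f"patch-{i:04d}" for i in range(1000)]
    culprit, steps = first_bad_patch(patches, lambda n: n > 612)
    print(culprit, "found after", steps, "test runs")
```

With 1000 patches the search needs at most 10 build-and-test cycles, which is what makes it practical to run unattended.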
How t.k.o works
APW: I am taking a first stab at fleshing this out... APW: Ok, a first stab at this section is now in the HTML (see andypants topic)
- architecture of the solution (need big fat dia diagram here - mbligh will do)
- Elements:
- Server (describe a simple queuing system)
- Client (autotest)
- Conmux
- Results gather and publish
- Results analysis
- running async tests
- --- Andy, what does async mean here ? --- apw: it means that tests are just run as they are needed, and later grouped into result sets as a separate phase.
- merging later to 'cute' graphs/matrix
- examples of how it 'works'
- describe the perf issue you found, how it went round and round
- from you to me to patcher, to you, to me 'bad'
- how we find lots of build problems before they hit mainline
- linking to community
- how is it triggered
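As a rough illustration of the "async" point above (tests run whenever they are needed, and results are merged into a matrix in a separate phase), here is a minimal Python sketch. The record layout (kernel, machine, test, status), the machine names and the function name group_results are assumptions made for the example, not the format t.k.o actually uses.

```python
from collections import defaultdict


def group_results(results):
    """Group flat result records into a kernel x machine matrix.

    results is an iterable of (kernel, machine, test, status) tuples in
    whatever order the tests happened to finish (hypothetical layout).
    Returns {kernel: {machine: {test: status}}}.
    """
    matrix = defaultdict(dict)
    for kernel, machine, test, status in results:
        cell = matrix[kernel].setdefault(machine, {})
        cell[test] = status
    return {kernel: dict(row) for kernel, row in matrix.items()}


if __name__ == "__main__":
    # Results arrive out of order because each test ran when a machine was free.
    finished = [
        ("2.6.16", "machineA", "build", "pass"),
        ("2.6.16-git1", "machineA", "build", "fail"),
        ("2.6.16", "machineB", "build", "pass"),
        ("2.6.16", "machineA", "kernbench", "pass"),
    ]
    for kernel, row in group_results(finished).items():
        print(kernel, row)
```

Keeping the grouping separate from the test runs means a late or re-run result only changes one cell of the matrix.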
Why it's currently pathetically inadequate
(IBM only, far too few tests, crap analysis, restrictive data feed, etc).
How it should be
- anyone testing contributing to the results set
- generating a RAG for each architecture/release
- drill down to each test contributing to that status
- tests should be first class -- not jobs
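One possible way to derive a RAG (red/amber/green) status with drill-down is sketched below in Python. The rules are an assumption made for illustration only (red if any test failed, amber if nothing ran or some result is neither pass nor fail, green otherwise), and the function names are hypothetical.

```python
def rag_status(test_results):
    """Roll per-test statuses for one architecture/release up to a RAG colour.

    test_results maps test name -> "pass", "fail", or anything else
    (for example "missing"). The thresholds here are assumptions.
    """
    statuses = set(test_results.values())
    if "fail" in statuses:
        return "red"
    if not statuses or statuses != {"pass"}:
        return "amber"
    return "green"


def drill_down(test_results):
    """List the tests that stop the status from being green."""
    return sorted(name for name, status in test_results.items() if status != "pass")


if __name__ == "__main__":
    results = {"build": "pass", "boot": "pass", "kernbench": "fail", "LTP": "missing"}
    print(rag_status(results), drill_down(results))
```

Treating each test as its own entry, rather than as part of a job, is what makes the drill-down a simple lookup.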
Other harnesses
APW: am taking a stab at this section...
- Other test harnesses, and what's wrong with them
- General points
- Closed Source
- Difficult to understand / maintain
- Don't put results in a consistent parsable framework
- Don't consistently detect errors
- Are an unremovable part of a large complex harness.
- LTP
- what it is
- why aren't we using it
Open client
- why we want it -- first step to anyone contributing
- what it will offer
- anyone can test
- can send a test case to patch owner
- common results
- Public and private test repos, allow proprietary tests, encourage open ones.
- Will do cross-architecture, and cross-distro. Won't do other than Linux - too complex / obfuscated.
Future Linking
- output to Bugzilla etc
- input ... pre testing -- al viro wants to test what
- more than mainline and -mm
Future
- open client
- open test repository
- cleverer
- automatic bisection??
- expanding testing in general
