PhD Proposal: Quantifying Flakiness and Minimizing its Effects on Software Testing

Talk
Zebao Gao
Time: 02.01.2017, 10:00 to 11:30
Location: AVW 4172

In software testing, test inputs are passed into an application under test (AUT); the AUT is executed; and a test oracle checks the outputs against expected values. In some cases, the same test case, executed multiple times on the same AUT, passes in some runs and fails in others. This is the test flakiness problem, and such test cases are called flaky tests.
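For illustration only (this example is not from the proposal), a minimal JUnit 4 test can be flaky simply because it races against a background thread; the class and method names here are hypothetical:

```java
import org.junit.Test;
import static org.junit.Assert.assertTrue;

// Hypothetical example: this test is flaky because it races against a
// background thread instead of synchronizing with it.
public class FlakyExampleTest {

    private volatile boolean done = false;

    @Test
    public void testAsyncCompletion() throws InterruptedException {
        new Thread(() -> {
            // Simulated work whose duration varies from run to run.
            try {
                Thread.sleep((long) (Math.random() * 20));
            } catch (InterruptedException ignored) {
            }
            done = true;
        }).start();

        Thread.sleep(10);  // Arbitrary wait: sometimes long enough, sometimes not.
        assertTrue(done);  // Passes or fails depending on thread scheduling.
    }
}
```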

The test flakiness problem makes test results and testing techniques unreliable. Flaky tests may be mistakenly labeled as failed, which increases not only the number of reported bugs testers must check but also the chance of missing real faults. The problem is gaining attention in modern software testing practice, where complex interactions are involved in test execution, and it raises several new challenges: What metric should be used to measure how flaky a test is? What factors cause or influence flakiness? And how can the effects of flakiness be minimized?

This research proposes a systematic approach to quantitatively analyze and minimize the effects of flakiness. First, a novel entropy-based metric is introduced to quantify the flakiness of different layers of test outputs (such as code coverage, GUI state, and invariants). Second, the impact of a common set of factors on test results in interactive system testing is examined. Finally, a novel model of test oracles is introduced to minimize the impact of flakiness.
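The metric's exact definition is not reproduced in this abstract. As a rough sketch, under the assumption that flakiness at a given layer is the Shannon entropy of the distribution of distinct outputs observed across repeated runs (zero meaning a perfectly stable test), it could look like the following; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an entropy-based flakiness score: treat the
// outputs observed at one layer (e.g., covered-line sets, GUI states,
// invariants) across N runs as samples of a distribution and compute
// its Shannon entropy. Zero means perfectly stable; larger is flakier.
public class FlakinessMetric {

    public static double entropy(List<String> observedOutputs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String output : observedOutputs) {
            counts.merge(output, 1, Integer::sum);
        }
        double n = observedOutputs.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));  // log base 2
        }
        return h;
    }

    public static void main(String[] args) {
        // Five runs with identical coverage fingerprints -> entropy 0.
        System.out.println(entropy(List.of("cov:A", "cov:A", "cov:A", "cov:A", "cov:A")));
        // Outputs split across runs -> positive entropy (flaky).
        System.out.println(entropy(List.of("cov:A", "cov:B", "cov:A", "cov:C", "cov:A")));
    }
}
```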

Preliminary work on five open source applications quantifies test flakiness at the three output layers mentioned above. The study empirically analyzes the impact of factors including the system platform, Java version, application initial state, and tool harness configurations, and observes a large impact on AUTs when these factors are left uncontrolled (as much as a 184-line difference in code coverage between runs of the same test cases, and up to 96 percent false positives with respect to fault detection). A second experiment evaluates the effectiveness of new, manually constructed test oracles on a set of open source applications with real faults. The results demonstrate that, by checking a selected subset of software state after execution of the whole test case, the improved oracles greatly reduce both false positives and false negatives.

The remaining work will automatically customize test oracles for each object, instead of using a manually chosen, fixed subset of properties for all objects. The research proposes to use the flakiness metric to identify the most stable properties of each object.
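A minimal sketch of that idea, reusing the hypothetical entropy function above and assuming each run yields a property-to-value snapshot of an object; the names and the stability threshold are illustrative, not the proposal's actual design:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-object oracle customization: for each
// property of a GUI object, compute the entropy of its values across
// repeated runs of a passing test, and keep only the properties whose
// entropy falls at or below a threshold. A customized oracle would
// then check just those stable properties after test execution.
public class OracleCustomizer {

    static final double STABILITY_THRESHOLD = 0.0;  // keep only perfectly stable properties

    // runs.get(i) maps property name -> observed value in run i.
    public static List<String> stableProperties(List<Map<String, String>> runs) {
        List<String> stable = new ArrayList<>();
        for (String property : runs.get(0).keySet()) {
            List<String> values = new ArrayList<>();
            for (Map<String, String> run : runs) {
                values.add(run.get(property));
            }
            if (FlakinessMetric.entropy(values) <= STABILITY_THRESHOLD) {
                stable.add(property);
            }
        }
        return stable;
    }
}
```

Under this reading, an oracle that ignores high-entropy properties trades a little checking breadth for far fewer spurious failures, which is consistent with the reduction in false positives reported in the preliminary results above.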

Examining Committee:

Chair: Dr. Atif Memon

Dept rep: Dr. Ashok Agrawala

Member: Dr. Alan Sussman