Course Number: CMSC737
Meeting Times: Tue. Thu. - 9:30AM - 10:45AM (CSIC 2107)
Office Hours: Tue. Thu. - 10:45AM - 12:00PM (4115 A. V. Williams Building)
Catalog Course Description: This course will examine
fundamental software testing and related program analysis techniques. In
particular, the important phases of testing will be reviewed, emphasizing the
significance of each phase when testing different types of software. The course
will also include concepts such as test generation, test oracles, test
coverage, regression testing, mutation testing, program analysis (e.g.,
program-flow and data-flow analysis), and test prioritization.
Course Summary: This course will examine fundamental
software testing and program analysis techniques. In particular, the important
phases of testing will be reviewed, emphasizing the significance of each phase
when testing different types of software. Students will learn the state of the
art in testing technology for object-oriented, component-based, concurrent,
distributed, graphical-user interface, and web software. In addition, closely
related concepts such as mutation testing and program analysis (e.g.,
program-flow and data-flow analysis) will also be studied. Emerging concepts
such as test-case prioritization and their impact on testing will be examined.
Students will gain hands-on testing/analysis experience via a multi-phase
course project. By the end of this course, students should be familiar with the
state-of-the-art in software testing. Students should also be aware of the
major open research problems in testing.
The course grade will be determined as follows: 25% midterm exam, 25% final exam, 50% project.
Credits:
3
Prerequisites:
Software engineering CMSC435 or equivalent.
Status with
respect to graduate program: MS qualifying course (Midterm+Final exam), PhD core (Software Engineering).
Syllabus:
The following topics will be discussed (this list is tentative and
subject to change).
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell: In this work (and in any further work), where was importance placed with regard to the rate of false positives versus false negatives? Which one is more detrimental to the overall results? In other words, which was (and will be) the focus of efforts to reduce its rate in the work (and future work)?
Answer
Outline: We would like to claim a fairly
``correct'' system (so low false positive rates), which can be attributed in
part to the validation through monitor models, as well as the iterative process
in which we ultimately mine for the association rules. Furthermore, because of the structurally-guided test generation plus iterations, we
would like to also claim a fairly ``complete'' system as well (so also low
false negative rate). The relative importance placed on FP vs. FN rates is not really a concern for us; what can be said is that we trust the correctness for any given number of iterations of our approach (which has been shown to have different FN rates), while the completeness comes through these iterations of structurally-guided tests.
2.
Amanda
Crowell: Briefly describe some of the
drawbacks of using models and how Huang et al. addressed the drawbacks in their
research.
3.
Christopher
Hayden: While there seems to be a good deal of information regarding extraction of requirements from test cases, there doesn't seem to be much discussion surrounding the initial generation of the test cases themselves. How were
the test cases used in your experiment generated? More generally, is there any relationship between the
criteria used to generate test cases and the ability of those test cases to
reveal invariants? For example, if
I generate one test suite using data flow criteria, and I generate a second
test suite using branch coverage criteria, how do those test suites compare in
their ability to reveal invariants?
4.
Rajiv
Jain: I believe that you used two single variables for associations in your research. If A and B showed an association and B and C showed an association, did it then hold that A and C showed an association in all cases? Was this caught through your test model?
Answer
Outline: The situation where we find rules
A -> B, as well as B -> C does not come up in our work. Because our work falls under what is
considered ``class association rule mining,'' the consequent of any given
association rule is of some type, and we never have two predicates in a given
association rule of the same type.
For this work, the right hand side is always the new state that we
transition to, and we do not consider rules that have ``new state'' information
on the left hand side.
Transitivity of rules is therefore not an issue for us.
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy
7.
Bryan
Robbins: One of the primary decision
points of the presented requirements abstraction framework is the decision of
how to generate test cases. The authors explore one effect of this decision by
comparing the similarity of invariant sets generated from test case sets that
provide partial or full coverage, respectively. Suppose we define a measure of
"% covered" for a given set of random test cases. How would we expect
%covered to correlate with the similarity of resulting invariant sets, given
the experimental findings of this study? Would this relationship capture all of
the authors' findings regarding invariant set similarity, or only part?
Answer
Outline: We have shown that for values of %covered approaching 1, the similarities are also close to 1 for our structurally guided tests, while for lower %covered values (using random generation), the similarities of the invariant sets are lower. This is the expected behavior; however, for a given run we may find sets that are very similar. Obviously, as the accuracy of the invariant sets approaches 1 (with a low FP rate), the similarities will all be close to 1 as well. As the accuracy decreases, the variation increases, and because a given %covered value could entail a range of accuracies, it would not be surprising to see variation for %covered values not near 1.
8.
Praveen
Vaddadi: How is the strength of an association rule related to its support and confidence?
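For reference, support and confidence are the standard measures behind rule strength: for a rule A -> B, support is the fraction of records containing both A and B, and confidence is the fraction of records containing A that also contain B; a rule is usually called strong when both exceed chosen minimum thresholds. A minimal sketch in Python (the transactions are invented for illustration):

# Support and confidence of a rule A -> B over a set of transactions.
# The transactions below are hypothetical.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

lhs, rhs = {"A"}, {"B"}
print("support(A -> B)    =", support(lhs | rhs, transactions))   # 3/5 = 0.6
print("confidence(A -> B) =", confidence(lhs, rhs, transactions)) # 3/4 = 0.75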
9.
Subodh Bakshi
10.
Samuel
Huang: None
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae: Does one have to consider independence/interference between recovered invariants before aggregating, as true invariants, all candidate invariants that were not found to be invalid in any experiment?
14.
Hu,
Yuening: Your experiment was run 5 times. What is the difference among the runs? Is each run based on the previous one, or is each run done independently as a repetition of the previous one? If the latter, why do you use Jaccard similarity for further analysis?
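For context on the metric itself: the Jaccard similarity of two invariant sets is the size of their intersection divided by the size of their union, so it measures how closely two independently produced invariant sets agree. A minimal sketch (the invariant strings are invented):

# Jaccard similarity between two sets of mined invariants (hypothetical values).
def jaccard(s1, s2):
    """|s1 & s2| / |s1 | s2|: 1.0 for identical sets, 0.0 for disjoint sets."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0

run1 = {"state != null", "x >= 0", "mode in {A, B}"}
run2 = {"state != null", "x >= 0", "y == x + 1"}
print(jaccard(run1, run2))  # 2 shared out of 4 distinct invariants = 0.5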
15.
Liu,
Ran: How does one determine if any
association rules are invariants or not?
Answer
Outline: We use the term ``invariants'' only at the end of the entire iterative process of repeated test case generation, mining, and validation. At any given iteration, we are confident of the most recently mined association rules after the validation step. This set of invariants is then further refined and augmented in future iterations. Following the last validation step performed, we conclude that this set of association rules comprises the ``invariants.''
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep: In terms of coming up with test cases to discover invariants, was the length of the test cases considered? Do you think longer test cases will discover more invariants?
Answer
Outline: Rather than focusing on test case
length, we focused on test suite coverage. We felt that this was more indicative of exploring the state
space of the design model.
However, one would expect that with longer test cases, more of the state
space would be uncovered as well.
We do show that we get almost complete coverage with our iterative
approach using structural coverage.
18.
Guerra
Gomez, John Alexis: This paper defines an invariant to be a characteristic that is always true for all tests executed. Consider a less strict definition of an invariant, i.e., a characteristic that is mostly true (e.g., 90% or more of the time). What would be the advantages of such a modified definition? And disadvantages?
Questions posed by
students:
1.
Nathaniel
Crowell: Generally, how has the selection of Python for software development affected the project? Specifically, how have known flaws in various versions of Python (e.g., memory leaks) guided your development and testing process?
2.
Amanda
Crowell: What kind of testing was done to
verify that the signals being sent to the device were actually causing the
correct vibrations on the device?
3.
Christopher
Hayden: In your presentation, you noted
that you are currently working on implementing what appear to be large system
tests and that you plan to implement GUI testing at some point in the
future. However, I didn't see
anything about small system tests or unit tests. Are these something you plan on doing in the future as
well? If so, will you be using a
particular unit testing framework to generate the
tests, execute the tests, and recover the metrics?
4.
Rajiv
Jain: What makes testing Python different from testing any other programming language such as C/C++ or Java?
5.
Shivsubramani Krishnamoorthy: You displayed the code coverage with one of the input .png images. So, am I right in assuming that the coverage is based on the input image? If you provide a different image, will the coverage differ? I am not able to imagine how the coverage would differ that way. It could be because I don't know the modules in your source code. Can you shed some light on this?
6.
Srividya Ramaswamy: Were the techniques you used to generate test cases for your large system tests similar to the ones we have seen in class? (You mentioned that your program uses command-line operations after receiving an image to process. That immediately made me think of the Category-Partition Method, as the example we saw for that method was a Unix command.) If you didn't use any of the techniques we have seen in class, could you briefly describe the technique you used?
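As a reminder of the method mentioned in the question: the category-partition method splits each input parameter (category) into representative choices and enumerates the allowed combinations as test frames. A minimal sketch, with categories and constraints invented for a hypothetical command-line image tool:

# Minimal category-partition sketch; categories, choices, and the constraint are invented.
from itertools import product

categories = {
    "image_format": ["png", "jpg", "bmp"],
    "image_size": ["empty", "small", "very_large"],
    "output_device": ["real_device", "dummy_device"],
}

def allowed(frame):
    # Example constraint: an empty image is only tested against the dummy device.
    return not (frame["image_size"] == "empty" and frame["output_device"] == "real_device")

names = list(categories)
frames = [dict(zip(names, combo)) for combo in product(*categories.values())]
frames = [f for f in frames if allowed(f)]
print(len(frames), "test frames")  # 3*3*2 = 18 raw combinations minus the excluded ones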
7.
Bryan
Robbins: As others have mentioned, you
focused here on Large System Tests. Interpreted languages have a pretty nice
quality in that they can be explored interactively using the language's
interpreter. Have you considered how you could leverage some pretty simple
scripting in order to perform your future testing at a lower level?
8.
Praveen
Vaddadi: Does the bytecode of a Python program (*.pyc) contain information about statement coverage? If not, how do we get line coverage from the compiled bytecode?
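As general background (not specific to this project): .pyc files only cache compiled bytecode; line coverage in Python is normally collected at run time through the interpreter's trace hook, which is the mechanism tools such as coverage.py build on. A minimal sketch of that mechanism:

# Collecting executed source lines via the interpreter's trace hook.
# Real coverage tools are far more elaborate; this only shows that coverage data
# comes from runtime tracing of source lines, not from the .pyc file itself.
import sys

executed = set()

def tracer(frame, event, arg):
    if event == "line":
        executed.add((frame.f_code.co_filename, frame.f_lineno))
    return tracer

def classify(x):
    if x < 0:
        return "negative"
    return "non-negative"

sys.settrace(tracer)
classify(5)              # exercises only the "non-negative" branch
sys.settrace(None)

print(sorted(line for _, line in executed))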
9.
Subodh Bakshi: You said that you have test cases in which the program receives images, processes them using command-line operations, and then sends the codified image to the default IRIS device. You also said that you have a dummy device as the oracle. But since you do not do a diff, how exactly is reference testing carried out?
10.
Samuel
Huang: Your work seems to step into many
different domains, notably touching upon challenges from both Computer Vision
and HCI. Given that you are seeking to avoid increasing the sophistication of the hardware (to prevent the cost of building a unit from rising further), perhaps you could draw upon some
of the existing approaches from these other fields to improve the software's
sophistication, especially at the user interface level. Even if additional controls are not
possible for the device currently, having a demo that simulates what would
happen if buttons were pressed may be of interest to potential investors. Does
this seem feasible?
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: The images sent to the IRIS device are very pixelated. Do you think anti-aliasing would work for the vibration images the same way it does for vision-based images?
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: Do you get the line coverage report based on the "*.pyc" files? For Java and Emma, we need to compile the source code with "debug=on" and then instrument it before the line coverage report can be obtained. Do you need to do the same thing?
15.
Liu,
Ran: You mentioned that there is a dummy
to capture the codified image output and keep the output in a log. What will
you do after that? Will you do reference testing by comparing the output to the
test oracle or any other technique to see if there are any faults?
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions posed by
students:
1.
Nathaniel
Crowell: Based on the cited example that "3 test sets with 5 users" produces better coverage than either "1 test set with 15 users" or "15 test sets with 1 user", could the reason for this just be that what is really important is diversity along these two test axes rather than some "magic number" of 5 users?
2.
Amanda
Crowell: Both in the paper and in your presentation, it was mentioned that participant recruitment plays a big role in the results of usability testing. Did the researchers offer any suggestions as to what effective recruiting methods should be? What is your opinion on how one should recruit usability testers?
3.
Christopher
Hayden: The paper treats the number of
users and the number of user tasks as independent of one another in its
statistical calculations. It seems
that this might be suboptimal in the sense that clearly both are important
components in usability testing.
Would it make more sense to model these two quantities as related to one
another, say through a higher dimensional analysis that considers both
quantities simultaneously? For instance,
one could create a three dimensional plot where the axes are number of users,
number of user tasks, and number of problems found and perform similar
statistical analysis to that performed in the paper. Are you aware of any research that performs this type of
higher dimensional analysis with regard to usability testing?
4.
Rajiv
Jain: What makes something a usability problem? According to the paper, it seems usability problems have something to do with the amount of time a task takes. However, if I were to use that measurement, a usable device such as a cool PHONE would be full of problems given the time it would take my parents to use it.
5.
Shivsubramani Krishnamoorthy: From the paper and from other sources, I understand that the formula 1 - (1 - p)^n (where p is the probability of finding a fault and n is the number of users) assumes that the faults are equally detectable. It considers only the number of users but not various critical variables such as the type of the application, the size of the application, the type of test user and their experience, etc. These may affect the context, which may have various critical problems/faults associated with it. Could it be that once these variables are incorporated into the equation, they would disprove Jakob Nielsen's claim (especially for software usability testing)?
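For reference, the formula under discussion is Found(n) = 1 - (1 - L)^n, the expected proportion of problems found by n users when every problem is detected by a user with the same probability L; the widely cited "five users" figure comes from plugging in the average L of roughly 0.31 reported by Nielsen and Landauer. A quick check of that arithmetic:

# Proportion of usability problems found by n users under the single-L model.
# L = 0.31 is the average detection probability commonly cited from Nielsen & Landauer.
def found(n, L=0.31):
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 15):
    print(n, round(found(n), 2))
# Five users already find about 0.84 of the problems under this model, which rests on
# exactly the equal-detectability assumption the question challenges.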
6.
Srividya Ramaswamy: The paper concludes saying that the role of number of users in a
usability test is not as significant as the important role of task coverage.
However, I think that these two factors are related. Could you comment on that?
7.
Bryan
Robbins: A number of the authors' conclusions and suggestions for future research in the area involve improving the effectiveness of both usability testers and user tasks. A large body of work on the topic of finding problems in requirements and design artifacts suggests that individual effectiveness often dominates performance. Given the authors' findings, would you expect the same to be true for usability testing? How could we fairly test this hypothesis?
8.
Praveen
Vaddadi: The paper claims that user task coverage is more important than the number of participants in predicting the proportion of problems detected. However, do you think it misses an important factor: the type of participant?
9.
Subodh Bakshi: Is there a way to define the number of user tasks so as to improve the proportion of problems found?
10.
Samuel
Huang: The model used to address the likelihood of discovering a fault/problem uses a ``noisy-or''-like approach that assumes independence between faults. Given that this is probably not always true, can we find or otherwise determine more structured relationships between faults? Can we then capitalize on this structure to better predict certain ``types'' of faults?
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: With Found(i) = N(1 - (1 - λ)^i), one of the concerns raised in class was that λ is not constant for all faults. Instead of a constant, maybe we could substitute for λ a random variable that follows the distribution of detection rates (how many users detect a particular fault) across faults. Do you think there is a common distribution for λ that could be used to generate a new "5 users"-like rule, or do you think λ varies too much from program to program?
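One quick way to see the effect the question describes is to simulate it: draw a per-fault detection probability λ from some distribution (a Beta(2, 5) is used below purely as an assumed example) and compare the expected number of faults found to the constant-λ curve with the same mean.

# Constant λ versus per-fault λ drawn from a distribution (Beta(2, 5), mean ~0.29).
import random

random.seed(0)
N = 1000                                    # number of faults
lams = [random.betavariate(2, 5) for _ in range(N)]

def found_constant(i, lam=0.29):
    return N * (1 - (1 - lam) ** i)

def found_heterogeneous(i):
    return sum(1 - (1 - lam) ** i for lam in lams)

for i in (1, 5, 15):
    print(i, round(found_constant(i)), round(found_heterogeneous(i)))
# For i > 1 the heterogeneous total lags the constant-λ prediction: the easy faults are
# found quickly, but the low-λ faults linger, which is why a single "5 users" rule
# derived from one constant λ is suspect.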
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: What is the influence of this paper? Has it been successful in changing people's focus from the number of users to users' tasks?
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell
2.
Amanda
Crowell: Do you agree with the assumption
that only the first fault found is of immediate concern?
Answer
Outline: The answer to this depends on the goal of testing. I agree that, for actual software development and debugging, any faults found after the first could just be a result of the first rather than some additional fault in the software.
3.
Christopher
Hayden: The paper discusses the fact that
adaptive random testing achieves an F-measure that is very close to
optimal. However, no other testing
techniques are explicitly discussed.
It seems that test suites developed using various criteria, such as line
coverage or branch coverage, should correlate in some way with input space
coverage and should therefore have empirically measurable F-measures. How do test suites developed using such
criteria compare with adaptive random testing in their ability to approach the
optimal F-measure?
Answer
Outline:
4.
Rajiv
Jain: My understanding is that the input
domain can often be infinite and multi-dimensional. How would this algorithm
apply in these cases?
Answer
Outline: In these cases, it might not be
possible to explore the input domain using no knowledge of the software under
test. Using white-box techniques
it could be possible to try and determine the extremes of the input domain
without having to do a full analysis of the code to determine exactly all of
the critical values that produce different coverage.
5.
Shivsubramani Krishnamoorthy: Is the F-measure the right criterion to measure the effectiveness of a test strategy? It is possible that for one strategy the first fault is revealed late but many faults are revealed in quick succession after that, while another strategy may find the first bug with fewer test cases but find only a few bugs altogether. So would it be better to consider the F-measure for the first `n' faults instead of just the first fault (i.e., the number of test cases needed to reveal the first `n' faults)?
Answer
Outline: Research suggests that real faults tend to be clustered in the input domain. Given this knowledge, it seems likely that a test case that reveals one fault will reveal others, or that, given some test case that has been found to reveal one fault, other faults can be found by making only small changes in the input space. These reasons support the F-measure as a useful criterion of effectiveness.
6.
Srividya Ramaswamy: Currently, we are not aware of any algorithm to determine an optimal
test pattern for an arbitrary shape. We know that any arbitrary pattern can be
geometrically covered by a continuous set of squares (high school Mathematics).
Thus, can we find the F-measure of each of those squares and then find a way of
approximating that result to the F-measure of the arbitrary pattern?
Answer
Outline:
7.
Bryan
Robbins: The authors define an
upper-limit relative to the F-measure of random testing with replacement. Is
this a useful reference point for practicing testers? Why or why not?
Answer
Outline: Obviously, testing strategies can be implemented that perform worse than the upper limit. I think that the value of this upper limit is in evaluating testing strategies. For a given strategy that has been examined, what is learned about it can be used by practicing testers to help them establish a level of confidence in their testing results.
8.
Praveen
Vaddadi: The FSCS-ART strategy seems to work well, reducing the number of test cases necessary to detect the first failure by up to 50%. How do you think this strategy would do in higher dimensions, since in real testing the input domains are usually far from being one- or two-dimensional? I also gather that in higher dimensions, failure patterns tend to be located around the center of the input domain. What do you think would be the best approach, with regard to the FSCS method, to solve this dimensionality problem?
Answer
Outline: I think we saw earlier in the semester that even for applications that seem to have a vast number of combinations of parameters/options, this number can be greatly reduced by examining which combinations are actually allowed by the software. I think that the problem of multi-dimensional input domains is not unique to FSCS-ART and is a problem that affects any testing strategy.
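To make the strategy under discussion concrete, here is a minimal sketch of fixed-size-candidate-set ART (FSCS-ART) on a [0,1]^d input domain: each new test is the candidate, out of k random ones, whose nearest previously executed test is farthest away. This is only an illustration of the selection rule, not the authors' implementation; the failure region below is invented.

# Minimal FSCS-ART sketch with a toy square failure region.
import math
import random

def fscs_art_next(executed, k=10, dim=2):
    """Pick, from k random candidates in [0,1]^dim, the one farthest from all executed tests."""
    candidates = [[random.random() for _ in range(dim)] for _ in range(k)]
    if not executed:
        return candidates[0]
    def nearest_dist(c):
        return min(math.dist(c, e) for e in executed)
    return max(candidates, key=nearest_dist)

def fails(t):                      # invented failure region of area 0.0025
    return 0.40 <= t[0] <= 0.45 and 0.40 <= t[1] <= 0.45

random.seed(1)
executed = []
while True:
    t = fscs_art_next(executed)
    executed.append(t)
    if fails(t):
        break
print("F-measure for this run:", len(executed))

Raising dim in the sketch illustrates part of the issue the question raises: as the dimension grows, distances between random points concentrate, so the "far from everything executed" criterion discriminates less and less between candidates.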
9.
Subodh Bakshi: I agree this paper deals with testing effectiveness, but how does defining an exclusion region justify correctness? That is, once the exclusion region is defined, why do we (in a way) prevent that failure region from being overlapped by another test case?
Answer
Outline:
10.
Samuel
Huang: The tessellation discussion seems
interesting, but it also seems limited in the types of unbounded regions that
can be represented. A quadratic
relationship, for example, seems problematic. Is there a class of shapes that we could formally describe
to be unrepresentable?
Answer
Outline:
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: Why do
the authors assume the failure region is contiguous? They claim later that all
the theorems presented in the paper work when the regions are not contiguous,
but this is hard to believe without any evidence.
Answer
Outline:
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions
posed by students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell: The authors state that they use the Levenshtein distance in determining the distance between string parameters on an object. Considering how strings are typically used in software, does the use of this measure really make sense? Should any weight be given to it when comparing objects?
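For reference, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other, which is why its relevance to how programs actually use strings is debatable. A standard dynamic-programming sketch:

# Standard dynamic-programming Levenshtein (edit) distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("filename.txt", "filename.tmp"))  # 2
print(levenshtein("OK", "ERROR"))                   # 4, even though the strings are
                                                    # semantically unrelated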
2.
Amanda
Crowell: In the ARTOO paper, the researchers stated that their object distance calculation only takes into account the syntactic form of the objects and does not account for their semantics. They suggest, as an optimization idea, that semantics should also be considered. Should this really be an optimization rather than an integral part of ARTOO?
3.
Christopher
Hayden: In the paper on ARTOO, the
authors specifically address the issue of boundary values for primitive types
by including a parameter that allows the test generation to probabilistically
select from a set of predefined values suspected to be likely to cause
errors. However, it is also
possible for more complex objects to have boundary "states" likely to
produce errors, yet the paper makes no provision for such a possibility. In many cases, these states may be very
unlikely to occur through random manipulation of the object and so are unlikely
to be reached by ARTOO. Do you believe
there is a case to be made for including a predefined pool of objects that can
be sampled probabilistically, much like there is for primitive types?
4.
Rajiv
Jain: Given the Pros and Cons of
pseudo-random testing techniques shown in these papers, when would you choose
to use random testing as opposed to the standard techniques we have learned in
class? When wouldn’t you?
Answer
Outline: I personally feel there isn't a clearly defined criterion for whether or not to go with RT. I believe it is up to the tester to decide based on a few aspects. In addition to this paper's mention of the general advantages of RT, I found a paper suggesting that RT is preferred when there is no clear idea about the properties of the input parameters; the best approach then, in terms of time and effort, would definitely be RT. Another paper, “An Evaluation of Pseudo-Random Testing for Detecting Real Defects”, concludes that RT is good when the faults are easily detectable, in which case the expected test length is low. But it isn't as simple as that when the faults have low detectability. So, the tester needs to make a decision based on what kind of program/application he is testing.
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy: For the "full distance definition" the values of all the
constants are taken as 1 except for alpha and R which are 1/2 and 0.1
respectively. On what basis are the values for these constants determined?
Answer
Outline: The authors have used 1 as the value for most of the constants, maybe because they did not want to bias any of the components in the equation. But 'alpha' is probably given a value of 0.5 to reduce the effect of that component of the equation when an object has a large number of fields. The authors mention in the previous paragraph that they average in the field distance value to handle situations where objects have a large number of fields.
7.
Bryan
Robbins: In the 2007 paper, the authors
fit the curve of bugs found over time to a function f(x)
= a/x + b. This curve is potentially useful in informing a timeout for random
test execution, but the values of a and b vary widely
depending on the application under test. What would a tester have to do in
order to utilize the curve as a model?
Answer
Outline: The parameters a and b would definitely vary based on the size of nloc and similar attributes. But I believe they may not impact the process here, because although they vary across programs, they remain constant within a program. And in this process of finding the fault rate, we would focus on just one particular program, which would keep a and b constant throughout. Thus the focus would be on analyzing the values of f(x) for a series of values of x (x, x-1, x-2, ..., x-n). The values of a and b can play a role in deciding upon the value of n here.
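A minimal sketch of what "utilizing the curve" might look like: fit f(x) = a/x + b to bug-discovery data from an initial testing window and use the fitted curve to choose a timeout. Since the model is linear in a and b, ordinary least squares suffices; the observations below are invented.

# Fit bugs-per-hour observations (invented) to f(x) = a/x + b by least squares.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
bugs_per_hour = np.array([9.5, 5.2, 3.9, 3.0, 2.6, 2.4, 2.1, 2.0])

X = np.column_stack([1.0 / hours, np.ones_like(hours)])   # columns for a/x and b
(a, b), *_ = np.linalg.lstsq(X, bugs_per_hour, rcond=None)
print(f"a = {a:.2f}, b = {b:.2f}")

# One possible stopping rule: stop random testing once the predicted rate is
# within 10% of the asymptote b.
predict = lambda x: a / x + b
timeout = next(x for x in range(1, 500) if predict(x) <= 1.1 * b)
print("suggested timeout (hours):", timeout)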
8.
Praveen
Vaddadi: The criteria the paper (ARTOO: adaptive random testing for object-oriented software) used to compare the RT and ART techniques were the F-measure and the time required to reveal the first fault. Do you think this is the right metric to use, given that there is a good chance that either technique could detect a bunch of faults even if it finds them late? I feel a metric based on the number of faults detected in a stipulated amount of time could shed better light when comparing the two techniques.
Answer
Outline: It is quite intuitive to ask this question; I had the same question while reading the paper. I believe one simple reason why the authors chose this criterion could be that it is widely accepted and used in many papers describing software-testing experiments. But I feel a better criterion would be to measure the time/test cases required to reveal 'n' faults (instead of just 1). In that case, the problem you stated would be handled better.
9.
Subodh Bakshi
10.
Samuel
Huang: One estimate given for running a
particular series of experiments assumes we brute force search over all
combinations of input parameters, which blows up exponentially in the input
size. Could we use some sort of a
guided search method automatically to reduce this? As an example, maybe we could perform some sort of a guided
random walk through the parameter space?
Answer
Outline: The brute-force approach is certainly not the best one. But the basic idea of ART itself filters out a lot of inputs (though the selection process is time consuming). Another approach the authors specify, for another reason, is clustering the inputs. This could also be a good way to filter the inputs, such that one input (or a specified few) selected from a cluster automatically filters out the other inputs. Thus, you avoid searching through a major part of the input space.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: How do
you deal with complex method contracts (preconditions)? For example there are two input arrays
that must have exactly the same length.
There may be something wrong with another class that would cause the “method
under test” to be run on invalid input.
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis: They proposed to use ARTOO when the number of bugs found by RT stops growing quickly, but don't all bug-finding techniques have a growth rate that starts decreasing over time (when there are not too many more bugs left to be found)?
Answer Outline:
Through the experiments, the authors showed that fewer test cases would be required to find a bug using ARTOO, since the input objects are more evenly distributed. You are right in pointing out that the authors haven't proved that ARTOO's rate of finding faults does not decrease over time. But we need to understand that the usual RT performance decreases over time, since RT exhausts all the possible faults in the narrow area it covers. So the authors suggest first finding all these easily findable faults using RT (which does so faster) and then using ART to find the faults distributed elsewhere with minimum cost (in terms of required input objects). The rate of finding faults (which indicates how fast you find faults) has in any case gone down with RT, so the focus shifts to finding faults with a minimum number of object inputs.
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell
2.
Amanda
Crowell: The paper states that there are
disadvantages, discussed in another paper, to using live data to test databases
but doesn't list exactly what these are. It seems, however, that there are also
some advantages to using live data that were not discussed at all. Wouldn't the
AGENDA method be a good way to actively monitor your database to discover
problems as/if they occur?
Answer
Outline: The paper alludes to another
previous paper in this line to discuss various disadvantages of using live
data. The live data may not reflect a sufficiently wide variety of possible
situations that could occur. That is, testing would be limited to situations
that could occur given the current DB state. Even if live data encompasses a
rich variety of interesting situations, it might be difficult to find them,
especially in a large DB, and difficult to identify appropriate user inputs to
exercise them and to determine appropriate outputs. Also, testing with data
that is currently in use may be dangerous since running tests may corrupt the
DB so that it no longer accurately reflects the real world. Finally, there may
be security and privacy constraints that prevent the application tester from
seeing the live data.
3.
Christopher
Hayden: In the paper, the authors use the category-partition method to populate the test database. They dismiss the possibility of random testing, arguing that generation of valid inputs is likely to be inefficient and that the inputs should be specifically selected so as to maximize the likelihood of errors. There are problems with these arguments. First, constraints could be placed on the random generation of data such that only valid inputs are produced. Second, several empirical studies have shown, surprisingly, that the category-partition method is generally inferior to random testing. Third, the process of generating random data can be augmented so that values deemed likely to cause errors are included. In light of these facts, can you provide a further explanation of the disadvantages of applying random testing to databases, or do you believe that the authors may have dismissed the possibility too lightly?
Answer
Outline:
4.
Rajiv
Jain: There seems to be no empirical
evidence presented in this paper about Agenda’s fault detection effectiveness,
so why should we use this tool to test database applications?
Answer
Outline: I agree that the paper describing
the AGENDA framework did not discuss its effectiveness and performance issues
in detail. However, the same authors have written a second paper on AGENDA
(http://portal.acm.org/citation.cfm?id=1062455.1062486) which mainly talks
about how various issues involving the limitations of the initial AGENDA are
resolved. In the end, they present an empirical evaluation of the new improved
AGENDA. They ran it on transactions from three database applications with
seeded faults and report that about 52% of the faults were detected by AGENDA
(i.e., 35 out of 67 faults). But this empirical study too has its own limitations and threats to validity, such as the fact that the faults were seeded by graduate students who were well versed in AGENDA. Nevertheless, a study of its effectiveness and fault detection abilities has been reported in the subsequent paper.
5.
Shivsubramani Krishnamoorthy: The authors, in the paper, describe the data generation for the database. I was wondering why the already existing data was not used. I don't recall the authors specifying anything about it. Why go to the trouble of generating the data rather than using the live data?
Answer
Outline: There could be many issues with using live data for testing database applications. Live data may not reflect a sufficiently wide variety of situations, and it may be difficult to find the situations of interest. Also, live data may violate some of the privacy and security constraints required by the database system.
6.
Srividya Ramaswamy: The paper mentions that the AGENDA prototype is limited to applications
consisting of a single SQL query. Could you discuss a few changes that would
have to be made if the prototype were to be extended to applications consisting
of multiple queries?
Answer
Outline: Extending the system to handle multiple queries would most probably keep the same underlying framework, with subtle differences. For example, one potential problem with having multiple queries (and in turn multiple host variables h1, h2, h3, ..., hn) is that instantiating the host variables independently will create queries whose WHERE clauses are not satisfied by any tuples in the database. We would like most of the test cases to cause the WHERE clauses in the queries to be satisfied by some tuples in the database, exemplifying more typical application behaviour. Similar problems occur with UPDATE, INSERT and DELETE queries. This issue can be handled in the extended AGENDA by distinguishing between type A test cases (which cause the WHERE clause to be true for some tuples) and type B test cases (all others). Heuristics can guide the extended AGENDA to select type A test cases by taking account of dependences between the different host variables that are instantiated in a test case. In a similar way, the extended AGENDA can have transition consistency checks and integrity constraint checks, which are a scaled-up version of the proposed AGENDA framework handling multiple tables and queries.
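A tiny illustration of the WHERE-clause problem described above, using an in-memory SQLite table (schema, data, and host-variable values are all invented): independently chosen host variables can easily produce a "type B" test case whose WHERE clause matches no tuples.

# Type A vs. type B test cases for a parameterized query; everything here is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("ann", "cs", 50000), ("bob", "ee", 60000)])

query = "SELECT name FROM emp WHERE dept = ? AND salary > ?"

type_b = ("cs", 70000)   # independently chosen host variables: no tuple satisfies the WHERE clause
type_a = ("cs", 40000)   # at least one tuple satisfies it

for label, params in (("type B", type_b), ("type A", type_a)):
    print(label, "->", conn.execute(query, params).fetchall())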
7.
Bryan
Robbins: One difficulty the authors
encountered was the inability to fully describe the state of the database. They
addressed this by having developers provide preconditions and postconditions, but this introduces more work and
potentially more faults. Could an approach like GUITAR's
Ripper be applied to the database to capture the runtime DB state? What are the
critical differences between DB state and GUI state?
Answer
Outline: AGENDA prompts the tester to input the database consistency constraints in the form of post- and pre-conditions, which are typically boolean SQL expressions. These boolean expressions are translated into constraints that are automatically enforced by the DBMS, by creating temporary tables to deal with joining relevant attributes from different tables and by replacing calls to aggregation functions with single attributes representing the aggregate returned. It can also be argued that GUI testing is a "type" of database testing: if we consider the case of Event-Flow Graphs in GUITAR, the graph can be represented as a database table with relevant constraints and pointer functionality, with each tuple being a mapping from event to event. Thus, apart from the state consistency checks in a database system, I don't really think that GUI testing is different from database testing.
8.
Praveen
Vaddadi
9.
Subodh Bakshi
10.
Samuel
Huang: The fact that the state generation
starts with the tester's sample-values files seems like an interesting
concept. Perhaps something more
formal could be done that explicitly adjusts the probability mass of inserting
specific tuples, to allow for some theoretical
analysis as to what types of tuples will appear in
the populated database.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: Previously,
for GUI systems we were told that writing operators for everything you wanted
to test in a system was too much effort.
Why is this any different for a program that uses a database system?
Answer
Outline:
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis: They used the
contracts and exceptions as oracle, did they consider non-error exceptions as a
bug?
Answer Outline:
AGENDA's approach to generate and execute test
oracles was based on integrity constraints, specifically state constraints and transition
constraints. A state constraint is a predicate over database states and is
intended to be true in all legitimate database states. A transition constraint
is a constraint involving the database states before and after execution of a
transaction. Thus there was no need to check for non-error exceptions in this
scenario.
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell: None of the presented technique
seems to really get out ahead of the others as clearly better; however, what do
you see as the biggest advantage that user session data provides?? Can this
advantage be applied to testing in general (assuming costs are not a concern)?
Answer
Outline: The basic contribution of User
Session Testing is sampling of the input space that approaches or beats the
performance of more rigorous techniques in an automated fashion. The approach
taken by the authors could be used with testing any system that requires user input,
but web servers provide an easy way to observe and capture user interactions
with server logs or packet captures that may not be present with other
software. It isn’t clear that this technique is required outside of Web or GUI
applications because traditional white box testing techniques probably can be
performed in an automated fashion with better results.
2.
Amanda
Crowell: You discussed several issues
with the way the researchers performed the experiment? If you could suggest
changes they should make to make the experiment better, what would they be?
Answer
Outline:
3.
Christopher
Hayden: In class we discussed in some
detail the similarities between web applications and GUI applications. The paper proposed a variety of ways of
using user sessions to create test cases for web applications. One way of generating test cases for a
GUI application is to have a tester create a session and record it for later playback,
but to the best of my knowledge no one has suggested using actual user sessions
for GUI testing. Do you feel that
there's value in extending this technique to standard GUI applications, or does
it rely on web-specific characteristics such as high user volume and a
client-server infrastructure? Can
you suggest ways by which any such constraints can be overcome? Alternatively, does the lack of
attention to this idea in GUI testing suggest anything negative about its
perceived effectiveness?
Answer
Outline: The ideas behind user session testing can be applied to any software, but the server architecture certainly makes implementation much easier in web applications. There are already tools collecting user sessions for GUIs, such as toolbars in web browsers and the Usage Data Collector in Eclipse. I would even venture to guess that the crash-reporting tools becoming popular in commercial GUIs collect the tool usage immediately prior to a crash, similar to a test case in a capture/replay tool. I therefore believe it would not be too much work to extend this idea to the realm of GUIs. Unlike with web servers, there is no protocol like HTTP governing interaction with a GUI program, so the implementation of collecting user sessions could become application-specific, limiting its widespread usefulness. I believe the reason that this has not been observed with GUI testing is that a better automated technique already exists. The test cases produced by user session testing in GUIs would be a subset of the test cases that could be generated by the GUITAR framework. The next step for user session testing would be to develop a system similar to GUITAR for web testing.
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy: As I had mentioned in class, all I thought made a web application different from a normal application is its real-time usage. When reading the paper, I was expecting the authors to stress real-time data. Do you think the data recorded from the user sessions plays the role of real-time data? I was not able to visualize how. Do you think the experiment is incomplete in this sense?
Answer
Outline:
6.
Srividya Ramaswamy: The paper mentions that during the Test Suite Creation for the
experiment the users were asked to choose two of the four given courses. I am
sure that the percentage of users choosing a particular course has a huge
impact on the session data generated. However, the paper does not talk about the
percentage of students who chose each of the courses. Can you comment on this?
Answer
Outline:
7.
Bryan
Robbins: The authors discuss that the
cost of capturing additional user sessions is significant, and suggest a number
of filtering and combining techniques for making the most out of the data
given. There exist a number of techniques (e.g. Expert systems, cognitive
architectures, etc.) that are capable of modeling tasks under constraints
similar to those of human cognition. While these would be somewhat imperfect,
is there a reason to think that task models would be capable of generating
session data, or are there aspects of users that make them irreplaceable?
Answer
Outline: Using humans will always represent the subset of realistic input better than any other representation, but close representations could certainly be modeled through smart web crawlers. Both approaches are automated from the tester’s viewpoint, so why not use both if user-session test cases are sparse? The better approach of the two is debatable, but the larger issue is whether an emphasis on realistic user sessions is desired. Either approach would likely result in oversampling of certain test cases while entirely missing certain faults. A better approach, if possible, would be to find a way to automatically generate a graph that models the web application, similar to Ricca and Tonella. One could then traverse the graph with a number of algorithms to generate more diverse test cases, and then traverse all edges or all nodes in addition to using the user sessions.
8.
Praveen
Vaddadi: Testing the security aspect is critical for web testing. Clearly, this involves testing the security of the infrastructure hosting the web application and testing for vulnerabilities of the web application itself. Firewalls and port scans can be considered for testing the infrastructure; however, how do we test for the vulnerabilities of the web application?
Answer
Outline:
9.
Subodh Bakshi
10.
Samuel
Huang: The graph representation proposed
by Ricca and Tonella for
modeling web applications seems appealing. The notation used to describe path expressions on these
graphs (such as ``(e1e3 + e4e5) * (e1e2 + e1e3 + e4e5)'') looks reminiscent of
Pi-Calculus/CCS as proposed by Milner in the 1980s. In that work, bisimilarity between
nodes was used as a notion of equivalence. Could we use the notion of bisimilarity
in generating test cases that go beyond independent paths, and also avoid ``bisimilar'' edges?
Answer
Outline: I am not sure that it is desirable to avoid bisimilar edges in web applications, because even small changes in the state of the program can result in the execution of different underlying code. For example, while e1e3 and e4e5 could be considered bisimilar, different portions of the Perl code are likely to check the two different form variables. It also seems as though the cost savings over linearly independent paths would be insufficient to warrant their exclusion.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: Seven groups are designed for the experiments: WB-1, WB-2, US-1, US-2, US-3, HYB-1, and HYB-2. Since the test suite size varies, can we simply say that US-3 and WB-2 are better than the others? Also, I would think HYB-1 and HYB-2 should perform better than the others, since they combine white-box testing and user-session techniques, but the results turned out otherwise; what do you think the potential reasons might be?
Answer
Outline:
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis: When the authors did
the user session slicing and combining did they use any heuristic to choose the
mixing point?
Answer Outline:
Questions posed by
students:
1.
Nathaniel
Crowell: Generally, how can
capture-recapture be applied to GUI testing? How do various components of the
system translate to software testing and more specifically to testing of GUIs?
2.
Amanda
Crowell: Could using virtual inspections
have biased the results regarding the appropriate number of inspectors to use?
3.
Christopher
Hayden: We saw in the paper that all of
the estimators underestimate the total number of faults and that they all have
a large variance. We also
discussed the fact that there are seemingly contradictory results, with one
paper recommending the Jackknife estimator and another indicating that the
Jackknife estimator does quite poorly.
Given that in actual software development there are insufficient
resources to perform an experiment to determine the best estimator to apply to
a given project, would there be any value in applying all of the estimators and
merging their results in some way?
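For reference on the estimator being debated: with k inspectors, the standard first-order jackknife estimate adds to the number of distinct defects observed a correction driven by the defects seen by exactly one inspector, N ≈ D + ((k - 1)/k) * f1. A minimal sketch (the per-inspector defect sets are invented):

# First-order jackknife estimate of total defects from k inspection reports.
from collections import Counter

reports = [            # defects reported by each inspector (hypothetical)
    {"D1", "D2", "D3", "D5"},
    {"D1", "D2", "D4"},
    {"D2", "D3", "D6"},
]

k = len(reports)
counts = Counter(d for r in reports for d in r)
D = len(counts)                                   # distinct defects observed
f1 = sum(1 for c in counts.values() if c == 1)    # defects found by exactly one inspector
estimate = D + (k - 1) / k * f1
print(f"observed = {D}, singletons = {f1}, jackknife estimate = {estimate:.1f}")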
4.
Rajiv
Jain: Do you think that capture-recapture is a more valid measurement of faults than using simpler measures such as line count or code complexity to compute the ratio of errors in the code?
5.
Shivsubramani Krishnamoorthy: It is mentioned in the paper (section 2.2) that defects captured by more than one inspector will usually have a higher probability of detection, which sounds quite obvious. The authors claim that the estimators depend on the order of trapping occasions. Do you think this holds in the inspection context? How do you think the order of trapping influences the estimators?
6.
Srividya Ramaswamy: The paper talks about the models Mh and Mt. In Mh we assume that every defect j has a probability pj of being detected, which is the same for every inspector. In Mt every inspector has a probability pi of detecting every defect. I understand that the paper deals with documents. However, if these models were applied to software testing, do you think the estimators using these models should assume anything regarding code coverage? When two inspectors (testers) use overlapping test cases, their detection capabilities are bound to be related, aren't they?
7.
Bryan
Robbins:
8.
Praveen
Vaddadi: The paper tries to address the impact of the number of faults on the CR estimators. However, I did not find a direct solution to this issue in the paper. Do you think a consensus can be reached, using the experiment described in the paper, on the minimum number of faults that have to be present in a document before the CR estimators can be used?
9.
Subodh Bakshi
10.
Samuel
Huang: In the discussion section of the
paper, the authors comment that ``failure rates for low numbers of inspectors
were high, and therefore several models should be used to prevent failure to
obtain an estimate.'' Does this
mean that the authors want to aggregate different sets of few observers in some
meta-sense, or raise the number of inspectors? (or
something else?)
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie: Different inspectors are still likely to find many of the same defects, even if they are not colluding, simply because they use similar techniques to look for defects. Do you think that having the inspectors find defects in a document with a known number of defects, and filtering out inspectors that overlap too much, would help improve the estimation of defects in general with the remaining inspectors?
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: Based on a large number of experiments and analyses, the authors recommended using a model that takes into account that defects have different probabilities of being detected, together with the Jackknife estimator. I noticed that this paper was published in 2000, so I wonder whether these recommendations have since been applied in real practice: are they useful in industry? Are they accurate?
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell: Are there any novel
components/techniques implemented by RUGRAT that make it stand out in
comparison to other forms of fault injection?
Answer
Outline: Based on the related work
presented in the paper, I think the main contributions of the RUGRAT
method for testing involve its level of automation, level of tester workload,
and lack of changes made to code.
Compared to the other approaches listed in the paper, RUGRAT seems to be
the most automated. Other
approaches require the tester to write stub functions or wrappers, thereby
increasing the workload of the tester.
In RUGRAT, the source code is analyzed and the tester only has to
provide the locations he/she wants tested, the types of tests to create, and
the oracles. RUGRAT does
everything else. Also, other
approaches require changes to the source code, which in my opinion means only
bad things: why would I want to modify the source that I'm trying to test? So RUGRAT has several advantages
compared to the approaches mentioned in the paper, but I'm unsure of its
comparison to any approaches that have happened since.
2.
Amanda
Crowell:
3.
Christopher
Hayden: The paper mentioned that RUGRAT is capable of identifying sites where indirect function calls are made, but that it is not capable of distinguishing between calls to different functions from that site. In other words, if there are two functions that can possibly be called indirectly from a site, RUGRAT has no way of distinguishing which one is called. The paper also mentioned that the authors are working on addressing this issue. It seems like it would be very difficult to distinguish between indirectly called functions at compile time without extensive and complicated static analysis. Do you have any sense of how they are going about accomplishing this?
Answer
Outline:
4.
Rajiv
Jain: RUGRAT seems like it could be a
very powerful tool. Where else do you think RUGRAT could be applied outside of
resource and function pointer errors?
Answer
Outline: The authors mention that they had
a previous instantiation of RUGRAT that tested mechanisms that protect against
stack "smashing" attacks.
Stack attacks are, like resource and function pointer attacks, very
difficult to do and simulate.
Having something like RUGRAT makes simulating these attacks easier. I think the authors present RUGRAT as a
method of test case generation, and are systematically going through various
instantiations that approach different targets to demonstrate how the method
can be used. They give a (brief)
description of how RUGRAT is changed from instantiation to instantiation and
list the following as possible future uses of RUGRAT:
- Testing security policy enforcement techniques that restrict an application's access to certain resources
- Simulating an application that has been taken over by an attacker
- An implementation for Java, handling the specifics of language differences and exception-raising semantics
5.
Shivsubramani Krishnamoorthy: The main approach of the paper is using
dynamic compilers for the testing. With the example mentioned in the paper, I
get the feeling that the situations are more or less some form of exceptions,
which need not be simulated dynamically. They can very well be predefined. The
author has used the same example of memory availability in a couple of his talks too. Do you think this is a case of selecting a particularly suitable example, or is a dynamic compiler really not required?
Answer
Outline: I'm not entirely sure what you mean by "predefined"; however, I assume you mean that the tester can predefine the errors that occur in their tests. The reason for using the dynamic compilers is that (1) the tester doesn't have to know ALL the locations where the error can occur and (2) they don't have to create errors for all these locations. The dynamic compiler will locate all the places a certain error can occur and then automatically create the errors, which significantly reduces the effort the tester must put forth as well as any bias the tester might have.
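By way of contrast with the dynamic-compiler approach described above, this is roughly what the "predefined" style looks like in Python (the function names are hypothetical): the tester must pick a known failure point and force it to fail, e.g. with unittest.mock, which is exactly the per-location effort that compiler-based instrumentation avoids.

# "Predefined" error injection at one known call site; names are hypothetical.
from unittest import mock

def load_image(path):
    with open(path, "rb") as f:     # the one failure point the tester knows about
        return f.read()

def safe_load(path):
    try:
        return load_image(path)
    except OSError:
        return None                 # expected error-handling behavior

def test_open_failure_is_handled():
    with mock.patch("builtins.open", side_effect=OSError("injected fault")):
        assert safe_load("whatever.png") is None

test_open_failure_is_handled()
print("injected-fault test passed")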
6.
Srividya Ramaswamy
7.
Bryan
Robbins: The inputs used by the authors
for code coverage analysis were either "distributed with the program or
developed by the authors to provide adequate coverage for our analysis."
However, in the case of subject applications from the SPEC and MiBench benchmarks, these "original inputs" were
designed for performance evaluation and not test coverage. Is comparing their
coverage of EH code to RUGRAT's coverage still valid?
Answer
Outline:
8.
Praveen
Vaddadi: With regard to the "malloc" example in the paper, how different is RUGRAT's simulation of "out-of-memory" from a buffer overflow scenario caused by the user himself (this could be automated too)? That is, if we use a "gets" function (which stores the user's input into a buffer and continues storing until end of line or end of file is encountered), which can take any amount of user input data and thus cause a buffer overflow if the data is too large, will that be different from RUGRAT's simulation (setting "errno" to "ENOMEM")?
Answer
Outline:
9.
Subodh Bakshi
10.
Samuel
Huang: The observation made by the authors
regarding the ``DynamoRIO dynamic compiler not
considering any contextual information'' is interesting. If a tool were available that *did*
consider such information, could we hope to do better, perhaps specifically for
the pointer protection mechanism discussed? Perhaps some sort of information/data flow analysis could be
combined to further improve performance?
Answer
Outline: This question is similar to Chris' question, so I decided to address both here. First of all, I think that including contextual information would definitely be more complex and would almost certainly require more intensive static analysis. The authors implied that there are dynamic compilers that incorporate contextual information, but I do not know how these work or what effect they would have on RUGRAT's functionality. With regard to performance, I think the effectiveness of having contextual information depends on the tester's goals and the types of functions under test. If you were testing functions that operate differently based on the call environment (i.e., the path that led to them being called and the state of memory), then contextual information could be very useful and show problems that wouldn't be uncovered without the context. But if the functions under test don't really depend on the environment, then the context may not matter. Also, if you add in contextual information, then the time required to analyze the results of testing increases greatly (you'd have a result for every time a function was called). So it really depends…
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: RUGRAT requires no modification to the source or binary program. Among techniques for generating test cases, what are the disadvantages of those requiring no modification to the source compared with those requiring modifications? Will the former miss any information in the source code?
Answer
Outline:
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
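The answer to question 5 above describes how a dynamic compiler locates every place where a certain error can occur and injects the error there automatically. As a rough illustration of that idea only (RUGRAT itself rewrites C/C++ binaries at run time via DynamoRIO, which this does not show), here is a small Java sketch that forces the failure path of a resource-acquiring call one call site at a time; the Allocator interface, the failAt helper, and the two allocation sites are all invented for this example.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: force the "allocation failed" path at each call site in
// turn, the way a dynamic compiler could do automatically for rarely exercised
// error-handling code. All names below are invented.
public class ErrorInjectionSketch {

    // Stands in for any resource-acquiring call whose failure we want to simulate.
    interface Allocator {
        byte[] allocate(int size);           // returns null to signal "out of memory"
    }

    // Fails only the k-th allocation, so each error-handling site is exercised in turn.
    static Allocator failAt(final int k) {
        return new Allocator() {
            private int count = 0;
            public byte[] allocate(int size) {
                return (++count == k) ? null : new byte[size];
            }
        };
    }

    // Application code under test: each allocate() call below is one error location.
    static List<String> runApplication(Allocator alloc) {
        List<String> log = new ArrayList<String>();
        byte[] header = alloc.allocate(64);
        log.add(header == null ? "header: handled failure" : "header: ok");
        byte[] payload = alloc.allocate(1 << 20);
        log.add(payload == null ? "payload: handled failure" : "payload: ok");
        return log;
    }

    public static void main(String[] args) {
        System.out.println("normal run     : " + runApplication(failAt(0)));   // k=0 never fails
        for (int site = 1; site <= 2; site++) {
            System.out.println("fault at site " + site + ": " + runApplication(failAt(site)));
        }
    }
}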
Questions posed by
students:
1.
Nathaniel
Crowell: As is the case for most forms of testing, how do we know when enough testing has been done with software fault injection? Do any of the relevant papers provide some analysis of the costs (monetary and time) of testing with software fault injection?
2.
Amanda
Crowell: What kinds of processor features are used to inject faults using Xception? What sort of techniques can be applied for processors without these features?
3.
Christopher
Hayden: In the presentation, you
mentioned the use of hardware emulation coupled with fault injection to
simulate hardware errors.
Obviously emulation is not perfect or there would be no need to perform
fault injection using actual hardware.
How well do the results from hardware emulation and actual hardware
correlate? If we test software
using hardware emulation and fault injection, how certain can we be that the
results from that test translate well into the hardware domain?
4.
Rajiv
Jain: The methods you presented seem to
exclusively deal with transient hardware faults, but how does one cope with
detecting and handling a permanent hardware fault that propagates through the
error checking?
5.
Shivsubramani Krishnamoorthy: Research in the field of fault injection has been focusing on the idea of processor independence. FERRARI requires the specification of the processor's instruction set. The authors of a fault injection tool, FITgrind, claim that as a disadvantage of FERRARI. Do you really think that is a big disadvantage in a practical sense? The instruction set for a particular class of applications is going to remain constant, so is that really an issue?
6.
Srividya Ramaswamy
7.
Bryan
Robbins: We discussed that protocol fault
injection may have some utility in the testing of component-based software.
Could you describe how we might generate a "Class Not Found" case
scenario via protocol fault injection?
8.
Praveen
Vaddadi: I understand that there are various advantages of hardware fault injection: for example, the heavy-ion radiation method can inject faults into VLSI circuits at locations that are impossible to reach by other means, and experiments can be run in real time, allowing a large number of fault-injection experiments to be performed. What are the potential disadvantages of using hardware fault injection techniques? One disadvantage I can think of is that some hardware fault injection methods, such as state mutation, require stopping and restarting the processor to inject a fault, so they are not always effective for measuring latencies in the physical system. Can you add more to this?
9.
Subodh Bakshi
10.
Samuel
Huang: We could think of a coverage output and other related statistics as a fingerprint of a system. Could we devise a way to generate such a fingerprint a priori for a ``correct'' implementation of the system, and then observe the effect that introducing faults via fault injection has on the fingerprint? Perhaps if we store this expected ground-truth fingerprint somehow, we could flag when modifications to a code base cause a radical change in program behavior.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: I agree that there are a lot of advantages of hardware fault injection, but what is the relation between hardware fault injection and software fault injection? Do you think they are complementary, or may one be replaced by the other?
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions posed by
students:
1.
Nathaniel
Crowell
2.
Amanda
Crowell: What are some of the unique
issues associated with testing databases as compared to testing other types of
software?
3.
Christopher
Hayden: We've seen a few papers thus far that treat database testing as unique. In each case, the cited reason is that the expected behavior of the database in response to an operation is very sensitive to the database state. However, such state sensitivity is not at all unique to databases; plugin-based architectures, which are in common use, share this feature, with the behavior of such an architecture depending on the active plugins at any given time. Considering this, can you draw a distinction between database testing and the testing of other state-sensitive software that justifies the treatment of database testing as a separate topic? Or is it merely a special case of a larger class of software testing?
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy: The authors have focused mainly on a stand-alone DB system application. In section 1 they list a set of questions that need to be addressed. These days, DBs are more often used in a distributed fashion over the web and networks. What additional questions/issues do you think need to be addressed to test a web application using distributed database systems?
6.
Srividya Ramaswamy
7.
Bryan
Robbins: The authors argue that DB
testing requires special consideration because "the input and output
spaces include the database states as well as the explicit input and output
parameters of the application." This seems to indicate that side-effects
on the DB state are more important in DB testing, as opposed to other forms of
event-driven testing (such as GUI testing) where only expected output is
important. If we did consider application state information as an "output"
in GUI testing, what are the potential tradeoffs?
8.
Praveen
Vaddadi: What potential problems do you see in performing regression tests for database applications? Various techniques for selecting a subset of tests for re-execution have been proposed, but they assume that the only factor that can change the result of a test case is the set of input values given to it, while all other influences on the behavior of the program remain constant for each re-execution of the tests. Clearly, this assumption is impractical for testing database applications, since in addition to the program state, the database state also needs to be taken into consideration. What other issues do you think will be problematic in applying naive test selection techniques to regression tests of database applications?
9.
Subodh Bakshi
10.
Samuel
Huang: One major issue seems to be how to decide the database's _state_, i.e., the set of tuples that populate a given schema. Strategies proposing synthetic data for this task seem to trade off between producing a state that is _common_ in practice and a state that helps to uncover faults; these two criteria often do not coincide. What types of strategies exist that focus on one criterion versus the other? Perhaps the realm of probabilistic databases could help in generating ``common'' states, and also support correlations between tuples, etc.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: The tool includes four components, and I think there is some overlap between the third and fourth components. For example, if we use "select", the output should be the selected items and the state of the tables should stay the same; in this case, components 3 and 4 are doing different things. But if we use "insert", component 3 should check whether the row was inserted successfully and component 4 should check the state after the insertion; in this case, components 3 and 4 are doing similar things, and it would be enough to use only component 4. It seems that the state can represent the output to some degree. So do you think that components 3 and 4 can be combined? If so, do you have any ideas for designing new components 3 and 4 more efficiently? (A minimal sketch of a combined output-and-state check appears after this list of questions.)
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
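Question 14 above asks whether the tool's third component (which checks the explicit output of a database operation) and fourth component (which checks the resulting database state) could be combined. The sketch below is not the tool described in the paper; it is a minimal JDBC illustration of checking both things in one test, assuming an in-memory H2 database is available on the classpath and using an invented accounts table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of a combined "output + state" oracle for a database test.
// Assumes an in-memory H2 database on the classpath; table and values are invented.
public class DbOracleSketch {
    public static void main(String[] args) throws SQLException {
        Connection c = DriverManager.getConnection("jdbc:h2:mem:testdb");
        Statement s = c.createStatement();
        s.execute("CREATE TABLE accounts(id INT PRIMARY KEY, balance INT)");

        // Operation under test: an INSERT.
        int updated = s.executeUpdate("INSERT INTO accounts VALUES (1, 100)");

        // Output check (component-3 style): the value returned by the operation.
        if (updated != 1) throw new AssertionError("expected exactly 1 row inserted");

        // State check (component-4 style): the database state after the operation.
        ResultSet rs = s.executeQuery("SELECT balance FROM accounts WHERE id = 1");
        if (!rs.next() || rs.getInt(1) != 100)
            throw new AssertionError("unexpected post-state for id = 1");

        System.out.println("both the output check and the state check passed");
        c.close();
    }
}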
Questions posed by
students:
1.
Nathaniel
Crowell: In describing the requirement
for "form layout" the authors cite the claim that "the ideal
sequence that is most commonly accepted is top left to bottom right".
Wouldn't this actually vary based on the intended set of users? For instance,
for users whose native language is written right to left, I'd expect that some
right to left format would be the appropriate sequence. Should usability
requirements instead be formed in the context of the intended users rather than
trying to form some general set of guides that should apply in all situations?
2.
Amanda
Crowell: The authors of this paper stated
that 6-8 test users are usually sufficient to identify most of the major
usability issues. Our previous usability testing paper stated that 5 was the "magic number". Based on your reading and
understanding of usability testing, do you have any idea what kind of factors
are leading to the difference in number of users?
3.
Christopher
Hayden: The authors of the paper cite the
phenomenon of psychological pollution, where users act differently due to the
knowledge that they are being monitored, as a major threat to validity in
usability testing. It seems that
this effect may be impossible to eliminate in any formal testing environment. Are you aware of any research, or do you
have any ideas, to address this threat in a way that still yields feedback of
sufficiently high quality?
4.
Rajiv
Jain: It seems there are certain
conventions when it comes to the aesthetics and layout of a GUI. Do you think
machine learning techniques could be used to automatically detect metrics for
these conventions and score a GUI based on its deviations from what is
“normally seen”? Could this automate usability testing further?
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy
7.
Bryan
Robbins: As we discussed in class, although look-and-feel plays a part, the user experience is often dominated by cognitive factors. Unfortunately, these are inherently hard to automate, but they could fit into a hybrid approach (like that of the HUI) which combines some user data with expectations. So what are some metrics that could represent cognitive aspects of the user experience, such as accuracy and response time? How could we encode the new expectations into expected action sequences?
8.
Praveen
Vaddadi: I agree that it is impossible to fully automate a usability study. However, user behavior can be mimicked by a rule-based program that also considers temporal aspects (as discussed in class). At this point, we need to ask whether there are reliable rules for user behavior on the web and whether we really want to remove the users from the user test. What aspects do you think can never be automated in a usability study?
9.
Subodh Bakshi
10.
Samuel
Huang: The HUI Analyzer's matching system looks similar to string matching algorithms. If it has not been done yet, could we leverage that body of work to give more interesting types of alignments, such as partial matches, where ``partial'' means approximate according to some criterion? Would these partial matches be useful in the context that HUI works over? (A minimal sketch of such an approximate match appears after this list of questions.)
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: I understand that in the HUIA testing, the recorder will keep track of what the user has done and record it in an AAS, and then compare the AAS with the EAS. But how can we get the EAS? Put another way, how can we know what exactly the user wants to do?
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
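Question 10 above asks whether string-matching techniques could give the HUI Analyzer partial or approximate matches between the expected action sequence (EAS) and the actual action sequence (AAS). The sketch below is not HUI Analyzer's own algorithm; it simply treats the two sequences as strings of action symbols and scores them with edit distance, counting a small distance as a partial match. The action names are invented.

import java.util.Arrays;
import java.util.List;

// Sketch: score how closely an actual action sequence (AAS) matches the expected
// one (EAS) with edit distance. Illustration only, not HUI Analyzer's algorithm.
public class SequenceMatchSketch {

    static int editDistance(List<String> expected, List<String> actual) {
        int n = expected.size(), m = actual.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = expected.get(i - 1).equals(actual.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        List<String> eas = Arrays.asList("openDialog", "typeName", "clickOK");              // invented actions
        List<String> aas = Arrays.asList("openDialog", "typeName", "typeEmail", "clickOK"); // one extra action
        int dist = editDistance(eas, aas);
        // One possible criterion: a distance of at most 1 counts as a partial match.
        System.out.println("edit distance = " + dist + ", partial match: " + (dist <= 1));
    }
}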
1.
[Nov. 19] Student presentation
by Ghanashyam Prabhakar. (sources: Hermes:
A Tool for Testing Mobile Device Applications, 2009 Australian Software
Engineering Conference, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05076634). Slides: Ghanashyam.pdf
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell: How is requirement #3
accomplished (Consume minimal resources on mobile devices)?
2.
Amanda
Crowell: For the 6 requirements for Hermes set out in the paper, are there any that are more important than others? In other words, if I had to choose between two different testing tools, neither of which met all the requirements, how would I know which one to pick?
Answer
Outline: The authors set quite high requirements at the outset, all of which I believe are important. But if I had to choose one requirement among the six, I feel the first one, i.e., "Automatically deploy applications, execute tests and generate reports", would be the most important, because this requirement, though very generically stated, encompasses the basic requirements for automated mobile application testing. Also, a tool that can address the problem of heterogeneity in mobile platforms would be considered better.
3.
Christopher
Hayden: The authors say something strange in passing in the abstract of the paper. Specifically, they state that Hermes is currently more expensive to employ than manual testing. This seems odd considering two facts. First, in the experiment presented, Hermes is superior to manual inspection in the verification of aesthetic characteristics of a GUI. This suggests that Hermes' proficiency should increase relative to manual testing when less obvious flaws are considered. Second, tests written in Hermes at some up-front cost can be executed and inspected quickly from that point forward, whereas tests inspected manually are always expensive. Do you feel that the authors' conclusion that Hermes is more expensive than manual testing is justified given the relatively simple experiment performed?
Answer
Outline: From the authors' statement in the abstract, the keyword I would pick is "currently". Employing Hermes is definitely more expensive at the moment because tests have to be written manually, since the GUI Visual Modeler has not been designed or implemented yet. So the up-front cost you mention is what is being taken into account when the authors state that Hermes is currently more expensive. Having said that, their relatively simple experiment could have been made more elaborate, and they intend to do that in their future work.
4.
Rajiv
Jain: What are the biggest differences between testing a mobile device and a normal PC? Are the lines getting blurrier with mobile operating systems?
Answer
Outline: With regard to mobile devices, the main differences compared to a PC are the environment and the heterogeneity of mobile device platforms. So when testing applications on mobile devices, the environment and testing across the heterogeneous mobile platforms would be the priorities.
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy
7.
Bryan
Robbins: On the PC subsystem side of the
Hermes architecture, do you think it would be feasible to use existing tools
for test scripting and execution? In other words, once we remove them from the
heterogeneous underlying devices, do we expect mobile applications to be all
that different from one another?
Answer
Outline: You make a very good point; yes, they could very easily use existing tools for test scripting on the PC subsystem side. However, for test execution the issue of device heterogeneity does come into play, and this can be handled in the way the mobile subsystem is designed for each device.
8.
Praveen
Vaddadi: The
paper discourages the use of device emulators for testing. However, if we are
testing with real devices, their limited processing power and storage does not
allow on-board diagnostic software to be loaded, so they lack instrumentation.
With real devices, we will not be able to record the protocols going back and
forth between the device and the application, and this will limit the ability
to isolate problems and make corrections. Do you see this as a potential
disadvantage?
Answer
Outline: When black-box testing mobile applications, the question is whether we really want to know what is going on behind the scenes. I feel it is not much of a disadvantage because we are not really concerned with what parameters are being passed or what protocol is being used to actually run that piece of the application. When it comes to actually debugging the software on the mobile device to figure out what went wrong, I guess it does handicap the developer a little. That is probably why the authors suggest, as future work, developing probes that allow device data to be extracted that otherwise cannot be accessed using the J2ME API.
9.
Subodh Bakshi
10.
Samuel
Huang: Configurations seem to play a large role in the challenges of programs passing (at least) the ``aesthetic'' type of tests. However, the ability to perform automated test validation on a set of tests of this type (even if we restrict ourselves to a narrower set than ``all'' aesthetic tests) seems appealing, almost as if we are working towards an objective function to be used in optimization. Given that the GUIs are often described in XML, could we push this further and develop a search-like algorithm that considers different combinations of configuration parameters and arrives at arguably ``good'' results (namely a nice-looking layout)?
Answer
Outline: You most definitely could, but the problem I see here would be obtaining the various combinations of `configurations'. For example, an application that passed all tests on a Nokia N97 might fail miserably on a newer Nokia N900, maybe because the N900 uses an advanced Linux-based operating system called Maemo, or maybe because of the way the application is rendered on the touch screen. The point is, it is quite difficult to model the various combinations, so I feel it is best to actually run the test on the test device rather than trying to model its behavior. But if someone were to actually try to build such an emulator, the solution you have presented definitely seems to be an optimal one.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening: (1) I
think Hermes is the automation of the manual testing for mobile device
application. So if we use the same test cases, I guess the coverage will not
change, but why the result of Hermes is much better than manual testing? Where
are the improvements from? (2) Hermes took 25 min for the testers to specify
the test cases, much longer than manual testing. So I think there must be some
way to make it easier, for example, through using GUI to generate the xml for
test cases automatically. So do you have any sense for how to do this?
Answer
Outline: (1) I guess the measure of comparison is the time taken for testing across various mobile devices; in that sense they say Hermes is better than manual testing. (2) Yes, I too feel that the GUI Visual Modeler could be built on existing tools, for example GUITAR.
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
1.
[Nov. 24] Student presentation
by Hu, Yuening. (sources: Clustering test cases to achieve effective and
scalable prioritization incorporating expert knowledge, Shin Yoo; Mark Harman; Paolo Tonella; Angelo
Susi, Proceedings of the eighteenth international symposium on Software testing
and analysis, 2009, Testing #2, Pages 201-212, DOI=http://portal.acm.org/citation.cfm?id=1572272.1572296&coll=GUIDE&dl=GUIDE&CFID=61019873&CFTOKEN=44062317). Slides: Yuening.pdf
Questions posed by
students:
1.
Nathaniel
Crowell: How is the Analytic Hierarchy
Process used in the clustering framework?
2.
Amanda
Crowell: Briefly describe the steps of
the clustering framework presented in the paper (Clustering, Intra
prioritization, inter prioritization, generate best order, evaluation).
3.
Christopher
Hayden: In the paper, the clustering is
based on a code coverage metric (tests covering similar code are grouped
together) and the intra-cluster ordering is also based on a code coverage
metric. The only human input is
the inter-cluster ordering. It
seems that the real effect of arranging the tests in this way is to maximize
the distance of consecutive tests, thereby increasing the amount of tested code
quickly, and that the effect of ordering the clusters (i.e., human input)
should be minimal. Suppose we
greedily order the tests in such a way that we always execute the test that
covers the most code that has not been covered by a previously executed
test. Would you expect this
arrangement of tests to perform similarly to the clustering technique?
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy
7.
Bryan
Robbins:
8.
Praveen
Vaddadi: The combination of AHP and ICP has been empirically evaluated, and the results showed that this combination of techniques can be much more effective than coverage-based prioritisation. One surprising finding was that an error rate higher than 50%, i.e., the human tester making the wrong comparison half the time, did not prevent this technique from achieving higher APFD than coverage-based prioritisation. How do you explain this unexpected finding? Are there factors in addition to the effect of clustering influencing this result?
9.
Subodh Bakshi
10.
Samuel
Huang: The authors state that the ideal similarity criterion that could be used for clustering is a ``similarity between faults detected by test cases, however this information is inherently unavailable before the testing is finished.'' They instead rely on a Hamming distance measure between the statement coverage of pairs of test cases (a minimal sketch of this measure appears after this list of questions). Is this a good measure? What other measures or information could we incorporate into our model to potentially improve the clustering?
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
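Question 10 above refers to the Hamming distance between the statement-coverage vectors of two test cases, which the paper uses as its clustering measure. A minimal sketch of that measure is below; the coverage data is invented, and a real implementation would read the vectors from a coverage tool rather than hard-code them.

import java.util.BitSet;

// Sketch: Hamming distance between the statement-coverage vectors of two test cases.
// Bit i of a vector is set iff the test case executed statement i; data is invented.
public class CoverageDistanceSketch {

    static int hammingDistance(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);                 // bits where the two coverage vectors disagree
        return diff.cardinality();
    }

    public static void main(String[] args) {
        BitSet t1 = new BitSet(); t1.set(0); t1.set(1); t1.set(4);   // test 1 covers statements 0, 1, 4
        BitSet t2 = new BitSet(); t2.set(0); t2.set(2); t2.set(4);   // test 2 covers statements 0, 2, 4
        System.out.println("Hamming distance = " + hammingDistance(t1, t2));   // prints 2
    }
}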
Questions posed by
students: THIS SPEAKER IS NOT ACCEPTING NEW
QUESTIONS
1.
Nathaniel
Crowell
2.
Amanda
Crowell:
3.
Christopher
Hayden
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy
6.
Srividya Ramaswamy
7.
Bryan
Robbins:
8.
Praveen
Vaddadi: With
regard to value spectra and function execution comparison, how does the
comparison of values of pointer-type variables work?
Answer
Outline: I will use the terms references
instead of pointers. In value
spectra analysis, the values of the references themselves are not checked. It would make no sense to do so as reference
values vary due to memory allocation and are unlikely to be the same across
multiple executions. However, the
values of the referents are compared.
Thus, when analyzing function entry/exit states, two states are
equivalent if and only if all referents in the scope described in the paper are
equivalent. In the case of dynamic data structures, in which one referent contains a reference to another referent, the analysis recurses to a depth given by a runtime parameter. Two states are equivalent if and only if all referents encountered in the recursion are equivalent. (A minimal sketch of such a bounded-depth comparison appears after this list of questions.)
9.
Subodh Bakshi
10.
Samuel
Huang: The sensitivity of value spectra to code refactoring seems somewhat problematic. In particular, I would think that one would not want the (arguably common) refactoring of, for example, one function into several smaller, more modular functions to be considered a big change in the program flow. However, it seems like value spectra may consider it as such, in terms of a distribution adjustment. Is there a way to post-process or otherwise further summarize existing value spectra that would make them more agnostic as to how specifically the program flow occurs, but still detect flaws in the general flow itself?
Answer
Outline: Value spectra analysis closely
resembles forms of dynamic data flow analysis. A lot of research has been done on dynamic data flow
analysis and many useful heuristics have been developed. While there has been no research (as
far as I can tell) on applying such heuristics to value spectra analysis, I suspect
that they, or something similar to them, would be useful in improving the
detection capability of value spectra by reducing the false positive rate. In general, questions arising in data flow
analyses are undecidable. I would not be surprised if determining which changes within
a value spectrum are relevant is undecidable in the
general case. All of that said, there are of course more practical things that can be
done, such as selectively instrumenting code to
include only those functions that wrap a call to the modified function or to
exclude the modified functions.
However, selective instrumentation is not applicable in the general case.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
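The answer to question 8 above describes comparing referents rather than reference values, recursing through dynamic data structures to a depth given by a runtime parameter. The sketch below illustrates only that bounded-depth equivalence idea on an invented linked Node class; actual value-spectra tools compare instrumented function entry and exit states rather than hand-built structures.

// Sketch of a bounded-depth "referent equivalence" check in the spirit of the value
// spectra discussion above. The Node class and the depth limit are invented.
public class ReferentCompareSketch {

    static final class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Two states are considered equivalent if all referents reachable within maxDepth
    // steps hold equal values; the reference values themselves are never compared.
    static boolean equivalent(Node a, Node b, int maxDepth) {
        if (maxDepth == 0) return true;          // recursion bounded by a runtime parameter
        if (a == null || b == null) return a == b;
        if (a.value != b.value) return false;
        return equivalent(a.next, b.next, maxDepth - 1);
    }

    public static void main(String[] args) {
        Node x = new Node(1, new Node(2, new Node(3, null)));
        Node y = new Node(1, new Node(2, new Node(9, null)));
        System.out.println(equivalent(x, y, 2));   // true: the difference lies beyond depth 2
        System.out.println(equivalent(x, y, 3));   // false: depth 3 reaches the differing referent
    }
}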
Questions posed by
students:
1.
Nathaniel
Crowell: What are the differences between
exhaustive-cover configurations and the direct dependency-cover configuration
described in the paper?
2.
Amanda
Crowell: Briefly describe the components
of the Annotated Component Dependency Model (ACDM), the Component Dependency
Graph (CDG) and Annotations. What
relationships does an ACDM show?
3.
Christopher
Hayden: Focusing on direct dependencies seems like a natural approach. In a configuration where A depends on B and B depends on C, it should not matter to A which version of C is present, provided that B builds and executes correctly. Given that, if I'm a developer for project A, what interest do I have in looking at any dependencies at a distance greater than 1 from A in the dependency graph? Is there an example of a case in which a dependency at distance 2 from the SUT causes the SUT to fail, but does not cause the dependency at distance 1 to fail?
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy: There are a few commercial compatibility testing tools, which I learned about from the internet, that claim to have many features supporting a wide range of hardware and software. For example, Swatlab (http://www.swatlab.com/qa_testing/compatibility.html). How do they compare with Rachet?
6.
Srividya Ramaswamy
7.
Bryan
Robbins:
8.
Praveen
Vaddadi: How do
the developers build the initial configuration space for the SUT? What about
the components that indirectly influence the software under test?
9.
Subodh Bakshi
10.
Samuel
Huang: It would be satisfying if we only needed to inspect the configurations of direct dependencies, as the paper describes. However, it might be informative, in a somewhat crude way, to investigate whether or not the SUT (namely the root node of the CDG) implicitly uses information that is not supposed to be exposed by the dependency APIs (perhaps we assume that a function behaves in a certain way, or uses a certain data structure as a backend, when we are not explicitly given this in the utilized API). One way to probe for such a violation is to investigate the effects of varying the configurations of the direct dependencies of the dependency itself, namely the second-level dependencies of the SUT. Could we detect such a violation of an API by observing the compilation (or perhaps some code execution?) of the SUT under such secondary dependency considerations?
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
Questions posed by
students:
1.
Nathaniel
Crowell: What were the four mutant operators
used by the researchers? Can you think of any additional operators that would
be useful?
2.
Amanda
Crowell: The researchers for this paper
sought to determine if mutation testing is an appropriate method for testing
experiments. But what are the
problems associated with finding test programs that mutation testing is meant
to handle?
3.
Christopher
Hayden: In the paper, the authors filter out mutants that produce no detectable change from the oracle. Later they argue, based on results from the Space program, that the mutants they introduce are detected at a rate similar to that of real faults, and that faults introduced by hand are harder to detect than real faults. Are these results valid in light of the filtering? What use are they in actual software development, where we don't have a pristine program to compare against? We saw in an earlier presentation that faults do not always propagate to output. Isn't it possible that mutants that did not yield detectable changes from the oracle more closely resemble the faults introduced by hand, and that by filtering them out the authors bias their data toward easily detectable mutations?
4.
Rajiv
Jain:
5.
Shivsubramani Krishnamoorthy: I understand that an equivalent mutant is one that is syntactically different from the original program but behaviorally the same (a small example appears after this list of questions). Thus, detecting and killing such mutants is very difficult. Do you agree that the process of filtering out equivalent mutants is an overhead, or is the extra step worth taking?
6.
Srividya Ramaswamy
7.
Bryan
Robbins:
8.
Praveen
Vaddadi: I
understand from the paper that while there is probably some minimum number and
type of mutation operators that are required to achieve a benefit from mutation
analysis, no existing mutation operator set is exhaustive. Therefore, if the
technique is to be used by software testers, then mutation speed and efficiency
should be chosen over the number or type of mutation operators. Is this
conclusion valid?
9.
Subodh Bakshi
10.
Samuel
Huang: Before analyzing the results of an
experiment using mutation testing, the decision of mutation operators seems of
prime importance. Given that there
are existing tools which have a large ``grab bag'' of different classes of
mutation operators, could we somehow correlate these operators with the ability
to reproduce classes of *faults*?
Formally this may take the form of clustering sets of mutation operators
with the knowledge of their fault-type causing tendencies. The relational clustering literature,
in particular applied to NLP type problems, does work on this, which is
sometimes referred to as biclustering or
co-clustering. Such a clustering
may allow us to better select _candidate_ mutation operators to satisfy some
notion of fault-type coverage, or perhaps to better approximate real-life faults
that are encountered.
11.
Ghanashyam Prabhakar
12.
Jonathan
Turpie
13.
Cho,
Hyoungtae:
14.
Hu,
Yuening
15.
Liu,
Ran:
16.
Ng
Zeng, Sonia Sauwua
17.
Sidhu, Tandeep
18.
Guerra
Gomez, John Alexis:
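Question 5 above concerns equivalent mutants, i.e., mutants that are syntactically different from the original program but behaviorally identical. The Java example below is invented (it is not taken from the paper or from any particular mutation tool) and contrasts an equivalent mutant with a killable one.

// Invented example contrasting an equivalent mutant with a killable mutant.
public class MutantExampleSketch {

    // Original: sums the elements of the array.
    static int sum(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length; i++)
            total += a[i];
        return total;
    }

    // Equivalent mutant: "i < a.length" replaced by "i != a.length". Because i starts
    // at 0 and increases by exactly 1, it always reaches a.length exactly, so this
    // mutant behaves identically on every input and no test case can kill it.
    static int sumEquivalentMutant(int[] a) {
        int total = 0;
        for (int i = 0; i != a.length; i++)
            total += a[i];
        return total;
    }

    // Killable mutant for contrast: "a.length" replaced by "a.length - 1" skips the
    // last element, so any test with a non-empty array kills it.
    static int sumKillableMutant(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length - 1; i++)
            total += a[i];
        return total;
    }

    public static void main(String[] args) {
        int[] data = {3, 7, 2};
        System.out.println(sum(data) + " " + sumEquivalentMutant(data) + " "
                + sumKillableMutant(data));   // prints "12 12 10"
    }
}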
Student
Presentations: All students are strongly encouraged
to present a topic related to software testing. You must prepare slides and
select a date for your presentation. Group presentations are encouraged if the
selected topic is broad enough. Topics include, but are not limited to:
o
Combining and Comparing Testing
Techniques
o
Defect and Failure Estimation
and Analysis
o
Testing Embedded Software
o
Fault Injection
o
Load Testing
o
Testing for Security
o
Software Architectures and
Testing
o
Test Case Prioritization
o
Testing Concurrent Programs
o
Testing Database Applications
o
Testing Distributed Systems
o
Testing Evolving Software
o
Testing Interactive Systems
o
Testing Object-Oriented Software
o
Testing Spreadsheets
o
Testing vs. Other Quality
Assurance Techniques
o
Usability Testing
o
Web Testing
[NOTE:
This is the standard project for this class. If you don’t want to do this
project, you are welcome to propose your own. I have some project ideas too;
feel free to stop by and discuss. We will have to agree on a project together
with a grading policy.]
Summary:
GUITAR (http://guitar.sourceforge.net/)
is a system consisting of numerous applications. In this class project, we will
test four of GUITAR’s applications (JFCGUIRipper, GUIStructure2EFG, TestCaseGenerator,
JFCGUIReplayer) using conventional testing
techniques. We will also set up a continuous testing process for these applications using the Hudson (http://hudson.dev.java.net/) software. The black-box and white-box test cases that you will design for this process will consist of (1) large system tests that cover most of the functions of an application, (2) small system tests, each targeting a specific part of the application, which together cover most of the functions of the application, and (3) unit tests.
The project will consist of four phases. Phases 1 and 4 are
to be done individually; phases 2 and 3 are to be done in teams, each
consisting of 3-4 students.
Team: JFCGUIRipper
Members: Nathaniel Crowell, Amanda Crowell, Christopher Hayden, Rajiv Jain
Phase 1 Demo Date/Time: (09/10/09)/11:30AM, (09/10/09)/11:00AM, (09/11/09)/1:00PM, (09/10/09)/1:00PM
Phase 2 Demo Date/Time (one hour each): Thu. 10/15: 2:00pm, Rajiv Jain
Phase 3 Demo Date/Time (one hour each):
Phase 4: GUITARModel, JFCModel, GUIRipper, JFCRipper

Team: GUIStructure2EFG
Members: Shivsubramani Krishnamoorthy, Srividya Ramaswamy, Bryan Robbins, Praveen Vaddadi
Phase 1 Demo Date/Time: (09/10/09)/12:00PM, (09/10/09)/2:30PM, (09/10/09)/12:00PM, (09/10/09)/12:30PM
Phase 2 Demo Date/Time (one hour each): Thu. 10/15: 3:00pm, Praveen Vaddadi
Phase 3 Demo Date/Time (one hour each):
Phase 4: GUIStructure2Graph-Plugins/GUIStructure2EFG, (GUIReplayer + JFCReplayer), (GUIStructure2GraphConvert + GUIStructure2Graph-Plugins/GUIStructure2EFG), GUIStructure2GraphConvert

Team: TestCaseGenerator
Members: Subodh Bakshi, Samuel Huang, Ghanashyam Prabhakar, Jonathan Turpie
Phase 1 Demo Date/Time: (09/11/09)/1:30PM, (09/11/09)/3:00PM, (09/11/09)/3:30PM, (09/11/09)/10:00AM
Phase 2 Demo Date/Time (one hour each): Wed. 10/14: 3:00pm, Jonathan Turpie
Phase 3 Demo Date/Time (one hour each):
Phase 4: (TestCaseGenerator + RandomTestCase + SequenceLengthCoverage), (GUIRipper + JFCRipper), TestCaseGenerator-Plugins/RandomTestCase

Team: JFCGUIReplayer
Members: Cho, Hyoungtae; Hu, Yuening; Liu, Ran; Ng Zeng, Sonia Sauwua; Sidhu, Tandeep
Phase 1 Demo Date/Time: (09/11/09)/10:00AM, (09/11/09)/12:00PM, (09/11/09)/12:30PM, (09/10/09)/2:30PM, (09/11/09)/9:30AM
Phase 2 Demo Date/Time (one hour each): Wed. 10/14: 9:00am, Ran Liu
Phase 3 Demo Date/Time (one hour each):
Phase 4: (GUITARModel + JFCModel), TestCaseGenerator, GUIReplayer, JFCReplayer, TestCaseGenerator-Plugins/SequenceLengthCoverage

Team: Independent
Members: Guerra Gomez, John Alexis
Phase 1 Demo Date/Time: (09/15/09)/11:00AM
Phase 2 Demo Date/Time (one hour each):
Phase 3 Demo Date/Time (one hour each):
Phase 4: -NA-
Phase
1
Goal:
Downloading and running all the applications.
Procedure:
In this phase, you need to demonstrate that you are able to run all four
applications (please see your course instructor for the inputs to these
applications). The source code for all these applications resides in a
Subversion repository that can be viewed at http://guitar.svn.sourceforge.net/viewvc/guitar/.
Please refer to https://sourceforge.net/projects/guitar/develop
for additional help with Subversion. Each application (let's call it foo for fun) resides in the folder foo.tool/ in the repository. Check out this folder for each application and read its README.txt file. You will find more information about the modules that foo uses for building and execution. You will also find instructions on how to build and run foo.
Deliverables:
There are no deliverables for this phase. Points will be awarded for demos.
Grading
session: During the grading session, you will
have access to a new machine (Windows or Linux) that is connected to the
Internet. You will have to download, build, and execute all four applications
and run them on all the inputs. You will also need to install Ant, Subversion,
Java, and any other tools needed to run the applications. Each application is
worth 10 points.
Points:
40
Due
on or before Sep. 14 by 2:00pm. [Late submission policy – you lose 20% (of the maximum
points) per day]
Phase
2
Goal:
Setting up the continuous building and regression testing process using large
system tests.
Procedure:
This phase consists of working with Hudson to automate what you did in Phase 1.
In addition, you will evaluate the coverage of the inputs, document them as
test cases, and improve the inputs if needed. Form teams of 3-4 students and
take ownership of one of the applications.
You will need to write an Ant script
that automatically builds your application. Assume that the Ant script will be
run on a new Windows or Linux machine that has only Ant and Java installed on
it. Submit the Ant script for addition to the foo.tool/ folder in our Subversion repository. Configure Hudson to run this
Ant script periodically. Then use the input files from Phase 1 to develop large
system tests that will be stored in the foo.tool/tests/large-system-tests/
folder in the repository. Configure Hudson to run these tests periodically and
show the results of tests passing/failing. Configure Hudson to generate Emma
code coverage reports for these tests and make them available for viewing over
the web. Ensure that you get 100% statement coverage from these tests. Add more
large system tests to increase coverage if needed. Or provide evidence that
certain parts of the code cannot be executed using any system tests. Then develop test oracles for reference testing against the application’s previous build. Finally, document the test cases using the wiki tool – describe the
inputs and expected outputs.
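As one way to approach the reference-testing oracle mentioned above, the JUnit sketch below compares the output file produced by the current build against a saved copy of the output from a previous, trusted build. This is only a sketch of the idea: the file paths, the notion of a checked-in golden file, and the test name are all invented and are not part of the project handout.

import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.junit.Test;

// Hypothetical reference-testing oracle: compare the current build's output against a
// "golden" output saved from a previous, trusted build. Paths are invented placeholders.
public class LargeSystemReferenceTest {

    @Test
    public void outputMatchesPreviousBuild() throws Exception {
        List<String> expected =
                Files.readAllLines(Paths.get("tests/large-system-tests/golden/output.xml"));
        List<String> actual =
                Files.readAllLines(Paths.get("target/output/output.xml"));
        // Check the length first for a readable failure message, then the full content.
        assertEquals("output length differs from previous build", expected.size(), actual.size());
        assertEquals("output differs from previous build", expected, actual);
    }
}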
Deliverables:
Your instructor will view the output of your tests, coverage reports, etc.
online on Hudson. The documentation should be submitted via the wiki pages. You
are advised to meet your instructor regularly (e.g., twice a week) to show
progress for each milestone (e.g., Team created, application selected, Ant
script done, Hudson configured to run Ant script, first system test automated,
wiki page started, etc.).
Grading
session: During the grading session, you will
need to show everything up and running on our Hudson server (30 points),
including the test and coverage reports. Your instructor will seed up to 10
artificial faults, one at a time, in the application under test. For each
individual seeded fault, your test cases will be re-run via Hudson to see if
they detect the fault. 10 points will be awarded for each fault detected by
your test cases, for a maximum of 50 points. The wiki documentation of the test
cases is worth 30 points.
Points:
110
Due
on or before Oct. 5 by 2:00pm. [Late submission policy – the entire team will lose 20% (of
the maximum points) per day]
Phase
3
Goal:
Designing small, targeted system tests and integrating them into the continuous
testing process.
Procedure:
This phase requires the development of new, carefully crafted inputs for which the output can be predicted and created by hand. These new inputs will be very different from the ones used in Phases 1 and 2 – you will see that the output from the previous inputs was hundreds of MB, which was impossible to verify manually. The new inputs developed in this phase will be tiny Java Swing
programs that exercise select parts of your application. For each test, you
will manually develop test oracles that check whether the application performed
as expected.
Submit these tests so that they can be
stored in the foo.tool/tests/small-system-tests/ folder in the
repository. Configure Hudson to run these tests periodically and show the
results of tests passing/failing. Configure Hudson to generate Emma code
coverage reports for these tests and make them available for viewing over the
web. Ensure that you get 100% statement coverage from these tests. Add more
small system tests to increase coverage if needed. Or provide evidence that
certain parts of the code cannot be executed using any system tests. Finally,
document the test cases using the GUITAR wiki – describe the inputs and
expected outputs (this document is much more detailed than the one developed in
Phase 2).
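For concreteness, the sketch below shows a tiny Swing program of the kind this phase calls for: one frame, one text field, and one button whose only effect is to clear the field, so the expected behavior and GUI structure can be predicted and checked by hand. The class name, window title, and widget labels are invented; any equally small program targeting the part of the application you own would serve.

import java.awt.BorderLayout;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JTextField;
import javax.swing.SwingUtilities;

// A tiny, hand-checkable Swing input: one frame, one text field, and one button
// whose only effect is to clear the field. All names here are invented.
public class TinySwingInput {

    static JFrame buildFrame() {
        JFrame frame = new JFrame("TinyInput");
        final JTextField field = new JTextField(10);
        JButton clear = new JButton("Clear");
        clear.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                field.setText("");            // the single behavior a test oracle must check
            }
        });
        frame.getContentPane().add(field, BorderLayout.CENTER);
        frame.getContentPane().add(clear, BorderLayout.SOUTH);
        frame.pack();
        return frame;
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() { buildFrame().setVisible(true); }
        });
    }
}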
Deliverables:
Your instructor will view the output of your tests, coverage reports, etc.
online on Hudson. The documentation should be submitted via the wiki pages. You
are advised to meet your instructor regularly (e.g., twice a week) to show
progress for each milestone.
Grading
session: During the grading session, you will
need to show everything up and running on our Hudson server (30 points),
including the test and coverage reports. Your instructor will then seed up to
15 artificial faults, one at a time, in the application under test. For each
individual seeded fault, your test cases will be re-run via Hudson to see if
they detect the fault. 10 points will be awarded for each fault detected by
your test cases, for a maximum of 100 points. The wiki documentation of the
test cases is worth 70 points.
Points:
200
Due
on or before Nov. 9 by 2:00pm. [Late submission policy – the entire team will lose 20% (of
the maximum points) per day]
Phase
4
Goal:
Designing and deploying unit tests.
Procedure:
So far, all your tests have been system tests. In this phase, you will develop
unit tests (e.g., using JUnit) for the modules used
by the tools. Remember that this phase is done individually, so select a module
quickly. For each test, you will develop inputs as well as test oracles (using assert statements in the test case)
that check whether the unit performed as expected on the inputs.
Submit these tests so that they can be
stored in the tests/ folder of the
relevant module in the repository. Configure Hudson to run these tests
periodically. Configure Hudson to generate Emma code coverage reports for these
tests and make them available for viewing over the web. Ensure that you get
100% statement AND BRANCH coverage (use Cobertura
http://cobertura.sourceforge.net/)
from these tests. Or provide evidence that certain parts of the code cannot be
executed using any unit tests. Finally, embed Javadoc
comments in the JUnit tests so that Javadoc can be generated automatically for them.
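A minimal example of the kind of JUnit test this phase asks for is sketched below, with assert statements serving as the oracle and a Javadoc comment so that documentation can be generated for the test. The EventIdGenerator class is an invented stand-in, not a real GUITAR module; substitute a class from the module you actually selected.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Sketch of a Phase 4 unit test. "EventIdGenerator" is an invented placeholder, not a
// real GUITAR class; replace it with a unit from the module you own.
public class EventIdGeneratorTest {

    /** Invented stand-in for a small unit inside one of the modules. */
    static class EventIdGenerator {
        private int next = 0;
        /** Returns identifiers of the form "e0", "e1", ... in order. */
        String nextId() { return "e" + (next++); }
    }

    /**
     * Oracle: identifiers must start at "e0" and be generated sequentially.
     * Javadoc comments like this one allow documentation to be generated for the tests.
     */
    @Test
    public void generatesSequentialIds() {
        EventIdGenerator gen = new EventIdGenerator();
        assertEquals("e0", gen.nextId());
        assertEquals("e1", gen.nextId());
        assertTrue(gen.nextId().startsWith("e"));
    }
}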
Deliverables:
Your instructor will view the output of your tests, coverage reports, etc.
online on Hudson. The documentation should be submitted via the Javadoc pages. You are advised to meet your instructor
regularly (e.g., twice a week) to show progress for each JUnit
test that you develop.
Grading
session: During the grading session, you will
need to show everything up and running on our Hudson server (30 points), including
the test and coverage reports. Your instructor will then seed up to 10
artificial faults, one at a time, in the module under test. For each individual
seeded fault, your test cases will be re-run via Hudson to see if they detect
the fault. 10 points will be awarded for each fault detected by your test
cases, for a maximum of 50 points. The Javadoc
documentation of the test cases is worth 70 points.
Points:
150
Due
on or before Dec. 1 by 2:00pm. [Late submission policy – you lose 20% (of the maximum points)
per day]