Several recent studies have reported the phenomenon of ``software aging'',
in which the state of a software system degrades with time. Aging may
eventually lead to performance degradation, crash/hang failures, or both.
``Software rejuvenation'' is a proactive technique aimed at preventing
unexpected or unplanned outages due to aging. The basic idea
is to stop the running software, clean its internal state and restart it.
In this paper, we discuss software rejuvenation as applied to cluster
systems. This is both an innovative and an efficient way to improve
cluster system availability and productivity. Using Stochastic Reward
Nets (SRNs), we model and analyze cluster systems that employ software
rejuvenation. For our proposed time-based rejuvenation policy, we
determine the optimal rejuvenation interval based on system availability
and cost. We also introduce a new rejuvenation policy based on prediction
and show that it can dramatically increase system availability and reduce
downtime cost. These models are very general and can capture a multitude
of cluster system characteristics, failure behavior and performability
measures, which we are just beginning to explore. We then briefly describe
an implementation of a software rejuvenation system that performs periodic
and predictive rejuvenation, and show some empirical data from systems
that exhibit aging.